tlocke / pg8000

A Pure-Python PostgreSQL Driver
BSD 3-Clause "New" or "Revised" License

Not working with geopandas #75

Closed Jesse-jApps closed 3 years ago

Jesse-jApps commented 3 years ago

When trying to use pg8000 with geopandas I get the following error.

/opt/conda/lib/python3.7/site-packages/geopandas/io/sql.py in _psql_insert_copy(tbl, conn, keys, data_iter)
    306 
    307     dbapi_conn = conn.connection
--> 308     with dbapi_conn.cursor() as cur:
    309         sql = 'COPY "{}"."{}" ({}) FROM STDIN WITH CSV'.format(
    310             tbl.table.schema, tbl.table.name, columns

AttributeError: __enter__

At the moment the Cursor class has no __enter__ method. It would be nice to add one, e.g.:

def __enter__(self):
    return self

def __exit__(self, exc_type, exc_value, traceback):
    self.close()

def copy_expert(self, sql, file):
    import io
    stream_in = io.BytesIO(file.read().encode('utf8'))
    stream_in.seek(0)
    self.execute(sql, stream=stream_in)

tlocke commented 3 years ago

Hi @Jesse-jApps, I couldn't reproduce that behaviour. Here's my test using SQLAlchemy==1.4.20 and pg8000==1.19.5:

from sqlalchemy import create_engine

engine = create_engine("postgresql+pg8000://postgres:cpsnow@localhost/test")
connection = engine.connect()
db_con = connection.connection
with db_con.cursor() as cur:
    cur.execute("select typname from pg_type")
    for row in cur.fetchall():
        print(row[0])

Are you able to provide a minimum reproducible example of it not working?

Also, IIRC the DBAPI specification doesn't include context management methods, so I'd argue that geopandas shouldn't be relying on anything outside the specification.
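For what it's worth, the standard library already offers a spec-compliant workaround: `contextlib.closing` wraps any object with a `close()` method in a context manager. A minimal sketch, using a stand-in cursor class (`DemoCursor` is hypothetical, purely for illustration) since a real connection isn't available here:

```python
from contextlib import closing

# Stand-in for a DBAPI cursor: all closing() needs is a close() method,
# which a real pg8000.dbapi cursor provides.
class DemoCursor:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

cur = DemoCursor()
with closing(cur) as c:
    assert c is cur  # closing() yields the wrapped object unchanged
print(cur.closed)  # True
```

So code that can't rely on `__enter__`/`__exit__` being present can write `with closing(dbapi_conn.cursor()) as cur: ...` instead.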

Jesse-jApps commented 3 years ago

Hi @tlocke, sorry for the sparse info. I will try to provide more details. I am using the alpha version of google.cloud.sql.connector to create a connection from within a jupyterlab notebook in the cloud. To make it work, I use

import pg8000
import sqlalchemy
from google.cloud.sql.connector import connector

def getconn() -> pg8000.dbapi.Connection:
    conn: pg8000.dbapi.Connection = connector.connect(
        'host', 
        "pg8000",
        user='user',
        password='password',
        db='database'
    )
    return conn

engine = sqlalchemy.create_engine(
    "postgresql+pg8000://",
    creator=getconn,
)

Therefore I will get the dbapi cursor. I haven't tried the legacy connection yet, but since it will be deprecated soon, I guess it should be the dbapi anyway. I am quite surprised that the DBAPI 2.0 specification doesn't define context-management methods. Since there is a close() method, that seemed like a logical step.

But in that case you are right that the changes should be made in geopandas, which, by the way, expects the cursor to have a copy_expert() method in order to do something like:

gdf = gpd.read_file('gs://bucket/file.shp')
gdf.to_postgis('table_name', engine)

Additionally, while geopandas uses a StringIO buffer to execute the COPY statement, the pg8000 dbapi cursor expects a BytesIO buffer. I am not sure which is the correct way to do it.

The docstring of execute(...) says: "For a COPY FROM the parameter must be a readable file-like object, and for COPY TO it must be writable". At the least, a note (and perhaps an assertion) should be added saying that the file-like object must yield bytes.
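In the meantime, a text buffer can be adapted for the bytes-expecting cursor with a small wrapper. A sketch, where `to_bytes_stream` is a hypothetical helper (not part of pg8000 or geopandas):

```python
import io

def to_bytes_stream(text_stream, encoding="utf-8"):
    """Read a text (str) stream and return a BytesIO suitable for COPY FROM."""
    return io.BytesIO(text_stream.read().encode(encoding))

# geopandas builds a StringIO of CSV rows; wrap it before handing it to pg8000
csv_buf = io.StringIO("1,foo\n2,bar\n")
byte_buf = to_bytes_stream(csv_buf)
print(byte_buf.read())  # b'1,foo\n2,bar\n'
```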

tlocke commented 3 years ago

Therefore I will get the dbapi cursor.

That makes sense. The dbapi cursor strictly follows the DBAPI spec and so doesn't have a context manager.

I haven't tried using the legacy connection yet, but since it will be deprecated soon, i guess it should be the dbapi anyway.

Yes, the legacy API will be deprecated at some point.

I am quite surprised that the DBAPI 2.0 specification doesn't define context-management methods. Since there is a close() method, that seemed like a logical step.

I think the reason is that the DBAPI 2 spec is over 20 years old, and maybe Python didn't even have context managers then? I've got a vague plan to write a rough DBAPI 3 spec to see what people think. I've never written a spec before, so maybe I'd be biting off more than I can chew :-)

Additionally, while geopandas uses a StringIO buffer to execute the COPY statement, the pg8000 dbapi cursor expects a BytesIO buffer. I am not sure which is the correct way to do it.

According to the PostgreSQL docs on COPY, it uses a byte stream, encoded by default with the CLIENT_ENCODING PostgreSQL setting; COPY's ENCODING parameter lets you override that. You might ask: why doesn't pg8000 handle all that behind the scenes and accept a StringIO buffer? My answer is that COPY also supports a binary format, so pg8000 has to work for that too. That said, we could do something a bit more friendly here, so I've added an issue for it: https://github.com/tlocke/pg8000/issues/76.

tlocke commented 3 years ago

So with the latest pg8000 release 1.20.0 you should be able to use a StringIO buffer when doing a COPY. You can still use a BytesIO if you like though.

jimdigriz commented 3 years ago

@Jesse-jApps you can monkey patch your application with something like the following:

class Cursor(pg8000.dbapi.Cursor):
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

pg8000.dbapi.Cursor = Cursor
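The pattern can be demonstrated without a database by substituting a stub for the base class (`StubCursor` below is hypothetical; in the real patch `pg8000.dbapi.Cursor` plays its role):

```python
# Stub standing in for pg8000.dbapi.Cursor: it only needs a close() method.
class StubCursor:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

# Same subclass as the monkey-patch above, just over the stub base.
class Cursor(StubCursor):
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

with Cursor() as cur:
    pass  # the cursor is closed automatically on exit
print(cur.closed)  # True
```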

@tlocke looks like from https://github.com/tlocke/pg8000/issues/75#issuecomment-870688951 you would not be interested in a PR to add this and plan to stick with DB-API 'purity'?

tlocke commented 3 years ago

Hi @jimdigriz, pg8000 has two APIs, pg8000.native and pg8000.dbapi. The idea is that pg8000.dbapi strictly follows the DBAPI 2 specification, and the aim of pg8000.native is to provide a convenient API with all the mod cons (such as a context manager for the connection object).

tlocke commented 3 years ago

I'll close this for now, but please do re-open if needed.