pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Cannot write to Azure Postgre SQL DB #7757

Open mavestergaard opened 1 year ago

mavestergaard commented 1 year ago

Polars version checks

Issue description

I am trying to write from Polars to an Azure PostgreSQL database.

I have tried a few different approaches and run into problems each time.

1) With engine="sqlalchemy", I get the error below --> seems to be a pandas-related issue

error:
```
df.write_database(
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/polars/dataframe/frame.py", line 2961, in write_database
    self.to_pandas(use_pyarrow_extension_array=True).to_sql(
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/pandas/core/generic.py", line 2987, in to_sql
    return sql.to_sql(
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/pandas/io/sql.py", line 695, in to_sql
    return pandas_sql.to_sql(
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/pandas/io/sql.py", line 1738, in to_sql
    total_inserted = sql_engine.insert_records(
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/pandas/io/sql.py", line 1325, in insert_records
    return table.insert(chunksize=chunksize, method=method)
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/pandas/io/sql.py", line 924, in insert
    keys, data_list = self.insert_data()
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/pandas/io/sql.py", line 892, in insert_data
    d = vals.to_pydatetime()
AttributeError: 'ArrowExtensionArray' object has no attribute 'to_pydatetime'
```

2) With engine='adbc', I get the error below --> this can be fixed with a corrected connection string, and the code base should be updated accordingly (example below)

error:
```
df.write_database(
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/polars/dataframe/frame.py", line 2940, in write_database
    with _open_adbc_connection(connection_uri) as conn:
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/polars/io/database.py", line 266, in _open_adbc_connection
    return adbc.connect(connection_uri)
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/adbc_driver_postgresql/dbapi.py", line 101, in connect
    db = adbc_driver_postgresql.connect(uri)
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/adbc_driver_postgresql/__init__.py", line 29, in connect
    return adbc_driver_manager.AdbcDatabase(driver=_driver_path(), uri=uri)
  File "adbc_driver_manager/_lib.pyx", line 551, in adbc_driver_manager._lib.AdbcDatabase.__init__
  File "adbc_driver_manager/_lib.pyx", line 381, in adbc_driver_manager._lib.check_error
adbc_driver_manager._lib.OperationalError: ADBC_STATUS_IO (10): [libpq] Failed to connect: invalid integer value "pw@servername:5432" for connection option "port"
```

fixed adbc connection string:
```
sqlalchemy_connection_string = f"postgresql://{username}:%s@{servername}:5432/{dbname}" % parse.quote_plus(pw)
adbc_connection_string = f"postgresql://{servername}:5432/{dbname}?user={username}&password={pw}"
```
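To illustrate why the original URI failed: an `@` in the password splits a `user:pass@host` URI at the wrong place, so libpq reads `pw@servername:5432` as the port. A small self-contained sketch (the credential values here are made up for illustration) shows the two working styles, percent-encoding the password with the standard library:

```python
from urllib.parse import quote_plus

# Hypothetical credentials; the '@' in the password is what broke the original URI.
username = "mv_admin"
pw = "p@ssw0rd!"
servername = "myserver.postgres.database.azure.com"
dbname = "mydb"

# SQLAlchemy-style URI: credentials sit before the host, so the password must be
# percent-encoded or its '@' is parsed as the userinfo/host separator.
sqlalchemy_uri = f"postgresql://{username}:{quote_plus(pw)}@{servername}:5432/{dbname}"

# libpq/ADBC-style URI: credentials passed as query parameters instead, which
# sidesteps the "invalid integer value ... for connection option 'port'" error.
adbc_uri = f"postgresql://{servername}:5432/{dbname}?user={username}&password={quote_plus(pw)}"

print(sqlalchemy_uri)
print(adbc_uri)
```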

3) After fixing the connection string, I still get this error:

error:
```
df.write_database(
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/polars/dataframe/frame.py", line 2940, in write_database
    with _open_adbc_connection(connection_uri) as conn:
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/adbc_driver_manager/dbapi.py", line 217, in __exit__
    self.close()
  File "/home/mv/.virtualenvs/env_fsspec/lib/python3.10/site-packages/adbc_driver_manager/dbapi.py", line 302, in close
    self._conn.close()
  File "adbc_driver_manager/_lib.pyx", line 851, in adbc_driver_manager._lib.AdbcConnection.close
  File "adbc_driver_manager/_lib.pyx", line 452, in adbc_driver_manager._lib._AdbcHandle._check_open_children
  File "adbc_driver_manager/_lib.pyx", line 454, in adbc_driver_manager._lib._AdbcHandle._check_open_children
RuntimeError: Cannot close AdbcConnection with open AdbcStatement
```

Can you please assist, and help me write from Polars to a PostgreSQL DB?

Many thanks!

Reproducible example

```python
import polars as pl
from datetime import datetime

data = {
    'text': ['foo', 'bar', 'baz'],
    'bool': [True, False, True],
    'datetime': [datetime.now(), datetime.now(), datetime.now()],
    'date': ['2022-03-24', '2022-03-23', '2022-03-22'],
    'float': [1.1, 2.2, 3.3]
}
df = pl.DataFrame(data, schema={
    'text': pl.Utf8,
    'bool': pl.Boolean,
    'datetime': pl.Datetime('us', None),
    'date': pl.Date,
    'float': pl.Float64
})

# username, pw, servername and dbname hold the database credentials (not shown)
sqlalchemy_connection_string = f"postgresql://{username}:{pw}@{servername}:5432/{dbname}"
adbc_connection_string = f"postgresql://{servername}:5432/{dbname}?user={username}&password={pw}"

df.write_database(
    table_name="schema.my_table",
    connection_uri=adbc_connection_string,
    if_exists='append',
    engine='adbc',
)
df.write_database(
    table_name="schema.my_table",
    connection_uri=sqlalchemy_connection_string,
    if_exists='append',
    engine='sqlalchemy',  # was 'adbc'; the SQLAlchemy URI goes with the sqlalchemy engine
)
```

Expected behavior

Expecting the code to write to the database

Installed versions

```
---Version info---
Polars: 0.16.14
Index type: UInt32
Platform: Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31
Python: 3.10.9 (main, Dec 7 2022, 01:12:00) [GCC 9.4.0]
---Optional dependencies---
numpy: 1.23.4
pandas: 1.5.3
pyarrow: 11.0.0
connectorx: 0.3.1
deltalake:
fsspec: 2021.07.0
matplotlib: 3.7.1
xlsx2csv:
xlsxwriter:
```
lidavidm commented 1 year ago

For ADBC, we discussed this on https://github.com/apache/arrow-adbc/issues/538 and I will look at the following:

lidavidm commented 1 year ago

FWIW, the issue is because the PostgreSQL driver only has fairly basic type support (reference), but also, the polars code could suppress this exception by wrapping the cursor in a context manager to close it after use.
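The close-ordering fix described above can be sketched with plain-Python stand-ins (`FakeConnection`/`FakeCursor` are illustrative mocks, not the real `adbc_driver_manager` API): the cursor must be closed before the connection, and `contextlib.closing` guarantees that even if the write raises.

```python
import contextlib

# Illustrative stand-ins mimicking the ADBC close-ordering constraint.
class FakeCursor:
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

class FakeConnection:
    def __init__(self):
        self.cursors = []

    def cursor(self):
        cur = FakeCursor()
        self.cursors.append(cur)
        return cur

    def close(self):
        # Mirrors AdbcConnection: refuses to close while a statement is open.
        if any(not c.closed for c in self.cursors):
            raise RuntimeError("Cannot close AdbcConnection with open AdbcStatement")

conn = FakeConnection()
with contextlib.closing(conn.cursor()) as cur:
    pass  # the ingest/execute work would happen here
conn.close()  # succeeds because the cursor was already closed
print("closed cleanly")
```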

lidavidm commented 1 year ago

And I noticed that the if_exists modes are slightly mismatched, here's an issue to improve that on our side too: https://github.com/apache/arrow-adbc/issues/541

mavestergaard commented 1 year ago

One more thing on this issue. It would be super useful to add a "schema" input variable. This is possible with pandas.DataFrame.to_sql().

With write_database() I can only reach the public schema in my PostgreSQL DB.