pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
28.02k stars 1.72k forks source link

`write_database` fails for UInts and Time dtypes when ADBC used #17312

Open mcrumiller opened 2 weeks ago

mcrumiller commented 2 weeks ago

Checks

Reproducible example

Here is a check that goes through all of the basic (non-nested) dtypes:

from datetime import date, time
from datetime import datetime as dt
from datetime import timedelta as td

from adbc_driver_postgresql import dbapi

import polars as pl

uri = "postgresql://mcrumiller@winhost/mydb"
conn = dbapi.connect(uri)

def try_write(data, dtype):
    """Attempt to write single-column df using ADBC."""
    df = pl.DataFrame({"a": pl.Series(data, dtype=dtype)})
    try:
        df.write_database("test_dtype", conn, if_table_exists="replace")
    except Exception:
        print(f"{dtype} FAIL")

# these all fail
try_write([1, 2, 3], dtype=pl.UInt8)  # FAIL!
try_write([1, 2, 3], dtype=pl.UInt16)  # FAIL!
try_write([1, 2, 3], dtype=pl.UInt32)  # FAIL!
try_write([1, 2, 3], dtype=pl.UInt64)  # FAIL!
try_write([time(1), time(2), time(3)], pl.Time)  # FAIL!

# these all pass
try_write([1, 2, 3], dtype=pl.Int8)
try_write([1, 2, 3], dtype=pl.Int16)
try_write([1, 2, 3], dtype=pl.Int32)
try_write([1, 2, 3], dtype=pl.Int64)
try_write([True, True, False], pl.Boolean)
try_write(["a", "b", "c"], pl.String)
try_write(["a", "b", "c"], pl.Categorical)
try_write(["a", "b", "c"], pl.Enum(["a", "b", "c"]))
try_write([date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)], pl.Date)
try_write([dt(2024, 1, 1), dt(2024, 1, 2), dt(2024, 1, 3)], pl.Datetime)
try_write([td(1), td(2), td(3)], pl.Duration)

Log output

UInt8 FAIL
UInt16 FAIL
UInt32 FAIL
UInt64 FAIL
Time FAIL

Due to:

adbc_driver_manager.NotSupportedError: NOT_IMPLEMENTED: [libpq] Field #1 ('a') has unsupported type for ingestion uint8

Issue description

df.write_database(...) fails for certain dtypes because adbc does not support those types. We should probably upcast where possible, e.g. u8 -> i16, u16 -> i32, u64 -> i64 if possible, warn if overflow.

Installed versions

main

mcrumiller commented 2 weeks ago

@alexander-beedie

alexander-beedie commented 2 weeks ago

Hmm... might want to bring it to the attention of the ADBC folks too, since presumably it'll affect all client connections/data (eg: non-polars callers uploading pyarrow, etc) 🤔

mcrumiller commented 2 weeks ago

might want to bring it to the attention of the ADBC folks too

I'm writing up an issue for them, but figured we should have a workaround in the meantime.

mcrumiller commented 2 weeks ago

https://github.com/apache/arrow-adbc/issues/1950

alexander-beedie commented 1 week ago

I'm writing up an issue for them, but figured we should have a workaround in the meantime.

Looks like they're moving to fix this - better that than applying a workaround on our side. Let's see if it makes their next release 👌

mcrumiller commented 1 week ago

@alexander-beedie I think we may want to at least put in a version check along with a warning telling the user to upcast, or do an automatic upcast ourselves based on adbc version. Their Milestone 14 isn't due until mid-Aug. Not that that's super far away, but polars moves quickly :)