pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`read_database` does not respect data type when all values are null (but `read_database_uri` does) #15978

Open mcrumiller opened 6 months ago

mcrumiller commented 6 months ago

Issue Description

When `pl.read_database` is used and all values in a SQL column are null, a `pl.Null` column is returned regardless of the column's data type in the database (the example here uses MSSQL). When `read_database_uri` is used, the proper dtype is returned. This may only apply to SQLAlchemy connections.

In this example, I have five columns with different dtypes each with a single null value. I've provided the setup below to help reproduce the example.

with read_database_uri: correct dtype returned

import polars as pl

uri = "..."  # my URI
print(pl.read_database_uri("SELECT * FROM TEST_NULLS", uri))
shape: (1, 5)
┌─────────┬───────────┬──────────┬─────────────┬──────────┐
│ int_col ┆ float_col ┆ char_col ┆ varchar_col ┆ date_col │
│ ---     ┆ ---       ┆ ---      ┆ ---         ┆ ---      │
│ i64     ┆ f64       ┆ str      ┆ str         ┆ date     │
╞═════════╪═══════════╪══════════╪═════════════╪══════════╡
│ null    ┆ null      ┆ null     ┆ null        ┆ null     │
└─────────┴───────────┴──────────┴─────────────┴──────────┘

with read_database/sqlalchemy: incorrect dtype returned

import polars as pl
import sqlalchemy as sa

uri = "..."  # my URI
conn = sa.create_engine(uri).connect()
print(pl.read_database("SELECT * FROM TEST_NULLS", conn))
shape: (1, 5)
┌─────────┬───────────┬──────────┬─────────────┬──────────┐
│ int_col ┆ float_col ┆ char_col ┆ varchar_col ┆ date_col │
│ ---     ┆ ---       ┆ ---      ┆ ---         ┆ ---      │
│ null    ┆ null      ┆ null     ┆ null        ┆ null     │  # note all null dtypes
╞═════════╪═══════════╪══════════╪═════════════╪══════════╡
│ null    ┆ null      ┆ null     ┆ null        ┆ null     │
└─────────┴───────────┴──────────┴─────────────┴──────────┘
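The inference gap can be demonstrated without a server, using the stdlib `sqlite3` module as a stand-in DBAPI2 driver (a sketch, not the drivers from the report, though they present the same problem for this purpose): when every fetched value is `None` and the cursor exposes no usable type code, there is nothing left to infer a dtype from.

```python
import sqlite3

# SQLite as a stand-in DBAPI2 driver (stdlib only); pymssql/psycopg
# expose type codes here, but they are driver-specific integers that
# also need a lookup table to be useful.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_nulls(int_col INT, date_col DATE)")
conn.execute("INSERT INTO test_nulls VALUES (NULL, NULL)")
cur = conn.execute("SELECT * FROM test_nulls")
rows = cur.fetchall()

# cursor.description holds (name, type_code, ...) per column, but
# sqlite3 leaves type_code as None, and the fetched values are all
# None too — so a consumer like read_database sees only column names.
print(cur.description)  # type_code slot is None for every column
print(rows)             # [(None, None)]
```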

Setup

Create a table with different data types, with only null values

CREATE TABLE TEST_NULLS(
  int_col INT NULL,
  float_col FLOAT NULL,
  char_col CHAR(5) NULL,
  varchar_col VARCHAR(5) NULL,
  date_col DATE NULL
);
INSERT INTO TEST_NULLS VALUES (NULL, NULL, NULL, NULL, NULL);

Check values:

SELECT * FROM TEST_NULLS;
SELECT COLUMN_NAME, DATA_TYPE FROM information_schema.columns WHERE table_name = 'TEST_NULLS'

(screenshot of the query output: one all-NULL row, plus the expected data types reported by information_schema)

We have five columns each with a single null value.

Installed versions

```
--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.7 (tags/v3.11.7:fa7a6f2, Dec 4 2023, 19:24:49) [MSC v.1937 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:           0.3.2
deltalake:
fastexcel:            0.10.4
fsspec:
gevent:
hvplot:
matplotlib:           3.8.2
nest_asyncio:
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              16.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:           2.0.23
xlsx2csv:             0.8.2
xlsxwriter:           3.1.9
```
alexander-beedie commented 6 months ago

Which driver are you using to connect? Can I see the alchemy URI prefix? (Could be odbc or native, and they may have different behaviour)

mcrumiller commented 6 months ago

I'm using pymssql for the MSSQL connection (uri mssql+pymssql://server/database).

However, I just checked on postgres (postgresql://user:pass@server:5432/database) and the same thing occurs:

# retrieves proper dtype
print(pl.read_database_uri(sql, uri))
# shape: (1, 5)
# ┌─────────┬───────────┬──────────┬─────────────┬──────────┐
# │ int_col ┆ float_col ┆ char_col ┆ varchar_col ┆ date_col │
# │ ---     ┆ ---       ┆ ---      ┆ ---         ┆ ---      │
# │ i32     ┆ f64       ┆ str      ┆ str         ┆ date     │
# ╞═════════╪═══════════╪══════════╪═════════════╪══════════╡
# │ null    ┆ null      ┆ null     ┆ null        ┆ null     │
# └─────────┴───────────┴──────────┴─────────────┴──────────┘

# returns null dtype
print(pl.read_database(sql, conn))
# shape: (1, 5)
# ┌─────────┬───────────┬──────────┬─────────────┬──────────┐
# │ int_col ┆ float_col ┆ char_col ┆ varchar_col ┆ date_col │
# │ ---     ┆ ---       ┆ ---      ┆ ---         ┆ ---      │
# │ null    ┆ null      ┆ null     ┆ null        ┆ null     │
# ╞═════════╪═══════════╪══════════╪═════════════╪══════════╡
# │ null    ┆ null      ┆ null     ┆ null        ┆ null     │
# └─────────┴───────────┴──────────┴─────────────┴──────────┘
alexander-beedie commented 6 months ago

Ok, I know what this is - I haven't yet got around to doing direct driver-module introspection to back out custom cursor-description type codes and further infer Polars dtypes. At the moment only standard Python types (or type strings) are inferred from the cursor result object; custom driver-specific type codes aren't handled yet, so if you have no data (or no typed data - eg: all null) we can't determine the column dtype.

It's something of a nightmare (_"welcome to the wild, wild world of the DBAPI2 type_code attribute"_), but I've done it before in my "real job" so I know where the bodies are buried ;)
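The kind of type_code translation described above can be sketched as follows. The numeric codes, dtype names, and `FAKE_DRIVER_TYPE_CODES` table are all invented for illustration - real drivers each define their own (often undocumented) codes, which is exactly what makes this a nightmare:

```python
# Hypothetical per-driver lookup table; not polars' actual implementation.
FAKE_DRIVER_TYPE_CODES = {
    1: "Int64",
    2: "Float64",
    3: "String",
    4: "Date",
}

def dtype_from_description(description):
    """Back out a dtype name per column from a DBAPI2 cursor.description,
    whose entries are (name, type_code, ...) tuples. Unknown codes fall
    back to "Null" - the behaviour the issue currently observes."""
    return {
        name: FAKE_DRIVER_TYPE_CODES.get(code, "Null")
        for name, code, *_ in description
    }

desc = [
    ("int_col", 1, None, None, None, None, None),
    ("date_col", 4, None, None, None, None, None),
    ("mystery_col", 99, None, None, None, None, None),
]
print(dtype_from_description(desc))
# {'int_col': 'Int64', 'date_col': 'Date', 'mystery_col': 'Null'}
```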

mcrumiller commented 6 months ago

This is probably a rare case that I'd already pre-empted in my testing anyway. I appreciate the help though!

alexander-beedie commented 6 months ago

> This is probably a rare case that I pre-empted into my testing anyway. I appreciate the help though!

I'm actually most of the way to implementing a Polars-specific driver module type_code reverse-lookup/translation; expect a PR to land shortly! ✌️ #blackmagic