pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.14k stars 1.94k forks source link

Schema inference fails when colums are produced without a name with pyodbc and sql server #18526

Closed jgiacobbi closed 2 months ago

jgiacobbi commented 2 months ago

Checks

Reproducible example

import pyodbc
import polars as pl
# Replace conn_str with a connection string for your sql server database
conn_str = "..."
conn = pyodbc.connect(conn_str)
pl.read_database("select UPPER('hello'), UPPER('world')", conn)

Log output

Traceback (most recent call last):
  non polars code
  <snip>
  File "/usr/local/lib/python3.12/dist-packages/polars/io/database/functions.py", line 257, in read_database
    ).to_polars(
      ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/polars/io/database/_executor.py", line 540, in to_polars
    frame = frame_init(
            ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/polars/io/database/_executor.py", line 310, in _from_rows
    return frames if iter_batches else next(frames)  # type: ignore[arg-type]
                                       ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/polars/io/database/_executor.py", line 297, in <genexpr>
    DataFrame(
  File "/usr/local/lib/python3.12/dist-packages/polars/dataframe/frame.py", line 369, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/polars/_utils/construction/dataframe.py", line 462, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/functools.py", line 909, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/polars/_utils/construction/dataframe.py", line 670, in _sequence_of_tuple_to_pydf
    return _sequence_of_sequence_to_pydf(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/polars/_utils/construction/dataframe.py", line 562, in _sequence_of_sequence_to_pydf
    column_names, schema_overrides = _unpack_schema(
                                     ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/polars/_utils/construction/dataframe.py", line 237, in _unpack_schema
    raise ShapeError(msg)
polars.exceptions.ShapeError: data does not match the number of columns

Issue description

SQL Server does not always name modified columns. When trying to extract two columns of datetimes as dates, neither column in the result set has a name.

In the file /usr/local/lib/python3.12/dist-packages/polars/io/database/_executor.py on line 278:

                elif hasattr(self.result, "description"):
                    cursor_desc = {d[0]: d[1:] for d in self.result.description}

With some print debugging additions:

                elif hasattr(self.result, "description"):
                    print("Cursor description contents")
                    for d in self.result.description:
                        print(d)
                    print("----")
                    print("Extracting cursor_desc")
                    cursor_desc = {d[0]: d[1:] for d in self.result.description}
                    for k, v in cursor_desc.items():
                        print(f"{k, v}")
                    print("----")

Output:

Cursor description contents
('', <class 'str'>, None, 5, 5, 0, True)
('', <class 'str'>, None, 5, 5, 0, True)
----
Extracting cursor_desc
('', (<class 'str'>, None, 5, 5, 0, True))
----

As you can see since neither column has a name the key name collides and the expected shape is 1 column instead of 2.

With the query corrected to "select UPPER('hello') as one, UPPER('world') as two":

Cursor description contents
('one', <class 'str'>, None, 5, 5, 0, True)
('two', <class 'str'>, None, 5, 5, 0, True)
----
Extracting cursor_desc
('one', (<class 'str'>, None, 5, 5, 0, True))
('two', (<class 'str'>, None, 5, 5, 0, True))
----
shape: (1, 2)
┌───────┬───────┐
│ one   ┆ two   │
│ ---   ┆ ---   │
│ str   ┆ str   │
╞═══════╪═══════╡
│ HELLO ┆ WORLD │
└───────┴───────┘

Expected behavior

This either should work automatically by generating random column names, or the error should be detected at the cursor_desc stage and reported there.

Installed versions

``` --------Version info--------- Polars: 1.6.0 Index type: UInt32 Platform: Linux-6.5.0-1023-aws-aarch64-with-glibc2.35 Python: 3.12.4 (main, Jun 8 2024, 18:29:57) [GCC 11.4.0] ----Optional dependencies---- adbc_driver_manager altair cloudpickle connectorx deltalake fastexcel fsspec gevent great_tables matplotlib nest_asyncio 1.6.0 numpy 1.26.3 openpyxl pandas 2.2.2 pyarrow pydantic 2.8.2 pyiceberg sqlalchemy torch xlsx2csv xlsxwriter ```
vmgustavo commented 2 months ago

Similar problem here in version 1.2.1. The difference I have is I was using pl.from_records instead of pl.read_database.

alexander-beedie commented 2 months ago

Yup, this should raise a DuplicateError - thanks for the report; fix incoming 👌