pola-rs / polars

pl.from_numpy produces column with null dtype when input array is empty #18310

Open danpatton opened 2 months ago

danpatton commented 2 months ago

Reproducible example

import numpy as np
import polars as pl
from polars.testing import assert_frame_equal

data = np.zeros((0,))
assert_frame_equal(pl.from_numpy(data, schema=["foo"]), pl.Series("foo", data).to_frame())

Log output

Traceback (most recent call last):
  File "repro.py", line 6, in <module>
    assert_frame_equal(pl.from_numpy(data, schema=["foo"]), pl.Series("foo", data).to_frame())
  File ".venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".venv/lib/python3.11/site-packages/polars/testing/asserts/frame.py", line 91, in assert_frame_equal
    _assert_frame_schema_equal(
  File ".venv/lib/python3.11/site-packages/polars/testing/asserts/frame.py", line 187, in _assert_frame_schema_equal
    raise_assertion_error(objects, detail, left_schema_dict, right_schema_dict)
  File ".venv/lib/python3.11/site-packages/polars/testing/asserts/utils.py", line 12, in raise_assertion_error
    raise AssertionError(msg) from cause
AssertionError: DataFrames are different (dtypes do not match)
[left]:  {'foo': Null}
[right]: {'foo': Float64}

Issue description

If you change the shape of data to (1,), the assertion passes.

Expected behavior

The assertion passes when the shape of the data is (0,).

Installed versions

```
Polars: 1.5.0
Index type: UInt32
Platform: Linux-6.5.0-44-generic-x86_64-with-glibc2.35
Python: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib:
nest_asyncio:
numpy: 2.1.0
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```

deanm0000 commented 2 months ago

The issue is that pl.from_numpy(data, schema=["foo"]) returns a DataFrame of shape (0, 0). Since a DataFrame is just a collection of Series, having 0 columns means there is no Series to store dtype info.

A Series doesn't have columns, it is a column, so when you do pl.Series("foo", data) it has no issue stashing the dtype in itself.

@alexander-beedie is there a workable solution here? Maybe make from_numpy return (0,1) instead of (0,0)?

ion-elgreco commented 2 months ago

If you explicitly provide a schema with the dtypes this is not an issue

danpatton commented 2 months ago

> If you explicitly provide a schema with the dtypes this is not an issue

I tried this, but it gives the same behaviour:

assert_frame_equal(pl.from_numpy(data, schema={"foo": pl.Float32}), pl.Series("foo", data).to_frame())