pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`from_numpy` cannot deal with nested dtype #18997

Open szsdk opened 1 month ago

szsdk commented 1 month ago

Reproducible example

import numpy as np
import polars as pl

dtype = np.dtype([("a", int), ("b", np.dtype([("bb", float), ("ba", int)]))])
d = np.empty(5, dtype=dtype)
pl.from_numpy(d, schema_overrides={"a": pl.Int64, "b": pl.Struct({"bb": pl.Float64, "ba": pl.Int64})})

Log output

No response

Issue description

Error:

ComputeError: cannot cast 'Object' type

Expected behavior

I would expect from_numpy to handle nested dtypes, since from my perspective a nested NumPy dtype corresponds directly to pl.Struct.

Installed versions

--------Version info---------
Polars:              1.8.2
Index type:          UInt32
Platform:            Linux-6.10.10-arch1-1-x86_64-with-glibc2.40
Python:              3.12.6 (main, Sep  8 2024, 13:18:56) [GCC 14.2.1 20240805]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          2.2.1
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             <not installed>
pandas               2.2.2
pyarrow              17.0.0
pydantic             2.9.2
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                2.3.1
xlsx2csv             <not installed>
xlsxwriter           <not installed>

ritchie46 commented 1 month ago

I don't think this is a bug. Polars doesn't have a tuple type or an equivalent of the specified NumPy type, so the column is read as Object as a fallback. At that point Polars cannot cast it.

This could be a feature request for supporting np.dtype([("bb", float), ("ba", int)]). Is that numpy's form of structs?
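
For reference, NumPy calls this a structured dtype: each element is a record with named fields, and a field may itself be a record. A minimal sketch using the dtype from the report (np.int64 and np.float64 are substituted for int and float here as an assumption, to keep the widths fixed across platforms):

```python
import numpy as np

# Nested structured dtype: field "b" is itself a record with fields "bb" and "ba".
inner = np.dtype([("bb", np.float64), ("ba", np.int64)])
dtype = np.dtype([("a", np.int64), ("b", inner)])

arr = np.zeros(3, dtype=dtype)
arr["a"] = [1, 2, 3]
arr["b"]["bb"] = [0.5, 1.5, 2.5]  # field access returns a view, so this writes into arr

print(dtype.names)       # top-level field names
print(dtype["b"].names)  # field names of the nested record
print(arr["b"]["bb"])    # a nested field, exposed as a plain float array
```

Field access by name is what makes a recursive conversion to pl.Struct possible: every leaf field is an ordinary numeric array.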

cmdlineluser commented 1 month ago

There is support for writing, just not reading.

df = pl.DataFrame(schema={"a": pl.Int64, "b": pl.Struct({"bb": pl.Float64, "ba": pl.Int64})})

arr = df.to_numpy(structured=True)
# array([], dtype=[('a', '<i8'), ('b', [('bb', '<f8'), ('ba', '<i8')])])

pl.from_numpy(arr).schema
# Schema([('a', Int64), ('b', Object)]) # should round-trip?

szsdk commented 1 month ago

This workaround converts NumPy structured arrays to Polars struct columns. I believe it would be a valuable addition to Polars, for example as polars.Series.from_numpy. I am willing to create a pull request for this but would appreciate some guidance.

import numpy as np
import polars as pl

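def numpy_to_structured_series(arr):
    # Leaf arrays collected during the traversal; each one becomes a
    # temporary column named after its index ("0", "1", ...).
    cols = []

    def kernel(arr):
        # Structured dtype: recurse into each named field and wrap the
        # results in a struct expression.
        if arr.dtype.names:
            return pl.struct(kernel(arr[n]).alias(n) for n in arr.dtype.names)
        # Plain dtype: register the array and refer to it by its temporary name.
        cols.append(arr)
        return pl.col(str(len(cols) - 1))

    expression = kernel(arr)
    # Note: the result is a one-column DataFrame (column name ""), not a Series.
    return pl.DataFrame({str(i): col for i, col in enumerate(cols)}).select(
        expression.alias("")
    )

# %%
# Mixed case: scalar field, fixed-size subarray field, nested record.
dtype = np.dtype(
    [("a", np.int32), ("b", np.int32, 4), ("ac", np.dtype([("aca", np.float32)]))]
)
c = np.zeros(1_000_000, dtype=dtype)
c = numpy_to_structured_series(c)

# %%
# The dtype from the original report.
dtype = np.dtype([("a", int), ("b", np.dtype([("bb", float), ("ba", int)]))])
d = np.empty(5, dtype=dtype)
numpy_to_structured_series(d).schema

# %%
# A plain (unstructured) array passes through as a single column.
d = np.zeros(5)
numpy_to_structured_series(d)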