Open szsdk opened 1 month ago
I don't think this is a bug. Polars has no tuple type and no equivalent of the specified NumPy dtype, so the data is read as Object as a fallback. At that point Polars cannot cast the values.
This could be a feature request for supporting np.dtype([("bb", float), ("ba", int)]). Is that NumPy's form of structs?
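On the question above: NumPy calls these structured arrays, and a field of a structured dtype can itself be a structured dtype, which is the closest NumPy analogue of a nested struct. A minimal NumPy-only sketch (the variable names are illustrative, not taken from the issue):

```python
import numpy as np

# A nested structured dtype: field "b" is itself a record with two fields.
inner = np.dtype([("bb", np.float64), ("ba", np.int64)])
outer = np.dtype([("a", np.int64), ("b", inner)])

arr = np.zeros(3, dtype=outer)
arr["a"] = [1, 2, 3]
# Indexing by field name returns a view, so nested assignment writes into arr.
arr["b"]["bb"] = [0.5, 1.5, 2.5]

print(outer.names)        # field names of the outer record
print(outer["b"].names)   # field names of the nested record
print(arr["b"]["bb"])     # one nested field, extracted as a plain float column
```

Each leaf field can be pulled out as an ordinary 1-D array, which is what makes a field-by-field conversion to a Polars struct feasible.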
There is support for writing, just not reading.
df = pl.DataFrame(schema={"a": pl.Int64, "b": pl.Struct({"bb": pl.Float64, "ba": pl.Int64})})
arr = df.to_numpy(structured=True)
# array([], dtype=[('a', '<i8'), ('b', [('bb', '<f8'), ('ba', '<i8')])])
pl.from_numpy(arr).schema
# Schema([('a', Int64), ('b', Object)]) # should round-trip?
This workaround converts NumPy structured arrays to Polars struct series. I believe it would be a valuable addition to the Polars library as polars.Series.from_numpy. I am willing to create a pull request for this but would appreciate some guidance.
import numpy as np
import polars as pl
def numpy_to_structrued_series(arr):
    def kernel(arr):
        # Structured dtype: build a struct expression from its fields, recursively.
        if arr.dtype.names:
            return pl.struct(kernel(arr[n]).alias(n) for n in arr.dtype.names)
        # Leaf field: store the array and refer to it by its positional column name.
        cols.append(arr)
        return pl.col(str(len(cols) - 1))

    cols = []
    expression = kernel(arr)
    return pl.DataFrame({str(i): col for i, col in enumerate(cols)}).select(
        expression.alias("")
    )
# %%
dtype = np.dtype(
    [("a", np.int32), ("b", np.int32, 4), ("ac", np.dtype([("aca", np.float32)]))]
)
c = np.zeros(1000_000, dtype=dtype)
c = numpy_to_structrued_series(c)
# %%
dtype = np.dtype([("a", int), ("b", np.dtype([("bb", float), ("ba", int)]))])
d = np.empty(5, dtype=dtype)
numpy_to_structrued_series(d).schema
# %%
d = np.zeros(5)
numpy_to_structrued_series(d)
Issue description
Error:
Expected behavior
I would expect from_numpy to handle nested dtypes, since from my perspective they are consistent with pl.Struct.
Installed versions