pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

import numpy with initial null value #17913

Open dpinol opened 1 month ago

dpinol commented 1 month ago

Reproducible example

import numpy as np
import polars as pl

pl.Series("a", np.array([None, "3"]), pl.String)
# and also
pl.from_numpy(np.array([[None], ["3"]]), {"a": pl.String})

Log output

Traceback (most recent call last):
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.12/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-53-9ea226313f8a>", line 1, in <module>
    pl.Series("a", np.array([None, "as"]),pl.String)
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.12/site-packages/polars/series/series.py", line 319, in __init__
    self._s = self.cast(dtype, strict=strict)._s
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.12/site-packages/polars/series/series.py", line 3992, in cast
    return self._from_pyseries(self._s.cast(dtype, strict, wrap_numerical))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: cannot cast 'Object' type

Issue description

When the first value is null, NumPy columns cannot be imported, even when the schema is specified. This is not exactly the same as https://github.com/pola-rs/polars/issues/17484, because in that issue the first value is a NaN. Currently the workaround consists of converting the data to a Python list, but apart from being inefficient, as written, it disables nan_to_null=True:

pl.Series("a", np.array([None, "3"]).tolist(), pl.String)

Expected behavior

It should create a pl.String column containing a null and a string value.

Installed versions

```
pl.show_versions()
--------Version info---------
Polars: 1.2.1
Index type: UInt32
Platform: Linux-6.8.0-31-generic-x86_64-with-glibc2.39
Python: 3.12.3 (main, Apr 11 2024, 10:16:04) [GCC 13.2.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec: 2024.6.1
gevent:
great_tables:
hvplot:
matplotlib: 3.9.1
nest_asyncio: 1.6.0
numpy: 1.26.4
```

coastalwhite commented 1 month ago

This is because NumPy arrays are not nullable. When you put None into a NumPy array, the whole array is converted to a NumPy array of Python objects (dtype=object).

This is not really polars' fault.
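
For illustration, a minimal sketch of the dtype fallback described above, using only NumPy. The variable names are made up for this example; the point is that a None at any position yields an object array, while a strings-only input gets a fixed-width unicode dtype.

```python
import numpy as np

# A None anywhere in the input makes NumPy fall back to dtype=object.
none_first = np.array([None, "3"])
none_middle = np.array(["3", None, "3"])

# With strings only, NumPy picks a fixed-width unicode dtype instead.
strings_only = np.array(["3", "4"])

print(none_first.dtype)         # object
print(none_middle.dtype)        # object
print(strings_only.dtype.kind)  # U
```

Both arrays containing None have the same dtype, so the difference in behavior reported in this thread depends on where the None sits, not on the array's dtype.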

dpinol commented 1 month ago

@coastalwhite thanks for your quick answer! It's a pity because having None in the first element is the only case which fails. Otherwise, it works, even with strict=True.


pl.Series("a", np.array(["3",None, "3"]),pl.String, strict=True)
Out[64]:
shape: (3,)
Series: 'a' [str]
[
    "3"
    null
    "3"
]

Do you see any workaround? Using NaN does not work either.

pl.Series("a", np.array(["3",np.nan, "3"], np.object_),pl.String, nan_to_null=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[65], line 1
----> 1 pl.Series("a", np.array(["3",np.nan, "3"], np.object_),pl.String, nan_to_null=True)

File ~/.local/share/virtualenvs/sa2-CjZGZvYy/lib/python3.12/site-packages/polars/series/series.py:300, in __init__(self, name, values, dtype, strict, nan_to_null)
    297         dtype = pl_dtype
    299 # Handle case where values are passed as the first argument
--> 300 original_name: str | None = None
    301 if name is None:
    302     name = ""

File ~/.local/share/virtualenvs/sa2-CjZGZvYy/lib/python3.12/site-packages/polars/_utils/construction/series.py:455, in numpy_to_pyseries(name, values, strict, nan_to_null)
    453 elif not hasattr(array, "num_chunks"):
    454     pys = PySeries.from_arrow(name, array)
--> 455 else:
    456     if array.num_chunks > 1:
    457         # somehow going through ffi with a structarray
    458         # returns the first chunk every time
    459         if isinstance(array.type, pa.StructType):

TypeError: 'float' object cannot be converted to 'PyString'