pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`from_numpy` does not support object-dtyped string arrays with missing values #17484

Open Wainberg opened 4 months ago

Wainberg commented 4 months ago

Reproducible example

>>> pl.Series(np.array(['foo', float('nan')], dtype=object))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../polars/series/series.py", line 299, in __init__
    self._s = numpy_to_pyseries(
              ^^^^^^^^^^^^^^^^^^
  File ".../polars/_utils/construction/series.py", line 442, in numpy_to_pyseries
    return constructor(
           ^^^^^^^^^^^^
TypeError: 'float' object cannot be converted to 'PyString'
>>> pl.from_numpy(np.array(['foo', float('nan')], dtype=object))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../polars/convert/general.py", line 346, in from_numpy
    numpy_to_pydf(
  File ".../polars/_utils/construction/dataframe.py", line 1298, in numpy_to_pydf
    pl.Series(
  File ".../polars/series/series.py", line 299, in __init__
    self._s = numpy_to_pyseries(
              ^^^^^^^^^^^^^^^^^^
  File ".../polars/_utils/construction/series.py", line 442, in numpy_to_pyseries
    return constructor(
           ^^^^^^^^^^^^
TypeError: 'float' object cannot be converted to 'PyString'

Log output

No response

Issue description

NumPy string arrays with missing values are commonly represented with dtype=object, using np.nan for the missing entries. Converting such an array to Polars now raises an error; I don't believe this happened pre-1.0.

Expected behavior

This should convert to pl.Series(['foo', None]).
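As a workaround until this is resolved, the NaN entries can be replaced with None before handing the array to Polars. This is a minimal sketch, not part of the Polars API; the helper name nan_to_none is hypothetical:

```python
import math
import numpy as np

def nan_to_none(arr: np.ndarray) -> np.ndarray:
    """Return a copy of an object array with float NaN entries set to None."""
    out = arr.copy()
    for i, v in enumerate(out):
        # Only float NaN is treated as missing; everything else is kept as-is.
        if isinstance(v, float) and math.isnan(v):
            out[i] = None
    return out

cleaned = nan_to_none(np.array(["foo", float("nan")], dtype=object))
print(cleaned.tolist())  # ['foo', None]
```

The cleaned array then converts fine, since Polars already accepts None as the null sentinel in object arrays (see the example below).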

Installed versions

``` Replace this line with the output of pl.show_versions(). Leave the backticks in place. ```
ritchie46 commented 4 months ago

@stinodego can this one be on your stack somewhere?

stinodego commented 4 months ago

Before going into the nitty gritty: you can use None for null values in object arrays to get the behavior you want:

arr = np.array(["foo", None], dtype=object)
res = pl.from_numpy(arr)
print(res)
shape: (2, 1)
┌──────────┐
│ column_0 │
│ ---      │
│ str      │
╞══════════╡
│ foo      │
│ null     │
└──────────┘

So what should the result be when the input is ["foo", float("nan")]? I guess an Object Series with the objects "foo" and float("nan"). We should definitely not assume float("nan") is a null value for object types, unless I guess nan_to_null=True.

The reason this worked in earlier versions is that we simply set any non-string input to null (e.g. ["hello", 1234] would also become ["hello", null]). Now we raise here. I don't think the previous nulling behavior was good either.

Currently, for object types, we check if the first non-null is a string or bytes object and then assume the rest is as well: https://github.com/pola-rs/polars/blob/2b54214f862d2792e393fa00edb0fdf238f3d330/py-polars/polars/datatypes/constructor.py#L136-L144

We can just remove the logic but then ["a", "b"] will be parsed as an Object Series.

I guess we need to do a full scan over the Series to check if all objects are string or bytes?
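The full-scan check would amount to something like the following. This is an illustrative sketch, not the actual Polars internals:

```python
def all_string_like(values) -> bool:
    """Full scan: True if every non-null element is str or bytes.

    None is ignored (treated as null), so an all-null column also passes.
    """
    return all(v is None or isinstance(v, (str, bytes)) for v in values)

print(all_string_like(["a", None, b"c"]))        # True
print(all_string_like(["a", float("nan")]))      # False
```

The downside is that it is O(n) over the whole column, which is exactly the cost the current first-non-null shortcut avoids.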

EDIT: Discussed with Ritchie - we'll have to expose infer_schema_length and figure out that way if it's a string column or not - rather than taking the first-non-null.
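The infer_schema_length approach would cap that scan at the first N non-null values, analogous to how other Polars constructors infer dtypes. A rough sketch of the proposed inference (names and default are hypothetical, not the eventual implementation):

```python
def infer_is_string_column(values, infer_schema_length: int = 100) -> bool:
    """Inspect up to infer_schema_length non-null values; True if all are str/bytes."""
    seen = 0
    for v in values:
        if v is None:
            continue  # nulls don't count toward the inference budget
        if not isinstance(v, (str, bytes)):
            return False
        seen += 1
        if seen >= infer_schema_length:
            break  # budget exhausted; assume the rest matches
    return True

print(infer_is_string_column(["a", None, "b"]))  # True
print(infer_is_string_column(["a", 1234]))       # False
```

With a finite budget this can still misclassify a mixed column whose first N non-null values happen to be strings, which is the usual trade-off with schema inference; infer_schema_length=None would fall back to the full scan.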

Baukebrenninkmeijer commented 2 weeks ago

@stinodego I'm looking to start contributing to Polars; would this be a suitable first ticket to start with? It seems reasonably small and clearly scoped. Would appreciate any advice!