Wainberg opened this issue 4 months ago
@stinodego can this one be on your stack somewhere?
Before going into the nitty gritty: you can use `None` for null values in object arrays to get the behavior you want:
```python
import numpy as np
import polars as pl

arr = np.array(["foo", None], dtype=object)
res = pl.from_numpy(arr)
print(res)
```
```text
shape: (2, 1)
┌──────────┐
│ column_0 │
│ ---      │
│ str      │
╞══════════╡
│ foo      │
│ null     │
└──────────┘
```
So what should the result be when the input is `["foo", float("nan")]`? I guess an Object Series with the objects `"foo"` and `float("nan")`. We should definitely not assume `float("nan")` is a null value for object types, unless I guess `nan_to_null=True` is set.
The reason this worked on earlier patches is that we would just set any non-string input to null (e.g. `["hello", 1234]` would also result in `["hello", null]`). Now we raise here. I don't think the previous nulling behavior is good either.
Currently, for object types, we check if the first non-null is a string or bytes object and then assume the rest is as well: https://github.com/pola-rs/polars/blob/2b54214f862d2792e393fa00edb0fdf238f3d330/py-polars/polars/datatypes/constructor.py#L136-L144
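That heuristic can be sketched in pure Python (a simplification for illustration; the real logic lives in the linked `constructor.py`):

```python
def looks_like_string_column(values: list) -> bool:
    # Simplified first-non-null heuristic: find the first non-None value
    # and assume the whole column matches its type.
    first = next((v for v in values if v is not None), None)
    return isinstance(first, (str, bytes))

# The failure mode: a str first element makes the column look like a
# string column, even though a later element is a float NaN.
print(looks_like_string_column(["foo", float("nan")]))  # True
```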
We could just remove that logic, but then `["a", "b"]` would be parsed as an Object Series. I guess we need to do a full scan over the Series to check whether all objects are string or bytes?
EDIT: Discussed with Ritchie - we'll have to expose `infer_schema_length` and figure out that way whether it's a string column or not, rather than taking the first non-null.
@stinodego I'm looking to start contributing to Polars - would this be a suitable first ticket to start with? It seems reasonably small and clearly scoped. Would appreciate any advice!
### Checks

### Reproducible example

### Log output

No response

### Issue description

NumPy string arrays with missing values are often represented with `dtype=object` and `np.nan` for the missing values. This gives an error when converting to Polars, which I don't think was there pre-1.0.

### Expected behavior

This should convert to `pl.Series(['foo', None])`.

### Installed versions