pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`strict` parameter not applied correctly during row-wise DataFrame construction #17375

Open josiahlund opened 3 months ago

josiahlund commented 3 months ago

Reproducible example

I'm new to using Polars and have run across a few things that have confused me regarding type casting and the strictness of type consistency. I'm sure that some of the confusion here is rooted in my own inexperience with how things are supposed to be used, but I do think that there is, at a minimum, an opportunity to improve the clarity of the documentation, if not the consistency of behavior between DataFrame and Series.

From the documentation of DataFrame, it is clear to me that when using strict=False, an input that cannot be cast to the data type I specify will be replaced by null.

strict : bool, default True

    Throw an error if any data value does not exactly match the given or inferred data type for that column.
    If set to False, values that do not match the data type are cast to that data type or, if casting is not
    possible, set to null instead.

I (incorrectly) inferred that with the default behavior of strict=True, an error would be thrown if a type cast failed. Here's an example that I expected to fail, but that succeeds:

Input:

pl.DataFrame([{"x": ["10", "twenty", "30"]}], schema={"x": pl.List(pl.Int64)})  # Casts the strings to integers

Output:

shape: (1, 1)
┌────────────────┐
│ x              │
│ ---            │
│ list[i64]      │
╞════════════════╡
│ [10, null, 30] │
└────────────────┘

Log output

No response

Issue description

In further explorations, here are the other behaviors that I've been left scratching my head at:

Behavioral differences between Series and DataFrame

Series

Attempting to create a Series of floats from string inputs (strict=True) raises a TypeError.

Input:

pl.Series([["10", "20", "30"]], dtype=pl.List(pl.Float64))  # Raises TypeError, recommends setting `strict=False`
```python-traceback
TypeError: must be real number, not str

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 4, in
    pl.Series([["10", "20", "30"]], dtype=pl.List(pl.Float64))  # Raises TypeError, recommends setting `strict=False`
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\.venv\Lib\site-packages\polars\series\series.py", line 287, in __init__
    self._s = sequence_to_pyseries(
              ^^^^^^^^^^^^^^^^^^^^^
  File "\.venv\Lib\site-packages\polars\_utils\construction\series.py", line 240, in sequence_to_pyseries
    else sequence_to_pyseries(
         ^^^^^^^^^^^^^^^^^^^^^
  File "\.venv\Lib\site-packages\polars\_utils\construction\series.py", line 134, in sequence_to_pyseries
    pyseries = _construct_series_with_fallbacks(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "\.venv\Lib\site-packages\polars\_utils\construction\series.py", line 301, in _construct_series_with_fallbacks
    return PySeries.new_from_any_values_and_dtype(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unexpected value while building Series of type Float64; found value of type String: "10"

Hint: Try setting `strict=False` to allow passing data with mixed types.
```

DataFrame

Attempting to create a DataFrame of floats from string inputs (strict=True) casts the strings to floats.

Input:

pl.DataFrame([{"x": ["10", "20", "30"]}], schema={"x": pl.List(pl.Float64)})  # Casts the strings to floats

Output:

shape: (1, 1)
┌────────────────────┐
│ x                  │
│ ---                │
│ list[f64]          │
╞════════════════════╡
│ [10.0, 20.0, 30.0] │
└────────────────────┘

Selective Strictness

So Series are totally rigid in their type strictness? Nope. A Series will cast from int to float. (Does this work because int is a subclass of float?)

Input:

pl.Series([[10, 20, 30]], dtype=pl.List(pl.Float64))  # Casts the integers to floats

Output:

shape: (1,)
Series: '' [list[f64]]
[
    [10.0, 20.0, 30.0]
]
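On the subclass question: in Python, `int` is not a subclass of `float`. One possible explanation, which is an assumption and not something confirmed in this thread, is that the float constructor relies on Python's C-level float conversion, which accepts any real number (including `int`) but rejects `str`. The standard-library sketch below illustrates that conversion rule, using `math.sqrt` purely as a stand-in; it does not call Polars.

```python
import math

# int and float are separate types; neither is a subclass of the other.
int_is_float_subclass = issubclass(int, float)
print(int_is_float_subclass)

# Functions that need a C double accept an int and widen it to float.
root = math.sqrt(16)
print(root)

# The same conversion refuses a str, even one that looks numeric.
try:
    math.sqrt("16")
    rejected = False
except TypeError as exc:
    rejected = True
    print(f"TypeError: {exc}")
```

If this is indeed the mechanism, it would explain why integers are accepted for a Float64 Series while numeric-looking strings are not.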

Expected behavior

As somebody totally new to Polars, I would expect the type-casting behavior between DataFrame and Series to be consistent. I do appreciate the flexibility that comes as a result of DataFrame attempting to cast types, and that even with string inputs (e.g. "20.0") I can populate fields of float type. However, the default behavior of DataFrame silently proceeding when a type cast fails is unacceptable for my use case.

Installed versions

```
--------Version info---------
Polars:      1.0.0
Index type:  UInt32
Platform:    Windows-10-10.0.19045-SP0
Python:      3.12.2 (tags/v3.12.2:6abddd9, Feb 6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)]
```

stinodego commented 3 months ago

Thanks for the report. This is a bug and has to do with row-wise construction. If you can pass your data in a columnar fashion, you will get the behavior you expect:

import polars as pl

df = pl.DataFrame(
    {"x": [["10", "twenty", "30"]]},
    schema={"x": pl.List(pl.Int64)},
)
TypeError: unexpected value while building Series of type Int64; found value of type String: "10"

Hint: Try setting `strict=False` to allow passing data with mixed types.
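
For data that arrives row by row, the columnar form can be built in plain Python before it reaches Polars. The helper below is a hypothetical sketch (`rows_to_columns` is not a Polars function), and it assumes every row dict has the same keys.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a dict mapping column name -> list of values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            # Append each value to its column, creating the column on first sight.
            columns.setdefault(name, []).append(value)
    return columns


# The row-wise input from the original report.
rows = [{"x": ["10", "twenty", "30"]}]
columns = rows_to_columns(rows)
print(columns)
```

Passing `columns` to `pl.DataFrame` together with the `schema`, as in the snippet above, then takes the columnar construction path.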
josiahlund commented 3 months ago

Okay. So just to make sure I'm understanding things properly, there is no intention to allow typecasting (except for int to float) when strict=True?

Should I open up a new issue if I want to request the ability to allow typecasting as is done with strict=false, but disallow failed casting?

For context, my use case is building dataframes from existing text files. The files contain text representations of numbers, so I would like to pass these as inputs, but I still want to be alerted if there is a value that cannot be cast successfully.

stinodego commented 3 months ago

If you have a Python data structure with lots of strings in it, I would recommend creating a DataFrame with String columns and then casting using cast. That will get you the behavior you need. And it will most likely be much more efficient than reading with strict=False.

josiahlund commented 3 months ago

Thank you, I'll give that a try.

josiahlund commented 3 months ago

Looks like the breakeven point for cast being faster is ~2k entries. Below that, I'm seeing better performance casting the values in Python before populating the DataFrame. Seem reasonable?
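
For the small-input case, converting the values in Python first might look like the sketch below. `parse_ints` is a hypothetical helper, not part of Polars; it fails on the first value that cannot be parsed, which is the alert the use case asks for.

```python
def parse_ints(values):
    """Convert text values to int, raising on the first value that does not parse."""
    parsed = []
    for position, text in enumerate(values):
        try:
            parsed.append(int(text))
        except ValueError as exc:
            # Re-raise with the offending value and its position for easier debugging.
            raise ValueError(
                f"cannot parse {text!r} at position {position} as an integer"
            ) from exc
    return parsed


good = parse_ints(["10", "20", "30"])
print(good)

try:
    parse_ints(["10", "twenty", "30"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

The resulting list of integers can then be passed to the DataFrame constructor with the default strict=True.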

artemru commented 3 months ago

Hello, my issue seems to be related, so let me post the reproducer in this thread:

import pyarrow as pa
import polars as pl # polars-1.0.0
tt = pa.Table.from_pydict({"col" : ["a", "a", "b", "b"],
                           "col1": [{"x" : 1, "y" : 2}] * 4})
tt = tt.cast(pa.schema([pa.field("col", pa.dictionary(pa.int32(), pa.string())),
                        pa.field("col1", pa.struct([("x", pa.int32()), ("y", pa.int32())]))]))
pl.from_arrow(tt) # OK

tt_bad = pa.concat_tables([tt.slice(0, 2), tt.slice(2, 2)])  # same table but chunked into two pieces
pl.from_arrow(tt_bad.select(["col"])) # works
pl.from_arrow(tt_bad.select(["col1"])) # works
pl.from_arrow(tt_bad) # This fails
traceback

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File .../python3.8/site-packages/polars/_utils/construction/series.py:296, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    295 try:
--> 296     return constructor(name, values, strict)
    297 except TypeError:

TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
File .../python3.8/site-packages/polars/_utils/getitem.py:156, in get_df_item_by_key(df, key)
    155 try:
--> 156     return _select_rows(df, key) # type: ignore[arg-type]
    157 except TypeError:

File .../python3.8/site-packages/polars/_utils/getitem.py:295, in _select_rows(df, key)
    294 _raise_on_boolean_mask()
--> 295 s = pl.Series("", key, dtype=Int64)
    296 indices = _convert_series_to_indices(s, df.height)

File .../python3.8/site-packages/polars/series/series.py:287, in Series.__init__(self, name, values, dtype, strict, nan_to_null)
    286 if isinstance(values, Sequence):
--> 287     self._s = sequence_to_pyseries(
    288         name,
    289         values,
    290         dtype=dtype,
    291         strict=strict,
    292         nan_to_null=nan_to_null,
    293     )
    295 elif values is None:

File .../python3.8/site-packages/polars/_utils/construction/series.py:134, in sequence_to_pyseries(name, values, dtype, strict, nan_to_null)
    133 constructor = polars_type_to_constructor(dtype)
--> 134 pyseries = _construct_series_with_fallbacks(
    135     constructor, name, values, dtype, strict=strict
    136 )
    137 if dtype in (
    138     Date,
    139     Datetime,
   (...)
    145     Decimal,
    146 ):

File .../python3.8/site-packages/polars/_utils/construction/series.py:301, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    300 else:
--> 301     return PySeries.new_from_any_values_and_dtype(
    302         name, values, dtype, strict=strict
    303     )

TypeError: unexpected value while building Series of type Int64; found value of type String: "col"

Hint: Try setting `strict=False` to allow passing data with mixed types.

During handling of the above exception, another exception occurred:

ColumnNotFoundError                       Traceback (most recent call last)
Cell In[54], line 12
     10 pl.from_arrow(tt_bad.select(["col"])) # works
     11 pl.from_arrow(tt_bad.select(["col1"])) # works
---> 12 pl.from_arrow(tt_bad) # This fails

File .../python3.8/site-packages/polars/convert/general.py:433, in from_arrow(data, schema, schema_overrides, rechunk)
    370 """
    371 Create a DataFrame or Series from an Arrow Table or Array.
    372
   (...)
    429 ]
    430 """ # noqa: W505
    431 if isinstance(data, (pa.Table, pa.RecordBatch)):
    432     return wrap_df(
--> 433         arrow_to_pydf(
    434             data=data,
    435             rechunk=rechunk,
    436             schema=schema,
    437             schema_overrides=schema_overrides,
    438         )
    439     )
    440 elif isinstance(data, (pa.Array, pa.ChunkedArray)):
    441     name = getattr(data, "_name", "") or ""

File .../python3.8/site-packages/polars/_utils/construction/dataframe.py:1182, in arrow_to_pydf(data, schema, schema_overrides, strict, rechunk)
   1179 reset_order = True
   1181 if reset_order:
-> 1182     df = df[names]
   1183 pydf = df._df
   1185 if column_names != original_schema and (schema_overrides or original_schema):

File .../python3.8/site-packages/polars/dataframe/frame.py:1183, in DataFrame.__getitem__(self, key)
   1169 def __getitem__(
   1170     self,
   1171     key: (
   (...)
   1180     ),
   1181 ) -> DataFrame | Series | Any:
   1182     """Get part of the DataFrame as a new DataFrame, Series, or scalar."""
-> 1183     return get_df_item_by_key(self, key)

File .../python3.8/site-packages/polars/_utils/getitem.py:158, in get_df_item_by_key(df, key)
    156     return _select_rows(df, key) # type: ignore[arg-type]
    157 except TypeError:
--> 158     return _select_columns(df, key)

File .../python3.8/site-packages/polars/_utils/getitem.py:206, in _select_columns(df, key)
    204     return _select_columns_by_index(df, key) # type: ignore[arg-type]
    205 elif isinstance(first, str):
--> 206     return _select_columns_by_name(df, key) # type: ignore[arg-type]
    207 else:
    208     msg = f"cannot select columns using Sequence with elements of type {type(first).__name__!r}"

File .../python3.8/site-packages/polars/_utils/getitem.py:254, in _select_columns_by_name(df, key)
    253 def _select_columns_by_name(df: DataFrame, key: Iterable[str]) -> DataFrame:
--> 254     return df._from_pydf(df._df.select(key))

ColumnNotFoundError: col

Can you confirm that we are dealing with the same issue here?

cmdlineluser commented 3 months ago

@artemru That looks unrelated. You should open a new issue about pl.from_arrow for that example.

josiahlund commented 3 months ago

Yeah, I agree with @cmdlineluser that this is definitely unrelated. It's a different library, and the nature of the problem is different. Yours raises an error on something that should work. Mine doesn't raise an error on something that shouldn't work.

MariusMerkleQC commented 2 months ago

Reading the title, the following issue seems related:

import polars as pl

df_1 = pl.DataFrame(
    data=[("a", 1), ("b", 2.0)],
    schema=["key", "value"],
    orient="row",
)

df_2 = pl.DataFrame(
    data={"key": ["a", "b"], "value": [1, 2.0]},
)

The construction of df_1 does not raise an error, while the construction of df_2 raises an error:

Hint: Try setting `strict=False` to allow passing data with mixed types.