Open josiahlund opened 3 months ago
Thanks for the report. This is a bug and has to do with row-wise construction. If you can pass your data in a columnar fashion, you will get the behavior you expect:
import polars as pl
df = pl.DataFrame(
{"x": [["10", "twenty", "30"]]},
schema={"x": pl.List(pl.Int64)},
)
TypeError: unexpected value while building Series of type Int64; found value of type String: "10"
Hint: Try setting `strict=False` to allow passing data with mixed types.
Okay. So just to make sure I understand things properly: there is no intention to allow typecasting (except for int to float) when `strict=True`? Should I open a new issue if I want to request the ability to allow typecasting as is done with `strict=False`, but disallow failed casting?
For context, my use case is building dataframes from existing text files. The files contain text representations of numbers, so I would like to pass these as inputs, but I still want to be alerted if there is a value that cannot be cast successfully.
If you have a Python data structure with lots of strings in it, I would recommend creating a DataFrame with String columns and then casting using `cast`. That will get you the behavior you need. And it will most likely be much more efficient than reading with `strict=False`.
Thank you, I'll give that a try.
Looks like the breakeven point for `cast` being faster is ~2k entries. Below that, I'm seeing better performance casting the values before populating the DataFrame. Seem reasonable?
Hello! My issue seems to be related, so let me post a reproducer in this thread:
import pyarrow as pa
import polars as pl # polars-1.0.0
tt = pa.Table.from_pydict({
    "col": ["a", "a", "b", "b"],
    "col1": [{"x": 1, "y": 2}] * 4,
})
tt = tt.cast(pa.schema([
    pa.field("col", pa.dictionary(pa.int32(), pa.string())),
    pa.field("col1", pa.struct([("x", pa.int32()), ("y", pa.int32())])),
]))
pl.from_arrow(tt) # OK
tt_bad = pa.concat_tables([tt.slice(0, 2), tt.slice(2, 2)]) # same table but chunked into two pieces
pl.from_arrow(tt_bad.select(["col"])) # works
pl.from_arrow(tt_bad.select(["col1"])) # works
pl.from_arrow(tt_bad) # This fails
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File .../python3.8/site-packages/polars/_utils/construction/series.py:296, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    295 try:
--> 296     return constructor(name, values, strict)
    297 except TypeError:

TypeError: 'str' object cannot be interpreted as an integer

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
File .../python3.8/site-packages/polars/_utils/getitem.py:156, in get_df_item_by_key(df, key)
    155 try:
--> 156     return _select_rows(df, key)  # type: ignore[arg-type]
    157 except TypeError:

File .../python3.8/site-packages/polars/_utils/getitem.py:295, in _select_rows(df, key)
    294 _raise_on_boolean_mask()
--> 295 s = pl.Series("", key, dtype=Int64)
    296 indices = _convert_series_to_indices(s, df.height)

File .../python3.8/site-packages/polars/series/series.py:287, in Series.__init__(self, name, values, dtype, strict, nan_to_null)
    286 if isinstance(values, Sequence):
--> 287     self._s = sequence_to_pyseries(
    288         name,
    289         values,
    290         dtype=dtype,
    291         strict=strict,
    292         nan_to_null=nan_to_null,
    293     )
    295 elif values is None:

File .../python3.8/site-packages/polars/_utils/construction/series.py:134, in sequence_to_pyseries(name, values, dtype, strict, nan_to_null)
    133 constructor = polars_type_to_constructor(dtype)
--> 134 pyseries = _construct_series_with_fallbacks(
    135     constructor, name, values, dtype, strict=strict
    136 )
    137 if dtype in (
    138     Date,
    139     Datetime,
   (...)
    145     Decimal,
    146 ):

File .../python3.8/site-packages/polars/_utils/construction/series.py:301, in _construct_series_with_fallbacks(constructor, name, values, dtype, strict)
    300 else:
--> 301     return PySeries.new_from_any_values_and_dtype(
    302         name, values, dtype, strict=strict
    303     )

TypeError: unexpected value while building Series of type Int64; found value of type String: "col"
Hint: Try setting `strict=False` to allow passing data with mixed types.

During handling of the above exception, another exception occurred:

ColumnNotFoundError                       Traceback (most recent call last)
Cell In[54], line 12
     10 pl.from_arrow(tt_bad.select(["col"]))   # works
     11 pl.from_arrow(tt_bad.select(["col1"]))  # works
---> 12 pl.from_arrow(tt_bad)  # This fails

File .../python3.8/site-packages/polars/convert/general.py:433, in from_arrow(data, schema, schema_overrides, rechunk)
    431 if isinstance(data, (pa.Table, pa.RecordBatch)):
    432     return wrap_df(
--> 433         arrow_to_pydf(
    434             data=data,
    435             rechunk=rechunk,
    436             schema=schema,
    437             schema_overrides=schema_overrides,
    438         )
    439     )
    440 elif isinstance(data, (pa.Array, pa.ChunkedArray)):

File .../python3.8/site-packages/polars/_utils/construction/dataframe.py:1182, in arrow_to_pydf(data, schema, schema_overrides, strict, rechunk)
   1179     reset_order = True
   1181 if reset_order:
-> 1182     df = df[names]
   1183 pydf = df._df

File .../python3.8/site-packages/polars/dataframe/frame.py:1183, in DataFrame.__getitem__(self, key)
   1182 """Get part of the DataFrame as a new DataFrame, Series, or scalar."""
-> 1183 return get_df_item_by_key(self, key)

File .../python3.8/site-packages/polars/_utils/getitem.py:158, in get_df_item_by_key(df, key)
    156     return _select_rows(df, key)  # type: ignore[arg-type]
    157 except TypeError:
--> 158     return _select_columns(df, key)

File .../python3.8/site-packages/polars/_utils/getitem.py:206, in _select_columns(df, key)
    205 elif isinstance(first, str):
--> 206     return _select_columns_by_name(df, key)  # type: ignore[arg-type]
    207 else:
    208     msg = f"cannot select columns using Sequence with elements of type {type(first).__name__!r}"

File .../python3.8/site-packages/polars/_utils/getitem.py:254, in _select_columns_by_name(df, key)
    253 def _select_columns_by_name(df: DataFrame, key: Iterable[str]) -> DataFrame:
--> 254     return df._from_pydf(df._df.select(key))

ColumnNotFoundError: col
Can you confirm that we are dealing with the same issue here?
@artemru That looks unrelated. You should open a new issue about `pl.from_arrow` for that example.
Yeah, I agree with @cmdlineluser that it's definitely unrelated. Different library, and the nature of the problem is different: yours raises an error on something that should work, while mine fails to raise on something that shouldn't work, but does.
Reading the title, the following issue seems related:
import polars as pl
df_1 = pl.DataFrame(
data=[("a", 1), ("b", 2.0)],
schema=["key", "value"],
orient="row",
)
df_2 = pl.DataFrame(
data={"key": ["a", "b"], "value": [1, 2.0]},
)
The construction of `df_1` does not raise an error, while the construction of `df_2` raises an error:
Hint: Try setting `strict=False` to allow passing data with mixed types.
Checks
Reproducible example
I'm new to using Polars and have run across a few things that have been confusing to me regarding type casting and strictness of type consistency. I'm sure that some of the confusion here is rooted in my own inexperience with how things are supposed to be used, but I do think that there's, at a minimum, an opportunity to improve the clarity of the documentation, if not the consistency of behaviors between `DataFrame` and `Series`.

From the documentation of `DataFrame`, it is clear to me that when using `strict=False`, an input that is unsuccessfully cast to my called-out data type will be replaced by `null`.

I (incorrectly) inferred that with the default behavior of `strict=True`, an error would be thrown if a type cast failed. Here's an example of what I thought should fail working:

Input:
Output:
Log output
No response
Issue description
In further exploration, here are the other behaviors that have left me scratching my head:
Behavioral differences between Series and DataFrame
Series
Attempting to create a Series of floats from string inputs (`strict=True`) raises a `TypeError`.

Input:
DataFrame
Attempting to create a DataFrame of floats from string inputs (`strict=True`) casts the strings to floats.

Input:

Output:
Selective Strictness
So Series are totally rigid in their type strictness? Nope. A Series will cast from `int` to `float`. (Does this work because `int` is a subclass of `float`?)

Input:

Output:
Expected behavior
As somebody totally new to Polars, I would expect the type-casting behavior between `DataFrame` and `Series` to be consistent. I do appreciate the flexibility that comes as a result of `DataFrame` attempting to cast types, and that even with string inputs (e.g. `"20.0"`), I can populate fields of `float` type. However, the default behavior of `DataFrame` silently proceeding when a type cast fails is unacceptable for my use case.

Installed versions