Closed: stinodego closed this issue 5 months ago.
Looks like a very sensible cleanup... (And no implicit compute by default :)) ✌️
For context, NumPy, pandas and pyarrow all automatically upcast to float64 when there's a mix of integers and floats:
>>> np.array([1, 2, 3.])
array([1., 2., 3.])
>>> pd.Series([1, 2, 3.])
0 1.0
1 2.0
2 3.0
dtype: float64
>>> pa.array([1, 2, 3.])
<pyarrow.lib.DoubleArray object at 0x7f8758fc4940>
[
1,
2,
3
]
whereas NumPy and pandas auto-cast strings to integers when setting dtype=int, but pyarrow gives an error:
>>> np.array(['1', '2', '3'], dtype=int)
array([1, 2, 3])
>>> pd.Series(['1', '2', '3'], dtype=int)
0 1
1 2
2 3
dtype: int64
>>> pa.array(['1', '2', '3'], pa.int64())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int64
Similarly, NumPy and pandas auto-upcast to dtype=object when some elements are strings, but pyarrow gives an error:
>>> np.array([1.0, 2, 'a', date(2022, 1, 1), True, False])
array([1.0, 2, 'a', datetime.date(2022, 1, 1), True, False], dtype=object)
>>> pd.Series([1.0, 2, 'a', date(2022, 1, 1), True, False])
0 1.0
1 2
2 a
3 2022-01-01
4 True
5 False
dtype: object
>>> pa.array([1.0, 2, 'a', date(2022, 1, 1), True, False])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'a' with type str: tried to convert to double
None of the three libraries has a strict argument, and none silently sets data to null (or nan). pyarrow has a safe argument (True by default) to "check for overflows or other unsafe conversions", but setting safe=False doesn't affect any of the above cases.
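For instance, repeating the string-to-int case above with safe=False still raises the same ArrowInvalid error:
>>> pa.array(['1', '2', '3'], pa.int64(), safe=False)  # still ArrowInvalid: could not convert '1'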
I had a chance to discuss this with a couple of my students today. We all agreed that silently setting data to null is dangerous (one of them called it "nuts"). They both liked how strict mode disallows mixing ints and floats by default, even though I leaned towards allowing that.
We ultimately converged on the following proposal:
pl.Series([1, 2, 3]) # works
pl.Series([1, 2, 3.0]) # fails, inferred Int64 type, encountered float
pl.Series([1.0, 2, 3]) # fails, inferred Float64 type, encountered integer
pl.Series([1, 2, 3.0], dtype=pl.Float64) # works, Float64
pl.Series([1, 2, 3.0], dtype=pl.Int64) # works, Int64
pl.Series([1, 2, 3.5], dtype=pl.Int64) # works, Int64 (casts 3.5 to 3)
pl.Series(['1', '2', '3']) # works, String
pl.Series(['1', '2', '3'], dtype=pl.Int64) # works, Int64
pl.Series([1.0, 2, 'a']) # fails
pl.Series([1.0, 2, 'a'], dtype=pl.String) # works, String
pl.Series([1.0, 2, 'a'], dtype=pl.Object) # works, Object
pl.Series([date(2022, 1, 1), True, False]) # fails
pl.Series([date(2022, 1, 1), True, False], dtype=pl.String) # works, String
pl.Series([date(2022, 1, 1), True, False], dtype=pl.Object) # works, Object
pl.Series([1, 2, '3', 'x'], dtype=pl.Int32) # fails: 'x' can't be cast to Int32
pl.Series([1, 2, '3', 'x'], dtype=pl.String) # works, String
pl.Series([1, 2, '3', 'x'], dtype=pl.Object) # works, Object
If you did want to allow the auto-nulling behavior, you could have a strict parameter that matches the behavior of strict in cast(). In that case, the main difference from your proposal would be that setting a dtype would auto-cast even when strict=True, and it would only give an error when something couldn't be cast, rather than always giving an error.
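For reference, here is a minimal sketch of what matching cast()'s strict behavior looks like (results assume the current defaults):

pl.Series(['1', '2', 'x']).cast(pl.Int64, strict=True)   # raises: 'x' cannot be cast to Int64
pl.Series(['1', '2', 'x']).cast(pl.Int64, strict=False)  # works: [1, 2, null]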
If a dtype is specified, try to cast everything to that dtype, and fail if any element can't be cast.
The problem is that this is very slow. Every value is passed to the Rust side, then we have to figure out which data type it is, then we have to try to cast it, and then check if that worked correctly and potentially raise.
A strict mode allows for a much faster way to parse the data as we can make certain assumptions (all data is the same type / all data matches the given dtype).
A strict mode allows for a much faster way to parse the data
I don't think this is correct, but I could be missing something. You still have to check that all data matches the given dtype, even in strict mode, so that you know whether to raise an error. So allowing pl.Series([1, 2, 3.0], dtype=pl.Int64) wouldn't slow down pl.Series([1, 2, 3]), not even slightly.
In other words, there are no cases that would be made faster by having a strict mode; it just forces people to specify strict=False if they want the slow cases.
The problem with your proposal is that there's no way to say "I want to allow mixed types, but please still raise an error if it's impossible to cast an element to my desired type". If the user wants to allow mixed types, they're forced to also allow dangerous auto-nulling.
In practice, to handle mixed types safely, you would either have to cast ahead of time in Python and use strict=True, or use strict=False and assert that no nulls were introduced by the auto-nulling. Both these workarounds introduce an additional burden on the user, and are also slower than simply attempting the cast in Rust and raising if the cast is impossible.
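Concretely, the two workarounds might look something like this (a rough sketch with made-up data; the non-strict behavior follows the description in this thread):

data = [1, 2, "3", None]

# Workaround 1: cast ahead of time in Python, then construct strictly.
cleaned = [int(v) if isinstance(v, str) else v for v in data]
s1 = pl.Series(cleaned, dtype=pl.Int64)  # strict=True is the default

# Workaround 2: allow auto-nulling, then assert that no extra nulls were introduced.
s2 = pl.Series(data, dtype=pl.Int64, strict=False)
assert s2.null_count() == sum(v is None for v in data), "values were silently set to null"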
Perhaps we could make your proposed design performant. But the problem is that your dtype parameter is now overloaded: it controls both the strictness and the return dtype.
Let's say I want to create a Series of Int8 and I want to be certain my input data is clean (no expensive casting). Using the API proposed in the original issue, I can do pl.Series([...], dtype=pl.Int8) and that's that.
Using your design, this is not possible. Specifying the dtype enables casting non-integer values. So if there's a string in there with the value '5', it will be cast without my knowledge. Or if there's a float in there, it will be cast and precision will be lost. This may be your intention, or it may not be. You could do pl.Series([...]).cast(pl.Int8), but that requires materializing everything as Int64 first, which is unnecessarily expensive.
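Put differently (a sketch under the API proposed in the original issue):

# Direct construction into the target dtype: no intermediate materialization.
s = pl.Series([1, 2, 3], dtype=pl.Int8)

# Construct then cast: the data is first materialized as Int64, then cast down.
s = pl.Series([1, 2, 3]).cast(pl.Int8)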
You make a compelling point. It feels like these three all need to be separate arguments:
My proposal mixes 1 and 2, which you correctly point out has disadvantages. Your proposal mixes 2 and 3, which has different disadvantages.
There is another important point here that should not be missed.
Setting aside the theoretical optimizations proposed by the students: over the last two years Polars has become a major piece of production engineering, used by many enterprises in their data science and data engineering processes.
So there is an important element called backward compatibility that should, up to a point, take priority over theoretical discussions. For any breaking change, the following hypothesis test needs to be made:
For a very good theoretical change that implies a breaking change with no backward compatibility: H0: it is not worth it; H1: it is worth it, to the point that past behaviour can be broken.
And for any breaking change, all the cases that the change misses need to be analyzed, and alternatives offered at the engineering level, so that production workloads are not broken without an alternative.
Tools are tools, and when they are really good tools (like Polars), plenty of use cases get built on top of them, especially in libraries that involve data engineering, where the number of use cases is vast. So breaking changes need to account for all the related use cases, trying not to miss any, or to offer engineering-level alternatives when something is missed by the break.
schema_overrides (https://github.com/pola-rs/polars/issues/15471) is an example.
There is another important point here that should not be missed
This is off-topic. You can read up on our policy for breaking changes in the user guide.
The functionality of schema_overrides will be addressed when we get to DataFrames (we are currently focused on Series). The intended behavior is listed here: https://github.com/pola-rs/polars/issues/11723#issuecomment-1768346275
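For readers unfamiliar with it, schema_overrides is the DataFrame constructor parameter that overrides the inferred dtype for individual columns; a minimal example:

df = pl.DataFrame(
    {"a": [1, 2, 3], "b": ["x", "y", "z"]},
    schema_overrides={"a": pl.Float32},  # "a" would otherwise be inferred as Int64
)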
Are there any examples of Rust API calls for from_any_values_and_dtype?
(It seems that fleshing out documentation for the Rust API takes a lower priority than the Python bindings, which seems ironic.)
There have been a host of issues about the constructors of Series/DataFrame not working as expected. I recently sat down with Ritchie to think out the way it should work. We'll start with Series and address DataFrame later (they are related, of course).
Intended behavior
There are four modes of construction: strict or non-strict, each with the dtype either specified by the user or inferred from the data.
Implementation
Rust
On the Rust side, these constructors are represented by two methods:
from_any_values_and_dtype
from_any_values
Both have a strict parameter. These methods will follow the logic outlined above.
Python
The Python side will not directly dispatch to the two Rust methods outlined above, because interpreting the full input as AnyValue is expensive. They will function as a fallback for when the fast paths do not work.
There are a few constructors defined in the Rust bindings. These constructors are for a specific dtype and will extract the Python objects directly into that dtype. This may fail, hence the need for a fallback.
During strict construction, these fast constructors are used. The data type is determined either by the dtype input of the user, or by the first non-null entry in the data. If the constructor fails due to unexpected data, it raises an error.
During non-strict construction, we also attempt to use these fast constructors, using the same dtype inference. However, if an error occurs due to unexpected data, we fall back to the appropriate Rust constructor. This will be relatively slow, but this can be avoided by sanitizing your data beforehand and using strict construction instead.
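As a rough sketch of that dispatch logic (the helper names here are purely illustrative, not actual Polars internals):

# Illustrative sketch only: infer_dtype, fast_construct and construct_from_any_values
# are hypothetical helpers standing in for the real bindings.
def construct_series(name, values, dtype=None, strict=True):
    # Target dtype: either given by the user or inferred from the first non-null entry.
    target = dtype or infer_dtype(next((v for v in values if v is not None), None))
    try:
        # Fast path: a dtype-specific constructor extracts the Python objects directly.
        return fast_construct(name, values, target)
    except TypeError:
        if strict:
            # Strict construction: unexpected data is an error.
            raise
        # Non-strict construction: fall back to the slower AnyValue-based constructor,
        # which casts where possible and sets to null what it cannot cast.
        return construct_from_any_values(name, values, target, strict=False)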
Examples
*** Large integers may lose precision when interpreted as floats, so strict mode does not allow this.
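To illustrate: 2**53 + 1 is the first integer that a 64-bit float cannot represent exactly, so a round-trip through Float64 silently changes the value:
>>> float(2**53) == float(2**53 + 1)  # 9007199254740993 rounds to 9007199254740992.0
True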
Pros and cons
Pro:
Con: