pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Series constructor logic overhaul #14427

Closed · stinodego closed this issue 5 months ago

stinodego commented 9 months ago

There have been a host of issues about the constructors of Series/DataFrame not working as expected. I recently sat down with Ritchie to think out the way it should work. We'll start with Series and address DataFrame later (they are related, of course).

Intended behavior

There are four modes of construction:

dtype provided | strict
---------------|-------
yes            | yes
yes            | no
no             | yes
no             | no

Implementation

Rust

On the Rust side, these constructors are represented by two methods: Series::from_any_values (no dtype provided) and Series::from_any_values_and_dtype (dtype provided).

Both have a strict parameter.

These methods will follow the logic outlined above.

Python

The Python side will not dispatch directly to the two Rust methods outlined above, because interpreting the full input as AnyValues is expensive. Instead, those methods will function as a fallback for when the fast paths do not work.

There are a number of constructors defined in the Rust bindings, each specific to a single dtype, which extract the Python objects directly into that dtype. This may fail, hence the need for a fallback.

During strict construction, these fast constructors are used. The data type is determined either by the dtype input of the user, or by the first non-null entry in the data. If the constructor fails due to unexpected data, it raises an error.

During non-strict construction, we also attempt to use these fast constructors, using the same dtype inference. However, if an error occurs due to unexpected data, we fall back to the appropriate Rust constructor. This will be relatively slow, but it can be avoided by sanitizing your data beforehand and using strict construction instead.
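
In rough, self-contained Python, the intended dispatch looks something like the sketch below. The helper names (fast_path, fallback, construct) are illustrative only; the real logic lives in the Rust bindings, and the real fallback also performs proper supertype resolution rather than casting everything to the first inferred type.

def fast_path(values, target):
    # Fast constructor: every non-null value must already match the target type.
    for v in values:
        if v is not None and not isinstance(v, target):
            raise TypeError(f"unexpected value {v!r} for type {target.__name__}")
    return list(values)

def fallback(values, target):
    # Slow fallback: cast each value where possible, set uncastable values to null.
    out = []
    for v in values:
        try:
            out.append(target(v) if v is not None else None)
        except (TypeError, ValueError):
            out.append(None)
    return out

def construct(values, dtype=None, strict=True):
    # The target type is either the user-provided dtype or the type of the
    # first non-null entry in the data.
    target = dtype or type(next(v for v in values if v is not None))
    try:
        return fast_path(values, target)    # fast path, attempted in both modes
    except TypeError:
        if strict:
            raise                           # strict: unexpected data is an error
        return fallback(values, target)     # non-strict: slow fallback

construct([1, 2, 3])                                   # fast path: [1, 2, 3]
construct([1, 2, "3", "x"], dtype=int, strict=False)   # fallback: [1, 2, 3, None]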

Examples

# Strict mode is the default and always uses fast paths
pl.Series([1, 2, 3])  # works
pl.Series([1, 2, 3.0])  # fails, inferred Int64 type, encountered float
pl.Series([1.0, 2, 3])  # fails, inferred Float64 type, encountered integer***
pl.Series([1, 2, 3.0], dtype=pl.Float64)  # fails, dtype set to Float64, encountered integer***
pl.Series(['1', '2', '3'])  # works, String
pl.Series(['1', '2', '3'], dtype=pl.Int64)  # fails, strings cannot always be interpreted as integers

# Non-strict mode always works
pl.Series([1, 2, 3], strict=False)  # uses fast path, Int64
pl.Series([1, 2, 3.0], strict=False)  # uses slow fallback, Float64
pl.Series([1.0, 2, 3], strict=False)  # uses fast path, Float64
pl.Series([1.0, 2, 'a'], strict=False)  # uses slow fallback, all values cast to String
pl.Series([date(2022, 1, 1), True, False], strict=False)  # uses slow fallback, bools set to null, Date
pl.Series([1, 2, '3', 'x'], strict=False, dtype=pl.Int32)  # uses slow fallback, string values cast to int when possible, Int32

*** Large integers may lose precision when interpreted as floats, so strict mode does not allow this.

Pros and cons

Pro:

Con:

alexander-beedie commented 9 months ago

Looks like a very sensible cleanup... (And no implicit compute by default :)) ✌️

Wainberg commented 9 months ago

For context, NumPy, pandas and pyarrow all automatically upcast to float64 when there's a mix of integers and floats:

>>> np.array([1, 2, 3.])
array([1., 2., 3.])
>>> pd.Series([1, 2, 3.])
0    1.0
1    2.0
2    3.0
dtype: float64
>>> pa.array([1, 2, 3.])
<pyarrow.lib.DoubleArray object at 0x7f8758fc4940>
[
  1,
  2,
  3
]

whereas NumPy and pandas auto-cast strings to integers when setting dtype=int, but pyarrow gives an error:

>>> np.array(['1', '2', '3'], dtype=int)
array([1, 2, 3])
>>> pd.Series(['1', '2', '3'], dtype=int)
0    1
1    2
2    3
dtype: int64
>>> pa.array(['1', '2', '3'], pa.int64())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert '1' with type str: tried to convert to int64

Similarly, NumPy and pandas auto-upcast to dtype=object when some elements are strings, but pyarrow gives an error:

>>> np.array([1.0, 2, 'a', date(2022, 1, 1), True, False])
array([1.0, 2, 'a', datetime.date(2022, 1, 1), True, False], dtype=object)
>>> pd.Series([1.0, 2, 'a', date(2022, 1, 1), True, False])
0           1.0
1             2
2             a
3    2022-01-01
4          True
5         False
dtype: object
>>> pa.array([1.0, 2, 'a', date(2022, 1, 1), True, False])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 344, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'a' with type str: tried to convert to double

None of the three libraries has a strict argument or silently sets data to null (or NaN). pyarrow has a safe argument (True by default) to "check for overflows or other unsafe conversions", but setting safe=False doesn't affect any of the above cases.

Wainberg commented 9 months ago

I had a chance to discuss this with a couple of my students today. We all agreed that silently setting data to null is dangerous (one of them called it "nuts"). They both liked how the strict mode disallows mixing ints and floats by default, even though I leaned towards allowing that.

We ultimately converged on the following proposal:

pl.Series([1, 2, 3])  # works
pl.Series([1, 2, 3.0])  # fails, inferred Int64 type, encountered float
pl.Series([1.0, 2, 3])  # fails, inferred Float64 type, encountered integer
pl.Series([1, 2, 3.0], dtype=pl.Float64)  # works, Float64
pl.Series([1, 2, 3.0], dtype=pl.Int64)  # works, Int64
pl.Series([1, 2, 3.5], dtype=pl.Int64)  # works, Int64 (casts 3.5 to 3)
pl.Series(['1', '2', '3'])  # works, String
pl.Series(['1', '2', '3'], dtype=pl.Int64)  # works, Int64

pl.Series([1.0, 2, 'a'])  # fails
pl.Series([1.0, 2, 'a'], dtype=pl.String)  # works, String
pl.Series([1.0, 2, 'a'], dtype=pl.Object)  # works, Object

pl.Series([date(2022, 1, 1), True, False])  # fails
pl.Series([date(2022, 1, 1), True, False], dtype=pl.String)  # works, String
pl.Series([date(2022, 1, 1), True, False], dtype=pl.Object)  # works, Object

pl.Series([1, 2, '3', 'x'], dtype=pl.Int32)  # fails: 'x' can't be cast to Int32
pl.Series([1, 2, '3', 'x'], dtype=pl.String)  # works, String
pl.Series([1, 2, '3', 'x'], dtype=pl.Object)  # works, Object

If you did want to allow the auto-nulling behavior, you could have a strict parameter that matches the behavior of strict in cast(). In that case, the main difference from your proposal would be that setting a dtype would auto-cast even when strict=True, and would only give an error when something couldn't be cast, rather than always giving an error.
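
For reference, strict in cast() behaves roughly like this today (a small sketch based on my understanding of the current semantics):

import polars as pl

s = pl.Series(["1", "2", "x"])
s.cast(pl.Int64, strict=False)  # [1, 2, null]: values that cannot be cast become null
s.cast(pl.Int64)                # strict=True is the default and raises because of 'x'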

stinodego commented 9 months ago

If a dtype is specified, try to cast everything to that dtype, and fail if any element can't be cast.

The problem is that this is very slow. Every value is passed to the Rust side, then we have to figure out which data type it is, then we have to try to cast it, and then check if that worked correctly and potentially raise.

A strict mode allows for a much faster way to parse the data as we can make certain assumptions (all data is the same type / all data matches the given dtype).

Wainberg commented 9 months ago

A strict mode allows for a much faster way to parse the data

I don't think this is correct, but I could be missing something. You still have to check that all data matches the given dtype, even in strict mode, so that you know whether to raise an error. So allowing pl.Series([1, 2, 3.0], dtype=pl.Int64) wouldn't slow down pl.Series([1, 2, 3]), not even slightly.

In other words, there are no cases that would be made faster by having a strict mode; it just forces people to specify strict=False if they want the slow cases.

The problem with your proposal is that there's no way to say "I want to allow mixed types, but please still raise an error if it's impossible to cast an element to my desired type". If the user wants to allow mixed types, they're forced to also allow dangerous auto-nulling.

In practice, to handle mixed types safely, you would either have to cast ahead of time in Python and use strict=True, or use strict=False and assert that no nulls were introduced due to the auto-nulling. Both these workarounds introduce additional burden on the user, and are also slower than simply attempting the cast in Rust and raising if the cast is impossible.
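
Concretely, the two workarounds would look something like this (a sketch assuming the non-strict behaviour described at the top of this issue, where strings are cast to integers when possible and otherwise set to null):

import polars as pl

values = [1, 2, "3", "x"]  # hypothetical mixed-type input

# Workaround 1: cast ahead of time in Python and construct strictly.
# int('x') raises, which is the desired "fail loudly" behaviour, but the
# casting now happens in slow Python code.
try:
    s = pl.Series([int(v) for v in values], dtype=pl.Int32)
except ValueError:
    pass

# Workaround 2: construct non-strictly, then check that auto-nulling did not
# silently introduce nulls. For this input the assertion fails, which is how
# the silent null for 'x' would be caught.
s = pl.Series(values, dtype=pl.Int32, strict=False)
assert s.null_count() == sum(v is None for v in values), "auto-nulling introduced nulls"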

stinodego commented 9 months ago

Perhaps we could make your proposed design performant. But the problem is that your dtype parameter is now overloaded: it controls both the strictness and the return dtype.

Let's say I want to create a Series of Int8 and I want to be certain my input data is clean (no expensive casting). Using the API proposed in the original issue I can do pl.Series([...], dtype=pl.Int8) and that's that.

Using your design, this is not possible. Specifying the dtype enables casting non-integer values. So if there's a string in there with the value '5', it will be cast without my knowledge. Or if there is a float in there it will be cast and precision lost. This may be your intention, or it may not be. You could do pl.Series([...]).cast(pl.Int8) but that requires materializing everything as Int64 first, which is unnecessarily expensive.
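
To make the contrast concrete (values standing in for some input that is known to be clean):

import polars as pl

values = [1, 2, 3]  # assumed to already be clean Int8-compatible integers

# Original proposal: strict construction straight into Int8; anything that is
# not an integer raises instead of being cast behind my back.
s = pl.Series(values, dtype=pl.Int8)

# Under "dtype implies casting", the equivalent safe route materializes the
# data as Int64 first and then casts down, which is unnecessarily expensive.
s = pl.Series(values).cast(pl.Int8)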

Wainberg commented 9 months ago

You make a compelling point. It feels like these three all need to be separate arguments:

  1. the return dtype
  2. whether to auto-cast
  3. whether to auto-null (if you wanted to support this; personally I would be fine always raising an error)

My proposal mixes 1 and 2, which you correctly point out has disadvantages. Your proposal mixes 2 and 3, which has different disadvantages.

Guillermogsjc commented 7 months ago

There is another important point here that should not be missed.

Beyond the theoretical optimizations proposed here by students, Polars has in the last two years become a huge production engineering bastion, used by tons of enterprises in their data science and data engineering processes.

So there is an important element called backward compatibility that should, up to a point, take priority over theoretical dissertations, such that for any breaking change the following hypothesis test needs to be made.

For a very good theoretical change that implies a breaking change with no backward compatibility: H0: it is not worth it; H1: it is worth it, such that past behaviour can be broken.

And for breaking changes, all the cases that the breaking change misses always need to be analyzed, and alternatives offered at the engineering level, in order not to break production workloads without an alternative...

Tools are tools, and when they are really good tools (like Polars), plenty of use cases are built on top of them, especially in libraries that involve data engineering. The number of use cases is vast, so breaking changes need to account for all the related use cases, and provide alternatives at the engineering level when something is missed by the break.

schema_overrides https://github.com/pola-rs/polars/issues/15471 is an example.

stinodego commented 7 months ago

There is another important point here that should not be missed

This is off-topic. You can read up on our policy for breaking changes in the user guide.

The functionality of schema_overrides will be addressed when we get to DataFrames (currently focused on Series). The intended behavior is listed here: https://github.com/pola-rs/polars/issues/11723#issuecomment-1768346275

johnjozwiakmodulus commented 7 months ago

Are there any examples of Rust API calls for from_any_values_and_dtype?

(It's seemingly the case that Rust takes a lower priority in fleshing out documentation than Python bindings, which seems ironic.)