Open Julian-J-S opened 11 months ago
I personally would be in favor of skipping whitespace in the cast as it shouldn't cost too much performance even if there is never any whitespace.
In read csv we now treat all whitespace as data. Cleaning up whitespace is compute that should be done afterwards.
Agree 👍🏻, this is the way 😎
Description
When casting types, polars handles whitespace differently across datatypes and functions.
Parsing Integers (polars: no whitespace allowed; all others: allowed)
When parsing integers (`cast` / `to_integer`), polars does not allow any whitespace. This is in contrast to all other common libraries.
Parsing Dates (polars: whitespace allowed)
On the other hand, when parsing dates in polars, whitespace is fine.
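As one point of comparison for the "all other common libraries" remark on integers, Python's built-in `int()` accepts leading and trailing whitespace. The snippet below is plain Python and does not call polars; the sample strings are made up for illustration.

```python
# Python's built-in int() ignores surrounding whitespace when parsing.
values = ["7", " 7", "7 ", " 7 "]  # illustrative inputs, not taken from the issue

parsed = [int(v) for v in values]
print(parsed)
```

Every variant parses to the same integer, which is the behavior the integer `cast` in polars is being contrasted with.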
Parsing Integers using `read_csv`
Using `read_csv`, there again seems to be different logic. In this case leading whitespace is allowed but trailing whitespace is not, so `" 7"` is fine and becomes 7 (number), but `"7 "` is not and stays `"7 "` (text).
Problem Summary
Inconsistent Parsing Across Types
Using `cast(<TYPE>)`:
integers: whitespace is not allowed and yields null (if `strict=False`); e.g. `" 7 "` -> null
dates: whitespace is allowed; e.g. `" 2023-01-01 "` -> 2023-01-01 (Date)
Inconsistent Across Functions
Parsing integers using `cast` vs. using `read_csv`:
`cast`: no whitespace allowed anywhere; e.g. `" 7"` / `"7 "` -> null
`read_csv`: leading whitespace allowed but not trailing; e.g. `" 7"` -> 7 (number) BUT `"7 "` -> `"7 "` (text)
Goal
It would be really awesome if polars had a consistent casting strategy across all functions and types.
As a user this is really problematic, and I would guess that there are many bugs already in production: users don't even realise that a cast fails because of some whitespace, since the same input works in other functions or for other types.
I do not care whether whitespace is allowed or not, but it should be consistent.
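To illustrate what "consistent" could mean here, below is a minimal pure-Python sketch in which one whitespace rule is shared by every target type. It is not polars code: `parse_with_policy` and its `allow_whitespace` flag are hypothetical names invented for this example, and returning `None` on failure only loosely mirrors a non-strict cast.

```python
from datetime import date
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def parse_with_policy(
    text: str, parser: Callable[[str], T], *, allow_whitespace: bool
) -> Optional[T]:
    """Hypothetical helper: apply one whitespace rule, then parse.

    Returns None when parsing fails, similar in spirit to a non-strict cast.
    """
    if allow_whitespace:
        text = text.strip()  # tolerate surrounding whitespace for every type
    elif text != text.strip():
        return None  # reject surrounding whitespace for every type
    try:
        return parser(text)
    except ValueError:
        return None


# Whichever policy is chosen, integers and dates are treated the same way.
for policy in (True, False):
    ints = [parse_with_policy(v, int, allow_whitespace=policy)
            for v in ["7", " 7", "7 "]]
    dates = [parse_with_policy(v, date.fromisoformat, allow_whitespace=policy)
             for v in ["2023-01-01", " 2023-01-01 "]]
    print(policy, ints, dates)
```

The point of the sketch is only that the whitespace decision is made in one place, so it cannot differ between types or between entry points.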