pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.34k stars 1.96k forks source link

`cast` and other parsers should handle whitespace consistently #12767

Open Julian-J-S opened 11 months ago

Julian-J-S commented 11 months ago

Description

when casting types polars handles whitespace differently across datatypes and functions.

Parsing Integers (polars: no whitespace allowed; all others: allowed)

when parsing integers (cast/to_integer) polars does not allow any whitespace. This is in contrast to all other common libraries.

DATA = ["7", "     7", "7     "]

# polars
pl.DataFrame({"text": DATA}).with_columns(
    cast=pl.all().cast(pl.Int64, strict=False),
    to_integer=pl.all().str.to_integer(strict=False),
)
#┌────────┬──────┬────────────┐
#│ text   ┆ cast ┆ to_integer │
#│ ---    ┆ ---  ┆ ---        │
#│ str    ┆ i64  ┆ i64        │
#╞════════╪══════╪════════════╡
#│ 7      ┆ 7    ┆ 7          │ <<< only works without any whitespace
#│      7 ┆ null ┆ null       │
#│ 7      ┆ null ┆ null       │
#└────────┴──────┴────────────┘

# pandas
pd.DataFrame({"text": DATA}).assign(
    astype_int=lambda df_: df_["text"].astype("int32")
)
 #    text  astype_int
#0       7           7
#1       7           7
#2  7                7

# duckdb
duckdb.sql(
    """
select *, columns(*)::int as val
from (select '7', '     7', '7     ')
"""
)
#┌─────────┬──────────┬──────────┬───────┬───────┬───────┐
#│   '7'   │ '     7' │ '7     ' │  val  │  val  │  val  │
#│ varchar │ varchar  │ varchar  │ int32 │ int32 │ int32 │
#├─────────┼──────────┼──────────┼───────┼───────┼───────┤
#│ 7       │      7   │ 7        │     7 │     7 │     7 │
#└─────────┴──────────┴──────────┴───────┴───────┴───────┘

# python native
(int("7"), int("     7"), int("7     "), int("     7     "))
# (7, 7, 7, 7)

Parsing Dates (polars: whitespace allowed)

on the other hand when parsing dates in polars whitespace is fine

DATA = ["2023-01-01", "     2023-01-01", "2023-01-01     "]

pl.DataFrame({"text": DATA}).with_columns(
    cast=pl.all().cast(pl.Date, strict=False),
)
#┌─────────────────┬────────────┐
#│ text            ┆ cast       │
#│ ---             ┆ ---        │
#│ str             ┆ date       │
#╞═════════════════╪════════════╡
#│ 2023-01-01      ┆ 2023-01-01 │
#│      2023-01-01 ┆ 2023-01-01 │
#│ 2023-01-01      ┆ 2023-01-01 │
#└─────────────────┴────────────┘

Parsing Integers using read_csv

Using read_csv there seems to again be a different logic. In this case leading whitespace is allowed but no trailing whitespace

pl.read_csv(StringIO("none,leading,trailing\n7,     7,7     "))
#┌──────┬─────────┬──────────┐
#│ none ┆ leading ┆ trailing │
#│ ---  ┆ ---     ┆ ---      │
#│ i64  ┆ i64     ┆ str      │
#╞══════╪═════════╪══════════╡
#│ 7    ┆ 7       ┆ 7        │
#└──────┴─────────┴──────────┘

so

Problem Summary

Inconsistent Parsing Across Types

using cast(<TYPE>)

Inconsistent Across Functions

parsing intergers using cast VS using read_csv

Goal

It would be really awesome if polars had a consistent casting stragety across all functions and types.

As a user this is really problematic and I would guess that there are many bugs already in production because users dont even realise that casting fails because of some whitespace because it works in other functions or for other types.

I do not care if whitespace should be allowed or not but it should be consistent.

orlp commented 11 months ago

I personally would be in favor of skipping whitespace in the cast as it shouldn't cost too much performance even if there is never any whitespace.

ritchie46 commented 2 months ago

In read csv we now treat all whitespace as data. Cleaning up whitespace is compute that should be done afterwards.

Julian-J-S commented 2 months ago

In read csv we now treat all whitespace as data. Cleaning up whitespace is compute that should be done afterwards.

Agree 👍🏻, this is the way 😎