pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Allow numbers with thousand separators to be parsed as Integer in `read_csv` #18465

Open david-waterworth opened 2 months ago

david-waterworth commented 2 months ago

Description

I'm running into issues reading TSV files, exported as UTF-16, that include numeric columns formatted with thousands separators (i.e. commas). When polars infers the schema, the type flips between str and int64 depending on whether the column contains a value > 999. I posted on SO, but all the replies were more refined versions of my work-around, i.e. replace the thousands separator with a zero-length string before casting to integer.

The issue here is that I have to do this for any numeric column that might contain a value > 999, which is data-dependent. I found a work-around, which is to simply apply the replacement to all the numeric columns, even the ones inferred as int rather than string, i.e.

from io import BytesIO
import polars as pl

# minimal repro: UTF-16 TSV where column B contains a thousands separator
data = BytesIO(
    """Id\tA\tB
1\t537\t2,288
2\t325\t1,047
3\t98\t194
""".encode("utf-16")
)

df = pl.read_csv(data, encoding="utf-16", separator="\t")
# B is inferred as pl.String (its values contain commas); A is Int64
df = df.with_columns(pl.col("A", "B").str.replace_all(",", "").cast(pl.Int64))
# Now A and B are both Int64

This seems like something that could break in future releases. Does pl.col("A", "B").str select both A and B and then filter by type (pl.String) before calling replace_all? Otherwise it seems like a side effect?
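
One way to avoid depending on that behaviour, as a minimal sketch using polars.selectors: intersect the named columns with the String-typed ones, so the replacement only ever runs on columns actually inferred as strings.

import polars.selectors as cs

df = df.with_columns(
    # only the requested columns that were actually inferred as String;
    # anything already Int64 is left untouched, so no implicit cast occurs
    (cs.by_name("A", "B") & cs.string())
    .str.replace_all(",", "")
    .cast(pl.Int64)
)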

In any case it's awkward. If I specify that a column is Int64 (i.e. using schema= or .cast(pl.Int64)) and the strings are in a format that can be coerced to an integer, ideally Polars should do this, but when there's a thousands separator it doesn't. Or at least there should be an explicit str.convert(pl.Int64) that works in this case?
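
Until something like that exists, here is a minimal sketch of the work-around as a reusable helper (the name read_csv_thousands and the int_columns parameter are made up for illustration, not a polars API). It forces every column to String with infer_schema_length=0, so schema inference can't flip types, then strips the separator and casts:

def read_csv_thousands(source, int_columns, thousands=",", **kwargs):
    # infer_schema_length=0 reads every column as pl.String, so inference
    # can't flip a column between str and int64 based on its values
    df = pl.read_csv(source, infer_schema_length=0, **kwargs)
    return df.with_columns(
        pl.col(int_columns)
        .str.replace_all(thousands, "", literal=True)
        .cast(pl.Int64)
    )

data.seek(0)  # rewind the buffer consumed by the read above
df = read_csv_thousands(data, ["Id", "A", "B"], encoding="utf-16", separator="\t")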

cmdlineluser commented 2 months ago

Yeah, pandas has thousands= for that.

>>> pl.from_pandas(pd.read_csv(data, encoding="utf-16", sep="\t", thousands=","))
shape: (3, 3)
┌─────┬─────┬──────┐
│ Id  ┆ A   ┆ B    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ i64  │
╞═════╪═════╪══════╡
│ 1   ┆ 537 ┆ 2288 │
│ 2   ┆ 325 ┆ 1047 │
│ 3   ┆ 98  ┆ 194  │
└─────┴─────┴──────┘
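
(A note if replaying this in the same session as the first snippet: the earlier pl.read_csv call reads the BytesIO buffer to the end, so rewind it with data.seek(0), and import pandas as pd, before the pandas call.)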

It seems .str.replace_all is actually casting from i64 to str:

>>> df.select(pl.col("A").str.replace_all(",", ""))
shape: (3, 1)
┌─────┐
│ A   │
│ --- │
│ str │
╞═════╡
│ 537 │
│ 325 │
│ 98  │
└─────┘

Which feels like it should raise an InvalidOperationError instead?
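
In the meantime, writing the cast explicitly doesn't depend on that implicit behaviour, e.g.:

df = df.with_columns(
    # explicit str cast, so this keeps working even if replace_all
    # stops accepting non-string inputs
    pl.col("A", "B").cast(pl.String).str.replace_all(",", "").cast(pl.Int64)
)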