david-waterworth opened 2 months ago
Yeah, pandas has thousands= for that.
>>> pl.from_pandas(pd.read_csv(data, encoding="utf-16", sep="\t", thousands=","))
shape: (3, 3)
┌─────┬─────┬──────┐
│ Id ┆ A ┆ B │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪══════╡
│ 1 ┆ 537 ┆ 2288 │
│ 2 ┆ 325 ┆ 1047 │
│ 3 ┆ 98 ┆ 194 │
└─────┴─────┴──────┘
It seems .str.replace_all is actually casting from i64 to str:
>>> df.select(pl.col("A").str.replace_all(",", ""))
shape: (3, 1)
┌─────┐
│ A │
│ --- │
│ str │
╞═════╡
│ 537 │
│ 325 │
│ 98 │
└─────┘
Which feels like it should raise an InvalidOperationError instead?
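For comparison, a minimal sketch that spells the cast out instead of relying on the implicit one, using the A column from the example above:

# Sketch: make the str cast explicit, strip the separator, then cast back to Int64.
df.select(
    pl.col("A").cast(pl.String).str.replace_all(",", "").cast(pl.Int64)
)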
Description
I'm running into issues reading tsv files that have been exported as utf-16 and that include numeric columns formatted with thousands separators (i.e. commas). When Polars attempts to infer the schema, the type will flip/flop between str and int64 depending on whether the column contains a value > 999 or not. I posted on SO, but all the replies were more refined versions of my work-around, i.e. to replace the thousands separator with a zero-length string before converting to integer.

The issue here is that I have to do this on any numeric column that contains a value > 999, which is data dependent. I found a work-around, which is to simply use replace on all numeric columns even if they're inferred as int rather than string, as in the sketch below.
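Roughly, a sketch of that work-around, assuming the A and B columns from the example above:

# Apply str.replace_all to every numeric column, whether it was inferred as
# str or as Int64 (relying on the implicit cast noted above), then cast back.
df = df.with_columns(
    pl.col("A", "B").str.replace_all(",", "").cast(pl.Int64)
)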
This seems like something that could break in future releases - does pl.col('A', 'B').str select both A and B and then filter by type (pl.String) before calling replace? Otherwise it seems like a side-effect?

In any case it's awkward: if I specify that a column is Int64 (i.e. using schema= or .cast(pl.Int64)) and it's in a format that can be coerced to an Int from a string, ideally Polars should do this - but when there's a thousands separator it doesn't. Or at least there should be an explicit str.convert(pl.Int64) that works in this case?
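For reference, a sketch of a dtype-aware variant that only touches the columns schema inference actually left as strings (column names are illustrative, taken from the example above):

# Only clean the columns that were inferred as str; columns already read as
# Int64 are left alone, so nothing depends on an implicit cast.
numeric_cols = ["A", "B"]
df = df.with_columns(
    pl.col(c).str.replace_all(",", "").cast(pl.Int64)
    for c in numeric_cols
    if df.schema[c] == pl.String
)

That would at least make the dtype filtering explicit rather than an apparent side-effect of pl.col(...).str.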