dbr opened this issue 2 years ago
The inference is not exposed. We could do that. For the time being, you could hack a solution together by writing 100 rows as CSV to an io.BytesIO and then parsing that. It's hacky, but it works.
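Roughly like this (a minimal sketch with a made-up all-string frame):

```python
import io

import polars as pl

# Made-up frame whose string columns actually hold numbers
df = pl.DataFrame({"a": ["1", "2", "3"], "b": ["1.5", "2.5", "3.5"]})

# Round-trip the first 100 rows through CSV so read_csv re-runs its inference
buf = io.BytesIO()
df.head(100).write_csv(buf)
buf.seek(0)
inferred = pl.read_csv(buf)

print(inferred.schema)  # {'a': Int64, 'b': Float64}
df = df.cast(inferred.schema)
```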
One idea is to cast the column to the target type and count the number of failed casts / null values. A cleanly typed column should yield a post-cast null_count equal to the pre-cast null_count.
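For example (a sketch, assuming a hypothetical string column):

```python
import polars as pl

s = pl.Series("column_1", ["1", "2", "x", None])

pre_nulls = s.null_count()
# A non-strict cast turns values that fail to parse into nulls instead of raising
post_nulls = s.cast(pl.Int64, strict=False).null_count()

# No new nulls means every non-null value parsed cleanly as an integer
print(post_nulls == pre_nulls)  # False here, because "x" failed to parse
```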
Hi @ritchie46 / @stinodego - what are your thoughts on accepting this request?
It has come up on SO here, and I am looking to do the same after a pivot:
```python
import polars as pl

# Attributes and values are externally defined, and not known in advance
df_from_db = pl.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "attribute": ["name", "age", "height", "name", "age", "height"],
    "value": ["John", "30", "180", "Mary", "25", "165"],
})
df = df_from_db.pivot(
    index="id",
    columns="attribute",
    values="value",
    aggregate_function=None,
)
print(df)
```
```
shape: (2, 4)
┌─────┬──────┬─────┬────────┐
│ id  ┆ name ┆ age ┆ height │
│ --- ┆ ---  ┆ --- ┆ ---    │
│ i64 ┆ str  ┆ str ┆ str    │
╞═════╪══════╪═════╪════════╡
│ 1   ┆ John ┆ 30  ┆ 180    │
│ 2   ┆ Mary ┆ 25  ┆ 165    │
└─────┴──────┴─────┴────────┘
```
It would be great to expose schema (re)inference at the expression level and at the DataFrame level.
FWIW, the workaround of the CSV roundtrip does work:

```python
inferred_schema = pl.read_csv(df.head(100).write_csv().encode()).schema
df = df.cast(inferred_schema)
```
We could add `Series.infer_dtype` and `DataFrame.infer_schema`/`DataFrame.infer_dtypes`. Happy to review a PR for this that exposes the existing functionality.
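Hypothetical usage, if such an API were added (none of this exists yet; the names are only the ones proposed above):

```python
# Hypothetical API, not implemented in Polars
dtype = df["age"].infer_dtype()   # e.g. Int64
schema = df.infer_schema()        # e.g. {"id": Int64, "name": Utf8, ...}
df = df.cast(schema)
```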
Would love this feature! It's a killer feature in systems that implement it (e.g., Power BI's Power Query transformations), especially for rapid prototyping!
Hi all,
Commenting also in support of this feature. I asked a question a while back on Stack Overflow related to it, trying to figure out a way to infer data types from a very large all-string Parquet file while streaming. If the type inference were exposed to the LazyFrame and DataFrame APIs, that would be amazing.
Also commenting in support of having the inference API exposed for LazyFrames/DataFrames.
If I load data from a CSV file, Polars makes a pretty good attempt at inferring the schema.
I would like a way to re-do this inference/type conversion on an existing DataFrame.
For example, say I've loaded this data from somewhere, but there is something throwing off the inference:
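A minimal sketch with made-up data:

```python
import polars as pl

# Made-up data: the stray "n/a" forces column_1 to be inferred as strings
df = pl.DataFrame({"column_1": ["1", "2", "n/a", "4"]})
print(df.dtypes)  # [Utf8]
```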
I can do some processing to clean up this data:
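For example (continuing the made-up data above):

```python
# Turn the sentinel value into a proper null
df = df.with_columns(
    pl.when(pl.col("column_1") == "n/a")
    .then(None)
    .otherwise(pl.col("column_1"))
    .alias("column_1")
)
```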
The question is then: how can I re-infer a good data type for `column_1` and cast it? It seems like the logic used in the CSV reader is quite specific to that format; is there something similar in the main API to consistently determine whether a string column could be cast to int/float/etc.?
Manually implementing the logic for simple cases (e.g. ints) is easy enough, but it becomes much more tricky and error-prone when considering floats, dates/times, etc.
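A sketch of what that manual approach looks like (my own attempt, not anything from the Polars API):

```python
import polars as pl

def infer_string_dtype(s: pl.Series):
    """Best-effort dtype inference for a string Series (illustrative sketch).

    A candidate dtype wins if a non-strict cast introduces no new nulls,
    i.e. every non-null value parsed cleanly. Dates/times would need a
    similar loop over str.strptime with format guesses, which is where
    this gets tricky and error-prone.
    """
    pre_nulls = s.null_count()
    for dtype in (pl.Int64, pl.Float64):
        if s.cast(dtype, strict=False).null_count() == pre_nulls:
            return dtype
    return pl.Utf8  # give up and leave it as strings

df = pl.DataFrame({"column_1": ["1", "2", None, "4"]})
df = df.with_columns(
    pl.col("column_1").cast(infer_string_dtype(df["column_1"]))
)
print(df.dtypes)  # [Int64]
```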