pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Re-infer schema for a dataframe column #2043

Open dbr opened 2 years ago

dbr commented 2 years ago

If I load data from a CSV file, polars makes a pretty good attempt at inferring the schema.

I would like a way to re-do this inference/type conversion on an existing dataframe.

For example, say I've loaded this data from somewhere, but something is throwing off the inference:

>>> source = polars.DataFrame([[1, 2, 3], ["a", "22", "333"]])
>>> source
shape: (3, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 1        ┆ a        │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2        ┆ 22       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3        ┆ 333      │
└──────────┴──────────┘

I can do some processing to clean up this data:

>>> cleaned = source[1:]
>>> cleaned
shape: (2, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 2        ┆ 22       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3        ┆ 333      │
└──────────┴──────────┘

The question is then: how can I re-infer a good data type for column_1 and cast it?

The logic used in the CSV reader seems quite specific to that format. Is there something similar in the main API to consistently determine whether a string column could be cast to int/float/etc.?

Manually implementing the logic for simple cases (e.g. ints) is easy enough, but it becomes much trickier and more error-prone when considering floats, dates/times, etc.
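For illustration, the simple integer case might look something like this (the regex and names are just for the example, not polars' own inference logic):

>>> s = cleaned["column_1"]
>>> # hand-rolled check: every value is an optional sign followed by digits;
>>> # floats, datetimes, etc. need far more careful rules
>>> if s.str.contains(r"^-?\d+$").all():
...     cleaned = cleaned.with_columns(s.cast(polars.Int64))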

ritchie46 commented 2 years ago

The inference is not exposed. We could do that. For the time being, you could hack a solution together by writing 100 rows as a CSV to an io.BytesIO and then parsing that.

It's hacky, but it works.
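Roughly, assuming df is the already-loaded frame (a sketch only; the 100-row sample size is arbitrary):

import io
import polars as pl

buf = io.BytesIO()
df.head(100).write_csv(buf)          # write a sample as CSV
buf.seek(0)
inferred = pl.read_csv(buf)          # let the CSV reader re-run its inference
df = df.cast(dict(inferred.schema))  # apply the re-inferred dtypes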

jorgecarleitao commented 2 years ago

One idea is to cast the column to the candidate type and count the number of failed casts / null values. A cleaned column should yield a post-cast null_count equal to the pre-cast null_count.
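A minimal sketch of that check using a non-strict cast (the column and candidate dtype are illustrative):

import polars as pl

s = pl.Series("column_1", ["22", "333", None])
candidate = pl.Int64

casted = s.cast(candidate, strict=False)  # values that fail the cast become null
ok = casted.null_count() == s.null_count()
print(ok)  # True: every non-null value casts cleanly to Int64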

henryharbeck commented 8 months ago

Hi @ritchie46 / @stinodego - what are your thoughts on accepting this request?

It has come up on SO here, and I am looking to do the same after a pivot:

# Attributes and values are externally defined, and not known in advance
df_from_db = pl.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "attribute": ["name", "age", "height", "name", "age", "height"],
    "value": ["John", "30", "180", "Mary", "25", "165"],
})

df = df_from_db.pivot(
    index="id",
    columns="attribute",
    values="value",
    aggregate_function=None,
)

print(df)
shape: (2, 4)
┌─────┬──────┬─────┬────────┐
│ id  ┆ name ┆ age ┆ height │
│ --- ┆ ---  ┆ --- ┆ ---    │
│ i64 ┆ str  ┆ str ┆ str    │
╞═════╪══════╪═════╪════════╡
│ 1   ┆ John ┆ 30  ┆ 180    │
│ 2   ┆ Mary ┆ 25  ┆ 165    │
└─────┴──────┴─────┴────────┘

It would be great to expose schema (re)inference at the expression level and at the DataFrame level.

FWIW, the workaround of the CSV roundtrip does work:

# infer a schema from the first 100 rows via a CSV round-trip, then apply it
inferred_schema = pl.read_csv(df.head(100).write_csv().encode()).schema
df = df.cast(inferred_schema)

stinodego commented 8 months ago

We could add Series.infer_dtype and DataFrame.infer_schema/infer_dtypes. Happy to review a PR for this that exposes the existing functionality.
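Hypothetical usage, just to pin down the shape of the API (neither method exists yet; the names are the ones proposed above):

inferred = df.infer_schema()  # proposed: re-run dtype inference on the existing columns
df = df.cast(inferred)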

DeflateAwning commented 4 months ago

Would love this feature! It's a killer feature in systems that implement it (e.g., Power BI's Power Query transformations), especially for rapid prototyping!

Heiaha commented 2 months ago

Hi all,

Commenting also in support of this feature. I asked a question a while back on Stack Overflow related to it, trying to figure out a way to infer data types from a very large all-string Parquet file while streaming. If the type inference were exposed in the LazyFrame and DataFrame APIs, that would be amazing.
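In the meantime, one stopgap is to collect a small sample and reuse the CSV round-trip from above (a sketch; the path is a placeholder):

import polars as pl

lf = pl.scan_parquet("big_all_strings.parquet")

# infer dtypes from a small collected sample, then cast the lazy columns
sample = lf.head(100).collect()
inferred_schema = pl.read_csv(sample.write_csv().encode()).schema
lf = lf.with_columns([pl.col(c).cast(t) for c, t in inferred_schema.items()])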

dgilbert418 commented 2 months ago

Also commenting in support of having the inference API exposed for LazyFrames/DataFrames.