pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Default data type for read_csv #8230

Open MariusMerkleQC opened 1 year ago

MariusMerkleQC commented 1 year ago

Problem description

In pandas, one could set a default value for dtype in the read_csv function. In polars, it is only possible to provide a dictionary mapping from column name to data type or a list of data types with one entry per column.

It would be great to add the default value for dtype to polars. 🚀

alexander-beedie commented 1 year ago

Out of curiosity, are you looking to read everything in as strings? (That is often, though not always, what "set every dtype to the same thing" translates to when loading CSV data.) If so:

# load all data as Utf8
df = pl.read_csv(..., infer_schema_length=0)

If not, what's the use-case/dtype you're interested in?

MariusMerkleQC commented 1 year ago

Yes, that's what I tried to do and how I solved it. I still feel that having a default argument for dtypes would be cleaner. What do you think, @alexander-beedie?

mcrumiller commented 1 year ago

One feature I like about pandas is the ability to use a defaultdict instead of a dict. If I know that I have some columns with metadata, and then a couple hundred columns containing financial data, it's nice to say "use date/str/cat on these columns, and default to f32", as in:

from collections import defaultdict

import polars as pl

dtypes = defaultdict(
    lambda: pl.Float32,  # default dtype for every column not listed below
    {
        'a': pl.Utf8,
        'b': pl.Categorical,
        'c': pl.Date,
    },
)
pl.read_csv(file, dtypes=dtypes)

borchero commented 1 year ago

@alexander-beedie it's probably hard to pass the defaultdict (or the lambda, in particular) to Rust? I would be interested in such a feature, though. Could we define a dedicated polars "datatype mapping"?

KubaSzostak commented 6 months ago

In my case all columns have Float32 type. When I try to read the CSV I get:

ComputeError: could not parse `1.5406785` as dtype `i64` at column 'col443' (column number 443)

To get around this, I have to read the column names first and build a dictionary from them:

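import csv

import polars as pl

# csv_file is the path to the CSV file.
# Read only the header row to collect the column names.
with open(csv_file, newline='') as f:
    column_names = next(csv.reader(f))

# Map every column to Float32 and pass the mapping explicitly.
column_types = {name: pl.Float32 for name in column_names}

df = pl.read_csv(csv_file, dtypes=column_types)
print(df)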

speedy1601 commented 1 month ago

I tried to do this with

df = pl.read_csv("D:\\datasets\\temp.csv", schema_overrides={'Marks': pl.UInt8}, infer_schema_length=10000, ignore_errors=True)

The issue is that values like 87.89 become null and values like 45.0 become 45. I think the better way is to cast after reading the CSV. If you have found a way to do this inside read_csv, please let me know too!

KubaSzostak commented 1 month ago

Hi @speedy1601, I can see you are using pl.UInt8, which is an 8-bit unsigned integer type. Maybe you should change it to pl.Float64 or another floating-point type to make it work?

speedy1601 commented 1 month ago

I wanted the floating-point columns to be read as u8 directly in read_csv(), so pl.Float64 is not going to help here. I therefore concluded that it is better to read the CSV as it is and then, with a single line, cast all numeric columns to u8 or any other desired dtype.

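import polars as pl
import polars.selectors as cs


def extra_info(label):
    # Stand-in for the poster's helper, which is not shown in the thread
    # (assumed behavior): the label followed by a separator line.
    return f" --> {label}\n" + "-" * 120


df = pl.read_csv("D:\\datasets\\temp.csv")
print(df, extra_info("Original df"))

print(df.with_columns(cs.numeric().cast(pl.UInt8)), extra_info("All Numeric Columns casted to u8 dtype"))
print(df.with_columns(cs.numeric().exclude('Age').cast(pl.UInt8)), extra_info("All Numeric Columns except the column 'Age' casted to u8 dtype"))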

Output:

shape: (3, 3)
┌───────┬─────┬───────┐
│ Name  ┆ Age ┆ Marks │
│ ---   ┆ --- ┆ ---   │
│ str   ┆ i64 ┆ f64   │
╞═══════╪═════╪═══════╡
│ Maria ┆ 12  ┆ 89.87 │
│ Aria  ┆ 13  ┆ null  │
│ Saria ┆ 14  ┆ 45.0  │
└───────┴─────┴───────┘  --> Original df
------------------------------------------------------------------------------------------------------------------------

shape: (3, 3)
┌───────┬─────┬───────┐
│ Name  ┆ Age ┆ Marks │
│ ---   ┆ --- ┆ ---   │
│ str   ┆ u8  ┆ u8    │
╞═══════╪═════╪═══════╡
│ Maria ┆ 12  ┆ 89    │
│ Aria  ┆ 13  ┆ null  │
│ Saria ┆ 14  ┆ 45    │
└───────┴─────┴───────┘  --> All Numeric Columns casted to u8 dtype
------------------------------------------------------------------------------------------------------------------------

shape: (3, 3)
┌───────┬─────┬───────┐
│ Name  ┆ Age ┆ Marks │
│ ---   ┆ --- ┆ ---   │
│ str   ┆ i64 ┆ u8    │
╞═══════╪═════╪═══════╡
│ Maria ┆ 12  ┆ 89    │
│ Aria  ┆ 13  ┆ null  │
│ Saria ┆ 14  ┆ 45    │
└───────┴─────┴───────┘  --> All Numeric Columns except the column 'Age' casted to u8 dtype
------------------------------------------------------------------------------------------------------------------------