Open MariusMerkleQC opened 1 year ago
Out of curiosity... are you looking to read everything in as string? (Which is often -though not always- what "set every dtype to the same thing" translates to when loading CSV data). If so:
# load all data as Utf8
df = pl.read_csv(... , infer_schema_length=0 )
If not, what's the use-case/dtype you're interested in?
Yes, that's what I tried to do and how I solved it. I still feel like having a default argument for dtypes would be cleaner. What do you think, @alexander-beedie?
One feature I like about pandas is the ability to use a defaultdict instead of a dict. If I know that I have some columns with metadata, and then a couple hundred columns containing financial data, it's nice to say "use date/str/cat on these columns, and default to f32", as in:
from collections import defaultdict
import polars as pl

dtypes = defaultdict(
    lambda: pl.Float32,  # default value
    {
        'a': pl.Utf8,
        'b': pl.Categorical,
        'c': pl.Date,
    },
)
pl.read_csv(file, dtypes=dtypes)
@alexander-beedie it's probably hard to pass the defaultdict (or the lambda, in particular) to Rust? I would be interested in such a feature though, could we define a dedicated polars "datatype mapping"?
In my case all columns have Float32 type. When I try to read the CSV I get:
ComputeError: could not parse `1.5406785` as dtype `i64` at column 'col443' (column number 443)
To get around this I have to read the column names first and, based on that, create a dictionary:
import polars as pl
import csv

# read only the header row to get the column names
with open(csv_file, 'r') as f:
    reader = csv.reader(f)
    column_names = next(reader)

# map every column to Float32
column_types = {name: pl.datatypes.Float32 for name in column_names}
df = pl.read_csv(csv_file, dtypes=column_types)
print(df)
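The header-reading workaround above can be combined with the "a few overrides plus a default" idea from earlier in the thread. The following is only a minimal sketch: `read_header` and `dtypes_with_default` are hypothetical helper names, not part of polars, and plain strings stand in for polars dtypes so that the snippet runs without polars installed.

```python
import csv
import io
from typing import Any, Dict, List, Mapping, TextIO


def read_header(handle: TextIO) -> List[str]:
    """Return the column names from the first row of an open CSV file."""
    return next(csv.reader(handle))


def dtypes_with_default(
    columns: List[str], overrides: Mapping[str, Any], default: Any
) -> Dict[str, Any]:
    """Map every column to its override, falling back to `default`."""
    return {name: overrides.get(name, default) for name in columns}


# In-memory CSV for illustration; the strings are placeholders for
# polars dtypes such as pl.Utf8 or pl.Float32 (an assumption of this sketch).
sample = io.StringIO("a,b,c,x1,x2\nfoo,bar,2024-01-01,1.5,2.5\n")
columns = read_header(sample)
schema = dtypes_with_default(
    columns,
    {"a": "Utf8", "b": "Categorical", "c": "Date"},
    default="Float32",
)
print(schema)
```

With real polars dtypes as the values, the resulting plain dict could be passed to read_csv in the same way as column_types above (via dtypes or schema_overrides, depending on the polars version). Override keys that do not appear in the header are silently ignored by this sketch.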
I tried to do this with
df = pl.read_csv("D:\\datasets\\temp.csv", schema_overrides={'Marks': pl.UInt8}, infer_schema_length=10000, ignore_errors=True)
The issue is that values like 87.89 become null and values like 45.0 become 45. I think the better way is to cast the column after reading the CSV. If you have found how to do this inside read_csv, please let me know too!
Hi @speedy1601, I can see you are using pl.UInt8, which is an 8-bit unsigned integer type. Maybe you should change it to pl.Float64 or another floating-point type to make it work?
I wanted the floating-point columns to be read as the u8 datatype in read_csv(), so pl.Float64 is not going to help here. Hence I concluded that it is better to read the CSV as it is; with a single line you can then convert all numeric columns to u8 or your desired dtype.
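import polars as pl
import polars.selectors as cs

# note: extra_info is the poster's own helper for labelling output; its definition is not shown in the thread
df = pl.read_csv("D:\\datasets\\temp.csv")
print(df, extra_info("Original df"))
# cast every numeric column to u8
print(df.with_columns(cs.numeric().cast(pl.UInt8)), extra_info("All Numeric Columns casted to u8 dtype"))
# cast every numeric column except 'Age' to u8
print(df.with_columns(cs.numeric().exclude('Age').cast(pl.UInt8)), extra_info("All Numeric Columns except the column 'Age' casted to u8 dtype"))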
shape: (3, 3)
┌───────┬─────┬───────┐
│ Name  ┆ Age ┆ Marks │
│ ---   ┆ --- ┆ ---   │
│ str   ┆ i64 ┆ f64   │
╞═══════╪═════╪═══════╡
│ Maria ┆ 12  ┆ 89.87 │
│ Aria  ┆ 13  ┆ null  │
│ Saria ┆ 14  ┆ 45.0  │
└───────┴─────┴───────┘ --> Original df
------------------------------------------------------------------------------------------------------------------------
shape: (3, 3)
┌───────┬─────┬───────┐
│ Name  ┆ Age ┆ Marks │
│ ---   ┆ --- ┆ ---   │
│ str   ┆ u8  ┆ u8    │
╞═══════╪═════╪═══════╡
│ Maria ┆ 12  ┆ 89    │
│ Aria  ┆ 13  ┆ null  │
│ Saria ┆ 14  ┆ 45    │
└───────┴─────┴───────┘ --> All Numeric Columns casted to u8 dtype
------------------------------------------------------------------------------------------------------------------------
shape: (3, 3)
┌───────┬─────┬───────┐
│ Name  ┆ Age ┆ Marks │
│ ---   ┆ --- ┆ ---   │
│ str   ┆ i64 ┆ u8    │
╞═══════╪═════╪═══════╡
│ Maria ┆ 12  ┆ 89    │
│ Aria  ┆ 13  ┆ null  │
│ Saria ┆ 14  ┆ 45    │
└───────┴─────┴───────┘ --> All Numeric Columns except the column 'Age' casted to u8 dtype
------------------------------------------------------------------------------------------------------------------------
Problem description
In pandas, one could set a default value for dtype in the read_csv function. In polars, it is only possible to provide a dictionary mapping from column name to data type or a list of data types with one entry per column. It would be great to add the default value for dtype to polars. 🚀