Open mcrumiller opened 6 months ago
@Wainberg you might want to add this to your read_csv list.
I think perhaps a re-hashing of the argument logic in read_csv, in the python function that processes all the arguments, is in order. There are a lot of arguments and it's easy for them to interfere with each other.
It is also unclear, when new_columns is supplied, whether the keys of, say, the dtypes parameter should reference the new or the old columns, since they can overlap. I would suggest that if new_columns is present, then all other parameters should use the new_columns names instead (since the usual reason for using new_columns is to avoid really ugly names).
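To make the suggestion concrete, here is a minimal sketch of resolving dtypes keys against the final (renamed) column names. The helper name dtypes_by_position is hypothetical and not part of polars. It assumes new_columns renames columns from the left and that any remaining columns keep their original names. Keys that match no final name are skipped, in line with the expected behavior described in this issue.

```python
def dtypes_by_position(header, new_columns, dtypes):
    """Hypothetical helper: map dtype overrides to column positions,
    looking keys up in the final column names rather than the original ones."""
    # Assumption: new_columns renames the leftmost columns; the rest keep
    # their original header names.
    final_names = list(new_columns) + list(header[len(new_columns):])
    positions = {name: i for i, name in enumerate(final_names)}
    # Keys that do not match any final column name are ignored.
    return {
        positions[name]: dtype
        for name, dtype in dtypes.items()
        if name in positions
    }


# Dtypes are plain strings here so the sketch runs without polars installed.
resolved = dtypes_by_position(
    header=["ugly col 1", "ugly col 2", "c"],
    new_columns=["a", "b"],
    dtypes={"a": "Categorical", "c": "UInt8", "ugly col 1": "Int64"},
)
```

With this rule, "ugly col 1" no longer refers to anything once it has been renamed to "a", so there is no ambiguity when old and new names overlap.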
Just added it :)
I can give this a try if needed, I had posted it on the discord
Go for it! We could really use all hands on deck fixing these CSV issues.
the original code has typos
should be
pl.read_csv(csv, dtypes={"a": pl.Categorical, "b": pl.UInt8})
I've been doing some debugging, and the issue comes up when PyDataFrame.read_csv() is called (https://github.com/pola-rs/polars/blob/740e740d9ce3678ea061d5cb4c2bc94892838383/py-polars/polars/dataframe/frame.py#L748), which I think is a rust method? I was trying to think of a way of handling this in Python, as touching the rust reader seems risky. Basically, polars doesn't know anything about the headers until going through rust? Otherwise I think it would be possible to remove the entries from the dict that don't have a matching column.
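For illustration, a Python-side workaround along those lines might look like the sketch below. It is not how polars handles this internally: prune_dtypes is a hypothetical helper that peeks at the header row with the standard csv module and drops dtypes entries with no matching column. It assumes the CSV is available as a string and that the first row is the header.

```python
import csv
import io


def prune_dtypes(csv_text, dtypes, separator=","):
    """Hypothetical helper: keep only dtype overrides whose key is an
    actual column name in the CSV header."""
    # Parse just the first row; an empty input yields an empty header.
    header = next(csv.reader(io.StringIO(csv_text), delimiter=separator), [])
    return {name: dtype for name, dtype in dtypes.items() if name in header}


csv_text = "a,b\n1,2\n"
# Dtypes are plain strings here so the sketch runs without polars installed.
pruned = prune_dtypes(csv_text, {"a": "Categorical", "b": "UInt8", "c": "Float64"})
```

The pruned dict could then be passed as the dtypes argument to read_csv, with real polars dtypes in place of the strings. The cost is that the header is parsed twice, once in Python and once in rust.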
Hi @romarowski,
Polars is primarily implemented in rust. What you see on the Python side is essentially an API that acts as a shell around the core rust implementation. Polars is primarily used in Python, so there are sometimes "python-specific" features of polars that we implement on the python side (dictionaries, contexts, etc.), but in general rust is orders of magnitude faster in every way. Plus, there is also a rust API, so rust users need to be able to access the same features.
What you're seeing is that read_csv in Python calls read_csv in rust, and that's where the heavy lifting is done. So yes, polars doesn't know anything about the headers until going through rust. If it helps, read_csv has quite a few issues and is on some of the developers' radars as needing some TLC, so hopefully this and a few other issues may be resolved in the near future.
Checks
Reproducible example
Log output
Issue description
When dtypes is supplied to either read_csv or scan_csv, if the number of keys in the dictionary is greater than or equal to the width of the frame, then the column names and dtypes are overwritten (including erroring if the conversion is invalid). If the number of keys is less than the width of the frame, nothing happens.
Expected behavior
Nothing should happen in both cases: the keys of the dtypes parameter should specify dtypes for existing columns only.
Installed versions