pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`read_csv`/`scan_csv` overwrites column names when `len(dtypes.keys()) >= len(df.columns)` #13574

Open mcrumiller opened 6 months ago

mcrumiller commented 6 months ago

Reproducible example

from io import StringIO
import polars as pl

csv = StringIO(
    "coll,col2\n"
    "a,1\n"
    "b,2\n"
)

# nothing happens, since `a` is not a column
print("one key:\n",
    pl.read_csv(csv, dtypes={"a": pl.Categorical})
)

# columns are renamed to `a` and `b` and dtype conversion occurs
print("\ntwo keys:\n",
    pl.read_csv(csv, dtypes={"a": pl.Categorical, "b": pl.UInt8})
)

Log output

one key:
 shape: (2, 2)
┌──────┬──────┐
│ coll ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ a    ┆ 1    │
│ b    ┆ 2    │
└──────┴──────┘

two keys:
 shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ cat ┆ u8  │
╞═════╪═════╡
│ a   ┆ 1   │
│ b   ┆ 2   │
└─────┴─────┘

Issue description

When dtypes is supplied to either read_csv or scan_csv, if the number of keys in the dictionary is greater than or equal to the width of the frame, then the column names and dtypes are overwritten (including erroring if the conversion is invalid). If the number of keys is less than the width of the frame, nothing happens.

Expected behavior

Nothing should happen in both cases: the keys of the dtypes parameter should specify dtypes for existing columns only.
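To make the difference concrete, here is a toy model of the two behaviors in plain Python. It is only a sketch of what the report describes, not the actual polars code: `observed_schema` and `expected_schema` are hypothetical names, strings stand in for polars dtypes, and the handling of extra keys beyond the frame width is an assumption, since the report does not show that case.

```python
def observed_schema(inferred: dict[str, str], dtypes: dict[str, str]) -> dict[str, str]:
    """Toy model of the reported behavior (not the real polars implementation)."""
    if len(dtypes) >= len(inferred):
        # Reported: column names and dtypes are taken from `dtypes` by position.
        # Assumption: keys beyond the frame width are simply ignored.
        return dict(list(dtypes.items())[: len(inferred)])
    # Reported: with fewer keys than columns, nothing happens.
    return dict(inferred)


def expected_schema(inferred: dict[str, str], dtypes: dict[str, str]) -> dict[str, str]:
    """Expected: overrides apply by name, and only to columns that exist."""
    return {name: dtypes.get(name, dtype) for name, dtype in inferred.items()}


inferred = {"coll": "str", "col2": "i64"}
one_key = {"a": "cat"}
two_keys = {"a": "cat", "b": "u8"}

print(observed_schema(inferred, one_key))   # {'coll': 'str', 'col2': 'i64'}
print(observed_schema(inferred, two_keys))  # {'a': 'cat', 'b': 'u8'}
print(expected_schema(inferred, two_keys))  # {'coll': 'str', 'col2': 'i64'}
```

The first two results mirror the log output above; the third is what the report asks for.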

Installed versions

```
--------Version info---------
Polars: 0.20.3
Index type: UInt32
Platform: Windows-10-10.0.22621-SP0
Python: 3.11.4 (tags/v3.11.4:d2340ef, Jun 7 2023, 05:45:37) [MSC v.1934 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle: 3.0.0
connectorx: 0.3.2
deltalake: 0.15.1
fsspec: 2023.12.2
gevent: 23.9.1
hvplot: 0.9.1
matplotlib: 3.8.2
numpy: 1.26.3
openpyxl: 3.1.2
pandas: 2.1.4
pyarrow: 14.0.2
pydantic: 2.5.3
pyiceberg: 0.5.1
pyxlsb: 1.0.10
sqlalchemy: 2.0.25
xlsx2csv: 0.8.1
xlsxwriter: 3.1.9
```

mcrumiller commented 6 months ago

@Wainberg you might want to add this to your read_csv list.

I think a rework of the argument logic in the Python `read_csv` function, which processes all the arguments, is in order. There are a lot of them, and it's easy for the arguments to interfere with each other.

It is also unclear if new_columns is supplied whether the keys of, say, the dtypes parameter should reference the new or old columns, since they can overlap. I would suggest that if new_columns is present, then all other parameters should use the new_columns names instead (since using new_columns is usually to avoid using really ugly names).

Wainberg commented 6 months ago

Just added it :)

romarowski commented 6 months ago

I can give this a try if needed, I had posted it on the discord

Wainberg commented 6 months ago

Go for it! We could really use all hands on deck fixing these CSV issues.

l1t1 commented 6 months ago

The original code has typos; it should be `pl.read_csv(csv, dtypes={"a": pl.Categorical, "b": pl.UInt8})`.

romarowski commented 4 months ago

I've been doing some debugging, and the issue arises when https://github.com/pola-rs/polars/blob/740e740d9ce3678ea061d5cb4c2bc94892838383/py-polars/polars/dataframe/frame.py#L748 PyDataFrame.read_csv() is called, which I think is a Rust method? I was trying to think of a way to handle this in Python, as touching the Rust reader seems risky. Basically, polars doesn't know anything about the headers until going through Rust? Otherwise I think it would be possible to remove the entries from the dict that don't have a matching column.
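For illustration, the dict-filtering idea could look roughly like this on the Python side. It is a minimal sketch, not polars code: it peeks at the header with the standard-library `csv` module, assumes the first row is the header and the input is an in-memory string, and uses hypothetical helper names. Strings stand in for polars dtypes so it runs without polars installed.

```python
import csv
from io import StringIO


def header_names(csv_text: str, separator: str = ",") -> list[str]:
    """Return the column names from the first row of a CSV string."""
    reader = csv.reader(StringIO(csv_text), delimiter=separator)
    return next(reader, [])


def keep_known_dtypes(csv_text: str, dtypes: dict[str, str], separator: str = ",") -> dict[str, str]:
    """Drop dtype overrides whose key is not a column in the header."""
    known = set(header_names(csv_text, separator))
    return {name: dtype for name, dtype in dtypes.items() if name in known}


csv_text = "coll,col2\na,1\nb,2\n"

filtered = keep_known_dtypes(csv_text, {"a": "Categorical", "col2": "UInt8"})
print(filtered)  # {'col2': 'UInt8'}
```

The filtered dict could then be passed as `dtypes` to `pl.read_csv`. The cost is that the header is parsed twice, and options such as `has_header=False` or `skip_rows` would need the same handling in the peek step.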

mcrumiller commented 4 months ago

Hi @romarowski,

Polars is primarily implemented in Rust. What you see on the Python side is essentially an API that acts as a shell around the core Rust implementation. Polars is primarily used from Python, so there are sometimes "Python-specific" features of polars that we implement on the Python side (dictionaries, contexts, etc.), but in general Rust is orders of magnitude faster. Plus, there is also a Rust API, so Rust users need to be able to access the same features.

What you're seeing is that read_csv in Python calls read_csv in Rust, and that's where the heavy lifting is done. So yes, polars doesn't know anything about the headers until going through Rust. If it helps, read_csv has quite a few open issues, which are on some of the developers' radars as needing some TLC, so hopefully this and a few other issues will be resolved in the near future.