pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Schema inference fails with multiple csv files. #14326

Open · Heiaha opened this issue 9 months ago

Heiaha commented 9 months ago

Reproducible example

Take two csv files, a.csv and b.csv:

a.csv like:

col1,col2
1,2

b.csv like:

col1,col2
text1,text2

Attempt to read them in:

df = pl.scan_csv(["a.csv", "b.csv"], infer_schema_length=None).collect()

Log output

>>> pl.scan_csv(["a.csv", "b.csv"], infer_schema_length=None).collect()
UNION: `parallel=false` union is run sequentially
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
file < 128 rows, no statistics determined
no. of chunks: 1 processed by: 1 threads.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ec2-user/anaconda3/envs/JupyterSystemEnv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1940, in collect
    return wrap_df(ldf.collect())
polars.exceptions.SchemaError: cannot extend/append Int64 with String

Issue description

Apologies if this has already been reported and I failed to find the relevant issue. When reading multiple csv files, whether via a glob pattern or a list of file names, Polars fails to infer the schema if a later csv file contains types that conflict with those inferred from the first file. The same error occurs even with ignore_errors=True.

Expected behavior

Polars would traverse multiple files to fulfill the desired infer_schema_length if the first file is not long enough, looking at all lines in all files when infer_schema_length=None.
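Until something like this exists, cross-file inference can be emulated on the user side (a sketch of my own, not current Polars behavior; it reads every file twice and assumes scan_csv's dtypes parameter, renamed schema_overrides in later versions):

```
import polars as pl

files = ["a.csv", "b.csv"]

# First pass: let the relaxed concat resolve conflicting column types to
# their common supertype (here Int64 + String -> String).
inferred = pl.concat(
    [pl.scan_csv(f, infer_schema_length=None) for f in files],
    how="vertical_relaxed",
).collect().schema

# Second pass: re-scan all files with that schema pinned, so per-file
# inference can no longer conflict.
df = pl.scan_csv(files, dtypes=dict(inferred)).collect()
```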

Installed versions

```
--------Version info---------
Polars:              0.20.7
Index type:          UInt32
Platform:            Linux-5.10.178-162.673.amzn2.x86_64-x86_64-with-glibc2.26
Python:              3.10.8 | packaged by conda-forge | (main, Nov 22 2022, 08:23:14) [GCC 10.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:         2.2.1
connectorx:
deltalake:
fsspec:              2023.4.0
gevent:
hvplot:
matplotlib:
numpy:               1.25.0
openpyxl:
pandas:              2.0.2
pyarrow:             15.0.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
deanm0000 commented 9 months ago

@stinodego I'm assuming this is not planned, but I'll defer closing.

Instead do:

df = pl.concat(
    [pl.scan_csv(x, infer_schema_length=None) for x in ["a.csv", "b.csv"]],
    how="vertical_relaxed",
).collect()
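
With the example files above, vertical_relaxed resolves each column to the common supertype of Int64 and String, so df.schema should report both columns as String and the append no longer fails.
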
Heiaha commented 9 months ago

Thanks for the suggestion. I will say that even if this behavior is intended, it is definitely confusing when you encounter it. If a fix is not planned, it may be worth adding a note to the documentation.

Usernamelesss commented 8 months ago

Hi, I found a quite similar issue when trying to concatenate several Polars DataFrames loaded from different files (pickles, in my case). Apologies if this requires a dedicated issue, but it seems closely related to this one. I reduced my test case to this MRE:

import pandas as pd
import polars as pl

df1 = pd.DataFrame({"A": [1, 2, 3]})
df2 = pd.DataFrame({"A": [None, None, None]})

df1, df2 = pl.from_pandas(df1), pl.from_pandas(df2)

pl.concat([df1, df2])

This raises the exception polars.exceptions.SchemaError: cannot extend/append Int64 with String. If I switch to vertical_relaxed, I think the output is wrong, because the Series becomes type str:

┌──────┐
│ A    │
│ ---  │
│ str  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
│ null │
│ null │
│ null │
└──────┘

Another interesting thing is that this happens only if I go through pandas: the same two dataframes created directly with Polars output the correct series:

import polars as pl

df1 = pl.DataFrame({"A": [1, 2, 3]})
df2 = pl.DataFrame({"A": [None, None, None]})
pl.concat([df1, df2])
┌──────┐
│ A    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
│ 2    │
│ 3    │
│ null │
│ null │
│ null │
└──────┘
cmdlineluser commented 8 months ago

@Usernamelesss You could probably open a separate issue for that.

For some reason .from_pandas is choosing str for an all-null column instead of null:

pl.from_pandas(pd.DataFrame({"A": [None]}))
# shape: (1, 1)
# ┌──────┐
# │ A    │
# │ ---  │
# │ str  │ # <- str ???
# ╞══════╡
# │ null │
# └──────┘
pl.DataFrame({"A": [None]})
# shape: (1, 1)
# ┌──────┐
# │ A    │
# │ ---  │
# │ null │
# ╞══════╡
# │ null │
# └──────┘
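
Until that is sorted out, a possible workaround (a sketch assuming from_pandas's schema_overrides parameter) is to declare the all-null column's intended type at conversion time, so the concat sees two Int64 columns:

```
import pandas as pd
import polars as pl

df1 = pl.from_pandas(pd.DataFrame({"A": [1, 2, 3]}))

# Override the inferred dtype for the all-null column.
df2 = pl.from_pandas(
    pd.DataFrame({"A": [None, None, None]}),
    schema_overrides={"A": pl.Int64},
)

print(pl.concat([df1, df2]))  # "A" stays Int64 and the nulls are preserved
```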
ignacio-ireta commented 6 months ago

I have a very similar issue, but I cannot even tell (at least from the error message) which column(s) or row(s) are causing the error, since each file has more than 1M rows and at least 50 columns. Is there a way to get more information out of it?
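
One way to narrow this down might be to diff each file's independently inferred schema against the first file's (a sketch of my own, not a built-in Polars feature):

```
import polars as pl

# Infer each file's schema on its own and report any column whose dtype
# disagrees with the first file, to locate the source of the SchemaError.
files = ["file1.csv", "file2.csv"]  # the actual file list goes here
base = pl.scan_csv(files[0], infer_schema_length=None).schema

for f in files[1:]:
    schema = pl.scan_csv(f, infer_schema_length=None).schema
    for col, dtype in schema.items():
        if base.get(col) != dtype:
            print(f"{f}: {col!r} inferred as {dtype}, first file has {base.get(col)}")
```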