pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

read_excel schema issues #18612

Open scott9380 opened 1 month ago

scott9380 commented 1 month ago

Checks

Reproducible example

        df = pl.read_excel(
            file_name,
            read_options={
                "use_columns": [2, 3, 4, 5],
                "header_row": None,
                "skip_rows": 15,
                "column_names": ["is_int", "is_int", "is_float", "is_float"],
                "schema_sample_rows": 0,
                "dtypes": {2: "int", 3: "float", 4: "int", 5: "float"},
                "dtype_coercion": "strict",
            },
        )

Log output

No response

Issue description

I cannot tell if these are fastexcel issues or polars issues.

Using the following read_options for pl.read_excel:

- use_columns = some subset of the columns (the first two columns in the sheet are empty)
- header_row = None (the header spans two rows and fastexcel cannot handle that)
- skip_rows = 15

When the dtype is specified as integer but the data is a float, the value is silently truncated. This is true both for regular decimals (1.5) and for Excel "integers" that are actually stored as floats (1.999999999999999999).

When schema_sample_rows=0 and dtype_coercion='strict', I still get errors for columns that were not specified in use_columns. In other words, it seems to infer dtypes and coerce unnecessary columns anyway.

When I specify float in dtypes, it seems to ignore that and infers ints, even when schema_sample_rows = 0. This makes polars challenging to use here, since by design it does not concat ints and floats.

Expected behavior

- Columns other than those in use_columns should not raise coercion errors.
- If dtype float is specified, the dtype should be float.
- Floats should not be silently truncated, and Excel "integers-as-floats" should be inferred and/or loaded as integers; alternatively, silently rounding would be better than silently truncating.
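For illustration, the difference between truncating and rounding on Excel-style "integers stored as floats" can be sketched in plain Python (the values here are made-up examples, not taken from the attached sheet):

```python
# Excel stores all numbers as IEEE-754 doubles, so an "integer" cell may
# come back as a value like 1.9999999999999998 instead of 2.0.
values = [1.5, 1.9999999999999998, 3.0]

# Truncation (the behavior described in the report): drops the fractional part.
truncated = [int(v) for v in values]  # int() truncates toward zero

# Rounding (the suggested alternative): maps near-integers to the intended value.
rounded = [round(v) for v in values]

print(truncated)  # [1, 1, 3]
print(rounded)    # [2, 2, 3]
```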

Installed versions

```
--------Version info---------
Polars:              1.6.0
Index type:          UInt32
Platform:            Windows-10-10.0.19045-SP0
Python:              3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:28:07) [MSC v.1916 64 bit (AMD64)]
----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel            0.11.6
fsspec
gevent
great_tables
matplotlib
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.2
pandas               2.2.2
pyarrow              17.0.0
pydantic
pyiceberg
sqlalchemy           2.0.30
torch
xlsx2csv
xlsxwriter
```
mcrumiller commented 1 month ago

Can you produce an example Excel sheet via write_excel or possibly link to a faulty sheet?

scott9380 commented 1 month ago

Here you go

Book7.xlsx

import polars as pl

df = pl.read_excel(
    "Book7.xlsx",
    read_options={
        "use_columns": [0, 1, 2, 3],
        "dtypes": {0: "int", 1: "float", 2: "int", 3: "float"},
        "dtype_coercion": "strict",  ## UnsupportedColumnTypeCombinationError -> 0: could not determine dtype for column b
    },
)

df
# shape: (1, 4)
# ┌─────┬─────┬─────┬─────┐
# │ x   ┆ y   ┆ z   ┆ a   │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ f64 │
# ╞═════╪═════╪═════╪═════╡
# │ 1   ┆ 1   ┆ 2   ┆ 3.0 │
# └─────┴─────┴─────┴─────┘

cmdlineluser commented 1 month ago

Just some notes which may help debugging:

You can call fastexcel directly to check whether it is the source of the errors.

(fastexcel.read_excel("Downloads/Book7.xlsx") 
  .load_sheet(0, use_columns=[0, 1, 2, 3], dtypes={0:"int",1:"float",2:"int",3:"float"}, dtype_coercion="strict")
  .to_polars()
)
# UnsupportedColumnTypeCombinationError: unsupported column type combination: type coercion is strict and column contains {String, Float}
# Context:
#    0: could not determine dtype for column b

It is quite odd - if we rewrite your attached file, the error cannot be reproduced.

>>> pl.read_excel("Downloads/Book7.xlsx")
# shape: (2, 5)
# ┌──────┬──────┬──────┬──────┬───────┐
# │ x    ┆ y    ┆ z    ┆ a    ┆ b     │
# │ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---   │
# │ i64  ┆ i64  ┆ f64  ┆ f64  ┆ str   │
# ╞══════╪══════╪══════╪══════╪═══════╡
# │ 1    ┆ 1    ┆ 3.0  ┆ 3.0  ┆ 4.9   │
# │ null ┆ null ┆ null ┆ null ┆ tacos │
# └──────┴──────┴──────┴──────┴───────┘
pl.read_excel("Downloads/Book7.xlsx").write_excel("18612.xlsx")

fastexcel reads OK:

(fastexcel.read_excel("18612.xlsx") 
  .load_sheet(0, use_columns=[0, 1, 2, 3], dtypes={0:"int",1:"float",2:"int",3:"float"}, dtype_coercion="strict")
  .to_polars()
)
# shape: (2, 4)
# ┌──────┬──────┬──────┬──────┐
# │ x    ┆ y    ┆ z    ┆ a    │
# │ ---  ┆ ---  ┆ ---  ┆ ---  │
# │ i64  ┆ f64  ┆ i64  ┆ f64  │
# ╞══════╪══════╪══════╪══════╡
# │ 1    ┆ 1.0  ┆ 3    ┆ 3.0  │
# │ null ┆ null ┆ null ┆ null │
# └──────┴──────┴──────┴──────┘

Polars reads OK, but produces different results (no floats, and the null row is skipped):

pl.read_excel("18612.xlsx", read_options={
    "use_columns": [0, 1, 2, 3],
    "dtypes": {0:"int",1:"float",2:"int",3:"float"},
    "dtype_coercion": "strict" 
})
# shape: (1, 4)
# ┌─────┬─────┬─────┬─────┐
# │ x   ┆ y   ┆ z   ┆ a   │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╡
# │ 1   ┆ 1   ┆ 3   ┆ 3   │
# └─────┴─────┴─────┴─────┘
adrivn commented 1 week ago

I'm having the same sort of issue.

From an Excel file with 2 columns (id: int, label: string) and 150k rows, where every row has a value for the first column but only ~3k have values for the second, polars does not seem to recognize the second column's values, casting the entire column as pl.Null.

I attach the sample file and the results of the test:

file = "C:/Users/adrivn/Downloads/test_polars_inferring_dtypes.xlsx"
df_inferring_dtypes = pl.read_excel(file)
df_inferring_dtypes.describe()
#shape: (9, 3)
#┌────────────┬──────────────┬──────────┐
#│ statistic  ┆ id           ┆ label    │
#│ ---        ┆ ---          ┆ ---      │
#│ str        ┆ f64          ┆ f64      │
#╞════════════╪══════════════╪══════════╡
#│ count      ┆ 140049.0     ┆ 0.0      │
#│ null_count ┆ 0.0          ┆ 140049.0 │
#│ mean       ┆ 79976.009832 ┆ null     │
#│ std        ┆ 40428.7914   ┆ null     │
#│ min        ┆ 9953.0       ┆ null     │
#│ 25%        ┆ 44964.0      ┆ null     │
#│ 50%        ┆ 79976.0      ┆ null     │
#│ 75%        ┆ 114988.0     ┆ null     │
#│ max        ┆ 150000.0     ┆ null     │
#└────────────┴──────────────┴──────────┘

As you can see, the file is read in its entirety, but schema inference apparently stops after a certain point, even when forcing polars to use a high number of sample rows (millions) via the schema_sample_rows option. test_polars_inferring_dtypes.xlsx
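A toy simulation of window-limited inference (a hypothetical infer_dtype helper, not fastexcel's actual code) shows how a sparse column whose first non-null values appear late in the file can end up inferred as null:

```python
def infer_dtype(column, sample_rows):
    """Toy schema inference: look only at the first `sample_rows` values
    (hypothetical stand-in for an engine's inference window)."""
    seen = {type(v).__name__ for v in column[:sample_rows] if v is not None}
    if not seen:
        return "null"  # nothing non-null in the window -> column comes back all-null
    if seen <= {"int", "float"}:
        return "float" if "float" in seen else "int"
    return "string"

n_rows = 150_000
ids = list(range(1, n_rows + 1))
# label column: values only in the last ~3k rows, nulls elsewhere
labels = [None] * (n_rows - 3_000) + ["x"] * 3_000

print(infer_dtype(ids, 50_000))     # int
print(infer_dtype(labels, 50_000))  # null  (window ends before any value appears)
print(infer_dtype(labels, n_rows))  # string (a full scan would get it right)
```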

D3SL commented 3 days ago

Fastexcel uses somewhere around the first third of the file for inference and silently defaults to null if it can't determine a column's data type. They're implementing a fix where such columns will default to string rather than being silently deleted, but it will only land in the version after the next one.

In the meantime, Polars should absolutely do something, even if it's just some kind of notification, because anyone using any fastexcel-based version (until that fix is implemented on their end) risks unpredictable and completely silent data loss.
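As a sketch of the kind of notification suggested here (a hypothetical post-load helper, not an existing polars API), one could warn when a requested column comes back entirely null, which is the symptom of the silent inference failure described above:

```python
import warnings

def warn_on_all_null(columns):
    """Hypothetical post-load check: `columns` maps column name -> list of values.
    Returns (and warns about) any non-empty column that is entirely null."""
    suspicious = [
        name for name, values in columns.items()
        if values and all(v is None for v in values)
    ]
    for name in suspicious:
        warnings.warn(
            f"column {name!r} was read as all-null; "
            "schema inference may have silently failed"
        )
    return suspicious

cols = {"id": [1, 2, 3], "label": [None, None, None]}
print(warn_on_all_null(cols))  # ['label']
```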