Open scott9380 opened 1 month ago
Can you produce an example Excel sheet via write_excel
or possibly link to a faulty sheet?
Here you go
import polars as pl df = pl.read_excel("Book7.xlsx" ,read_options={ "use_columns":[0,1,2,3] ,"dtypes":{0:"int",1:"float",2:"int",3:"float"}
})
df
shape: (1, 4) x y z a i64 i64 i64 f64
1 1 2 3.0
Just some notes which may help debugging:
You can test fastexcel directly to test if it is causing the errors.
(fastexcel.read_excel("Downloads/Book7.xlsx")
.load_sheet(0, use_columns=[0, 1, 2, 3], dtypes={0:"int",1:"float",2:"int",3:"float"}, dtype_coercion="strict")
.to_polars()
)
# UnsupportedColumnTypeCombinationError: unsupported column type combination: type coercion is strict and column contains {String, Float}
# Context:
# 0: could not determine dtype for column b
It is quite odd - if we rewrite your attached file, the error cannot be reproduced.
>>> pl.read_excel("Downloads/Book7.xlsx")
# shape: (2, 5)
# ┌──────┬──────┬──────┬──────┬───────┐
# │ x ┆ y ┆ z ┆ a ┆ b │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ f64 ┆ f64 ┆ str │
# ╞══════╪══════╪══════╪══════╪═══════╡
# │ 1 ┆ 1 ┆ 3.0 ┆ 3.0 ┆ 4.9 │
# │ null ┆ null ┆ null ┆ null ┆ tacos │
# └──────┴──────┴──────┴──────┴───────┘
pl.read_excel("Downloads/Book7.xlsx").write_excel("18612.xlsx")
fastexcel reads OK:
(fastexcel.read_excel("18612.xlsx")
.load_sheet(0, use_columns=[0, 1, 2, 3], dtypes={0:"int",1:"float",2:"int",3:"float"}, dtype_coercion="strict")
.to_polars()
)
# shape: (2, 4)
# ┌──────┬──────┬──────┬──────┐
# │ x ┆ y ┆ z ┆ a │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ f64 ┆ i64 ┆ f64 │
# ╞══════╪══════╪══════╪══════╡
# │ 1 ┆ 1.0 ┆ 3 ┆ 3.0 │
# │ null ┆ null ┆ null ┆ null │
# └──────┴──────┴──────┴──────┘
Polars reads OK - but produces different results: (no floats, skips "null" row)
pl.read_excel("18612.xlsx", read_options={
"use_columns": [0, 1, 2, 3],
"dtypes": {0:"int",1:"float",2:"int",3:"float"},
"dtype_coercion": "strict"
})
# shape: (1, 4)
# ┌─────┬─────┬─────┬─────┐
# │ x ┆ y ┆ z ┆ a │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╡
# │ 1 ┆ 1 ┆ 3 ┆ 3 │
# └─────┴─────┴─────┴─────┘
Having the same sort of issues.
From an Excel file with 2 columns (id: int, label: string), 150k rows out of which all have values for the first column and only ~3k have values for the second, it does not seem to recognize them, casting the entire column as pl.Null
.
I attach the sample file and the results of the test:
file = "C:/Users/adrivn/Downloads/test_polars_inferring_dtypes.xlsx"
df_inferring_dtypes = pl.read_excel(file)
df_inferring_dtypes .describe()
#shape: (9, 3)
#┌────────────┬──────────────┬──────────┐
#│ statistic ┆ id ┆ label │
#│ --- ┆ --- ┆ --- │
#│ str ┆ f64 ┆ f64 │
#╞════════════╪══════════════╪══════════╡
#│ count ┆ 140049.0 ┆ 0.0 │
#│ null_count ┆ 0.0 ┆ 140049.0 │
#│ mean ┆ 79976.009832 ┆ null │
#│ std ┆ 40428.7914 ┆ null │
#│ min ┆ 9953.0 ┆ null │
#│ 25% ┆ 44964.0 ┆ null │
#│ 50% ┆ 79976.0 ┆ null │
#│ 75% ┆ 114988.0 ┆ null │
#│ max ┆ 150000.0 ┆ null │
#└────────────┴──────────────┴──────────┘
As you can see, the file is read on its entirety but the schema inferring apparently stops after a certain point, even when coercing polars to use a high (millions) number of samples via the schema_sample_rows
option.
test_polars_inferring_dtypes.xlsx
Fastexcel uses somewhere around the first third of the file for inference and silently defaults to null if it can't figure out what data type a column is. They're implementing a fix for this where columns will default to string rather than be silently deleted but it will only be in the version after this next one.
In the meantime Polars should absolutely do something, even if it's just some kind of notification, because anyone using any fastexcel based version (until that fix is implemented at their end) will risk unpredictable and completely silent data loss.
Checks
Reproducible example
Log output
No response
Issue description
I cannot tell if these are fastexcel issues or polars issues.
Using the following read_options for pl.read_excel: use_columns = some subset of the columns, first two columns in the sheet are empty header_row = None (header is on two rows and fastexcel cannot handle that) skip_rows = 15
When dtype is specified as integer, but data is a float, the value is truncated. This is true for regular decimals (1.5) and Excel "integers" that are floats (1.999999999999999999)
When schema_sample_rows=0 and dtype_coercion='strict', I am still getting errors for columns that were not specified in use_columns. In other words, it still seems it tries to infer dtypes and coerce unecessary columns.
When I specify float for dtypes, it seems to ignore that and infer ints ,even when schema_sample_rows = 0. This makes polars challenging since there is a design choice to not concat ints and floats.
Expected behavior
columns other than those in use_columns should not raise coerce errors if dtype float is specified the dtype should be float floats should not silently be truncated, and Excel "integers-as-floats" should infer and/or load as integers; alternatively rounding silently would be better than truncating silently
Installed versions