Reading excel from polars
shape: (8, 4)
┌──────┬──────┬──────┬─────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ null ┆ i64 │
╞══════╪══════╪══════╪═════╡
│ 1 ┆ 1 ┆ null ┆ 1 │
│ 2 ┆ null ┆ null ┆ 2 │
│ 3 ┆ 3 ┆ null ┆ 3 │
│ null ┆ 4 ┆ null ┆ 4 │
│ 5 ┆ 5 ┆ null ┆ 5 │
│ 6 ┆ 6 ┆ null ┆ 6 │
│ 9 ┆ 9 ┆ null ┆ 9 │
│ 10 ┆ 10 ┆ null ┆ 0 │
└──────┴──────┴──────┴─────┘
Reading excel from pandas
a b c d
0 1.0 1.0 NaN 1.0
1 2.0 NaN NaN 2.0
2 3.0 3.0 NaN 3.0
3 NaN 4.0 NaN 4.0
4 5.0 5.0 NaN 5.0
5 6.0 6.0 NaN 6.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 9.0 9.0 NaN 9.0
9 10.0 10.0 NaN 0.0
Reading csv from polars
shape: (10, 4)
┌──────┬──────┬──────┬──────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╡
│ 1 ┆ 1 ┆ null ┆ 1 │
│ 2 ┆ null ┆ null ┆ 2 │
│ 3 ┆ 3 ┆ null ┆ 3 │
│ null ┆ 4 ┆ null ┆ 4 │
│ 5 ┆ 5 ┆ null ┆ 5 │
│ 6 ┆ 6 ┆ null ┆ 6 │
│ null ┆ null ┆ null ┆ null │
│ null ┆ null ┆ null ┆ null │
│ 9 ┆ 9 ┆ null ┆ 9 │
│ 10 ┆ 10 ┆ null ┆ 0 │
└──────┴──────┴──────┴──────┘
Reading csv from pandas
a b c d
0 1.0 1.0 NaN 1.0
1 2.0 NaN NaN 2.0
2 3.0 3.0 NaN 3.0
3 NaN 4.0 NaN 4.0
4 5.0 5.0 NaN 5.0
5 6.0 6.0 NaN 6.0
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 9.0 9.0 NaN 9.0
9 10.0 10.0 NaN 0.0
Issue description
We were using pandas to import files in our FastApi server that imports Excel, CSV, or XPT files. Due to lack of performance, we switched to Polars (had issue running in uvicorn as it is thread based server, made a workaround). We noticed that Polars is dropping rows that are entirely empty for Excel files.
To double check, I tested with pandas as well, and ensured that pandas do not drop empty rows. I went through polars documentation, but with no luck. performed some debugging in polars package and found _drop_null_data to be the culprit. I had monkeypatched _drop_null_data to return entire dataframe in an attempt to not loose empty rows as a workaround.
import polars as pl
def _custom_drop_null_data(df, raise_if_empty):
return df
pl.io.spreadsheet.functions._drop_null_data = _custom_drop_null_data
df = pl.read_excel(source="test_empty_rows.xlsx") # we get empty rows as well
Also polars seems to be dropping empty rows only for Excel files which you can see from the reproducible sample. So to keep backward compatibility just in case it is desired feature, I am proposing to use drop_empty_rows parameter in _drop_null_data that defaults to True, and can be switched when reading Excel files as pl.read_excel(source=path, drop_empty_rows=False). I have also verified that this happens with every engine of Excel.
This is how patched _drop_null_data will look like:
def _drop_null_data(df: pl.DataFrame, *, raise_if_empty: bool, drop_empty_rows: bool = True) -> pl.DataFrame:
"""
If DataFrame contains columns/rows that contain only nulls, drop them.
If `drop_empty_rows` is set to `False`, empty rows are not dropped.
"""
null_cols = []
for col_name in df.columns:
# note that if multiple unnamed columns are found then all but the first one
# will be named as "_duplicated_{n}" (or "__UNNAMED__{n}" from calamine)
if col_name == "" or re.match(r"(_duplicated_|__UNNAMED__)\d+$", col_name):
col = df[col_name]
if (
col.dtype == Null
or col.null_count() == len(df)
or (
col.dtype in NUMERIC_DTYPES
and col.replace(0, None).null_count() == len(df)
)
):
null_cols.append(col_name)
if null_cols:
df = df.drop(*null_cols)
if len(df) == 0 and len(df.columns) == 0:
return _empty_frame(raise_if_empty)
if drop_empty_rows:
return df.filter(~F.all_horizontal(F.all().is_null()))
return df
Expected behavior
To not drop empty rows conditionally based on new flag drop_empty_rows.
Checks
Reproducible example
test_empty_rows.xlsx test_empty_rows.csv
Log output
Issue description
We were using pandas to import files in our FastApi server that imports Excel, CSV, or XPT files. Due to lack of performance, we switched to Polars (had issue running in uvicorn as it is thread based server, made a workaround). We noticed that Polars is dropping rows that are entirely empty for Excel files.
To double check, I tested with pandas as well, and ensured that pandas do not drop empty rows. I went through polars documentation, but with no luck. performed some debugging in polars package and found
_drop_null_data
to be the culprit. I had monkeypatched_drop_null_data
to return entire dataframe in an attempt to not loose empty rows as a workaround.Also polars seems to be dropping empty rows only for Excel files which you can see from the reproducible sample. So to keep backward compatibility just in case it is desired feature, I am proposing to use
drop_empty_rows
parameter in_drop_null_data
that defaults toTrue
, and can be switched when reading Excel files aspl.read_excel(source=path, drop_empty_rows=False)
. I have also verified that this happens with every engine of Excel.This is how patched
_drop_null_data
will look like:Expected behavior
To not drop empty rows conditionally based on new flag
drop_empty_rows
.Installed versions