pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.22k stars 1.84k forks source link

Polars ignoring rows that are empty in Excel #18250

Open Rashik-raj opened 3 weeks ago

Rashik-raj commented 3 weeks ago

Checks

Reproducible example

import polars as pl
import pandas as pd

def read_excel_using_polars(path):
    return pl.read_excel(source=path)

def read_csv_using_polars(path):
    return pl.read_csv(source=path)

def read_excel_using_pandas(path):
    return pd.read_excel(io=path)

def read_csv_using_pandas(path):
    return pd.read_csv(path)

pl_df_excel = read_excel_using_polars("test_empty_rows.xlsx")
pd_df_excel =read_excel_using_pandas("test_empty_rows.xlsx")

pl_df_csv = read_csv_using_polars("test_empty_rows.csv")
pd_df_csv = read_csv_using_pandas("test_empty_rows.csv")

print("Reading excel from polars", end="\n\n")
print(pl_df_excel)

print("\nReading excel from pandas", end="\n\n")
print(pd_df_excel)

print("\nReading csv from polars", end="\n\n")
print(pl_df_csv)

print("\nReading csv from pandas", end="\n\n")
print(pd_df_csv)

test_empty_rows.xlsx test_empty_rows.csv

Log output

Reading excel from polars

shape: (8, 4)
┌──────┬──────┬──────┬─────┐
│ a    ┆ b    ┆ c    ┆ d   │
│ ---  ┆ ---  ┆ ---  ┆ --- │
│ i64  ┆ i64  ┆ null ┆ i64 │
╞══════╪══════╪══════╪═════╡
│ 1    ┆ 1    ┆ null ┆ 1   │
│ 2    ┆ null ┆ null ┆ 2   │
│ 3    ┆ 3    ┆ null ┆ 3   │
│ null ┆ 4    ┆ null ┆ 4   │
│ 5    ┆ 5    ┆ null ┆ 5   │
│ 6    ┆ 6    ┆ null ┆ 6   │
│ 9    ┆ 9    ┆ null ┆ 9   │
│ 10   ┆ 10   ┆ null ┆ 0   │
└──────┴──────┴──────┴─────┘

Reading excel from pandas

      a     b   c    d
0   1.0   1.0 NaN  1.0
1   2.0   NaN NaN  2.0
2   3.0   3.0 NaN  3.0
3   NaN   4.0 NaN  4.0
4   5.0   5.0 NaN  5.0
5   6.0   6.0 NaN  6.0
6   NaN   NaN NaN  NaN
7   NaN   NaN NaN  NaN
8   9.0   9.0 NaN  9.0
9  10.0  10.0 NaN  0.0

Reading csv from polars

shape: (10, 4)
┌──────┬──────┬──────┬──────┐
│ a    ┆ b    ┆ c    ┆ d    │
│ ---  ┆ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  ┆ i64  │
╞══════╪══════╪══════╪══════╡
│ 1    ┆ 1    ┆ null ┆ 1    │
│ 2    ┆ null ┆ null ┆ 2    │
│ 3    ┆ 3    ┆ null ┆ 3    │
│ null ┆ 4    ┆ null ┆ 4    │
│ 5    ┆ 5    ┆ null ┆ 5    │
│ 6    ┆ 6    ┆ null ┆ 6    │
│ null ┆ null ┆ null ┆ null │
│ null ┆ null ┆ null ┆ null │
│ 9    ┆ 9    ┆ null ┆ 9    │
│ 10   ┆ 10   ┆ null ┆ 0    │
└──────┴──────┴──────┴──────┘

Reading csv from pandas

      a     b   c    d
0   1.0   1.0 NaN  1.0
1   2.0   NaN NaN  2.0
2   3.0   3.0 NaN  3.0
3   NaN   4.0 NaN  4.0
4   5.0   5.0 NaN  5.0
5   6.0   6.0 NaN  6.0
6   NaN   NaN NaN  NaN
7   NaN   NaN NaN  NaN
8   9.0   9.0 NaN  9.0
9  10.0  10.0 NaN  0.0

Issue description

We were using pandas to import files in our FastApi server that imports Excel, CSV, or XPT files. Due to lack of performance, we switched to Polars (had issue running in uvicorn as it is thread based server, made a workaround). We noticed that Polars is dropping rows that are entirely empty for Excel files.

To double check, I tested with pandas as well, and ensured that pandas do not drop empty rows. I went through polars documentation, but with no luck. performed some debugging in polars package and found _drop_null_data to be the culprit. I had monkeypatched _drop_null_data to return entire dataframe in an attempt to not loose empty rows as a workaround.

import polars as pl

def _custom_drop_null_data(df, raise_if_empty):
    return df

pl.io.spreadsheet.functions._drop_null_data = _custom_drop_null_data
df = pl.read_excel(source="test_empty_rows.xlsx")  # we get empty rows as well

Also polars seems to be dropping empty rows only for Excel files which you can see from the reproducible sample. So to keep backward compatibility just in case it is desired feature, I am proposing to use drop_empty_rows parameter in _drop_null_data that defaults to True, and can be switched when reading Excel files as pl.read_excel(source=path, drop_empty_rows=False). I have also verified that this happens with every engine of Excel.

This is how patched _drop_null_data will look like:

def _drop_null_data(df: pl.DataFrame, *, raise_if_empty: bool, drop_empty_rows: bool = True) -> pl.DataFrame:
    """
    If DataFrame contains columns/rows that contain only nulls, drop them.

    If `drop_empty_rows` is set to `False`, empty rows are not dropped.
    """
    null_cols = []
    for col_name in df.columns:
        # note that if multiple unnamed columns are found then all but the first one
        # will be named as "_duplicated_{n}" (or "__UNNAMED__{n}" from calamine)
        if col_name == "" or re.match(r"(_duplicated_|__UNNAMED__)\d+$", col_name):
            col = df[col_name]
            if (
                col.dtype == Null
                or col.null_count() == len(df)
                or (
                    col.dtype in NUMERIC_DTYPES
                    and col.replace(0, None).null_count() == len(df)
                )
            ):
                null_cols.append(col_name)
    if null_cols:
        df = df.drop(*null_cols)

    if len(df) == 0 and len(df.columns) == 0:
        return _empty_frame(raise_if_empty)
    if drop_empty_rows:
        return df.filter(~F.all_horizontal(F.all().is_null()))
    return df

Expected behavior

To not drop empty rows conditionally based on new flag drop_empty_rows.

Installed versions

``` --------Version info--------- Polars: 1.5.0 Index type: UInt32 Platform: macOS-14.6.1-arm64-arm-64bit Python: 3.12.0 (v3.12.0:0fb18b02c8, Oct 2 2023, 09:45:56) [Clang 13.0.0 (clang-1300.0.29.30)] ----Optional dependencies---- adbc_driver_manager 1.1.0 cloudpickle 3.0.0 connectorx 0.3.3 deltalake 0.19.0 fastexcel 0.11.5 fsspec 2024.6.1 gevent 24.2.1 great_tables 0.10.0 hvplot 0.10.0 matplotlib 3.9.2 nest_asyncio 1.6.0 numpy 2.0.1 openpyxl 3.1.5 pandas 2.2.2 pyarrow 17.0.0 pydantic 2.8.2 pyiceberg sqlalchemy 2.0.32 torch xlsx2csv 0.8.3 xlsxwriter 3.2.0 ```
Rashik-raj commented 3 weeks ago

Found out that there is open ticket 14874. Somehow I did not find this 2 week back.