sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License

create_report crashes for almost empty string columns #963

Open AndrejIring opened 1 year ago

AndrejIring commented 1 year ago

Describe the bug

create_report crashes on large DataFrames that contain almost-empty string columns, with the error 'Series' object has no attribute 'len'.

After investigating, I found that the error occurs when the mean length of the elements is calculated in the method _calc_nom_stats. It is caused by a partition that contains only NaN values: nom_comps drops those values, which leaves an empty Dask partition. The error is then raised when compute is called.
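To illustrate how an all-NaN partition ends up empty, here is a minimal sketch that uses plain pandas slices as a stand-in for Dask partitions. The variable names and the chunk size are assumptions made for illustration; this is not dataprep's code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the problematic column: one string, the rest NaN.
srs = pd.Series([np.nan] * 12, dtype=object)
srs.iloc[0] = "single value"

# Emulate partitions by slicing the Series into fixed-size chunks.
chunk_size = 4
chunks = [srs.iloc[i:i + chunk_size] for i in range(0, len(srs), chunk_size)]

# Dropping NaN per chunk leaves every chunk except the first with zero rows.
sizes_after_dropna = [len(chunk.dropna()) for chunk in chunks]
print(sizes_after_dropna)  # [1, 0, 0]
```

With 300000 rows and a single non-null value, almost every real partition would be in the same state as the last two chunks here.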

To Reproduce

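import pandas as pd
import numpy as np
from dataprep.eda import create_report

# Large, purely numeric DataFrame: 300000 rows, 100 columns.
df = pd.DataFrame(np.random.randint(-100,100, (300000,100)))
# Adds a new column with one string in row 0; every other row is NaN.
df.loc[0, "almost_empty_col"] = "single value"
report = create_report(df)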

Expected behavior

create_report should handle empty partitions during its calculations. In particular, nom_comps should drop empty partitions after calling srs = srs.dropna().

dovahcrow commented 1 year ago

Hi @AndrejIring thanks for the bug report and the detailed analysis of the reason! I'll take a look into the fix.