Describe the bug
create_report crashes on large DataFrames that contain an almost empty string column, raising the error: 'Series' object has no attribute 'len'
After investigating, I found that the error occurs in the method _calc_nom_stats when it calculates the mean length of the elements. The cause is a partition that contains only NaN values: the method nom_comps drops those values, which leaves an empty Dask partition, and the error is raised later when compute is called.
To Reproduce
import pandas as pd
import numpy as np
from dataprep.eda import create_report
df = pd.DataFrame(np.random.randint(-100,100, (300000,100)))
df.loc[0, "almost_empty_col"] = "single value"
report = create_report(df)
Expected behavior
create_report should handle empty partitions during its calculations. In particular, the method nom_comps should drop empty partitions after calling srs = srs.dropna()
Desktop (please complete the following information):