machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License

Grouped summarize fails when a grouping col has NAs and < 2 other levels #458

Closed: machow closed this issue 1 year ago

machow commented 1 year ago

For a grouped summarize, when a grouping column...

AFAICT, setting groupby(..., dropna=False) in the underlying pandas call resolves this (cf. https://github.com/machow/siuba/issues/251)
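For reference, here is a minimal pandas-only sketch of what `dropna=False` changes. The small frame below is made-up illustration data (not the `cars` dataset), and this only shows pandas behaviour, not how siuba calls groupby internally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every value in the grouping column "cyl" is NA.
df = pd.DataFrame({
    "cyl": [np.nan, np.nan, np.nan],
    "hp": [110, 110, 93],
    "mpg": [21.0, 22.0, 24.0],
})

# Default behaviour (dropna=True): rows with an NA group key are dropped,
# so no groups remain.
dropped = df.groupby(["cyl", "hp"])["mpg"].mean()

# dropna=False keeps NA as its own group level.
kept = df.groupby(["cyl", "hp"], dropna=False)["mpg"].mean()

n_dropped = len(dropped)
n_kept = len(kept)
print(n_dropped, n_kept)
```

With the default, the all-NA key leaves an empty result; with `dropna=False`, one group per distinct `hp` value survives.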

Example: when all levels are NA, an error is raised, since the grouping columns end up both on the result and in the index

import numpy as np
from siuba import _, group_by, summarize
from siuba.data import cars

cars6 = cars.copy()
cars6["cyl"] = np.nan

cars6 >> group_by(_.cyl, _.hp) >> summarize(res = _.mpg.mean())

Raises

ValueError: cannot insert cyl, already exists
Full traceback

```python
ValueError                                Traceback (most recent call last)
Cell In [23], line 4
      1 cars6 = cars.copy()
      2 cars6["cyl"] = np.nan
----> 4 cars6 >> group_by(_.cyl, _.hp) >> summarize(res = _.mpg.mean())

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/siuba/siu/calls.py:214, in Call.__rrshift__(self, x)
    210 if isinstance(strip_symbolic(x), (Call)):
    211     # only allow non-calls (i.e. data) on the left.
    212     raise TypeError()
--> 214 return self(x)

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/siuba/siu/calls.py:189, in Call.__call__(self, x)
    187     return operator.getitem(inst, *rest)
    188 elif self.func == "__call__":
--> 189     return getattr(inst, self.func)(*rest, **kwargs)
    191 # in normal case, get method to call, and then call it
    192 f_op = getattr(operator, self.func)

File ~/.pyenv/versions/3.8.12/lib/python3.8/functools.py:875, in singledispatch.<locals>.wrapper(*args, **kw)
    871 if not args:
    872     raise TypeError(f'{funcname} requires at least '
    873                     '1 positional argument')
--> 875 return dispatch(args[0].__class__)(*args, **kw)

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/siuba/dply/verbs.py:564, in _summarize(__data, *args, **kwargs)
    561 df = __data.apply(df_summarize, *args, **kwargs)
    563 group_by_lvls = list(range(df.index.nlevels - 1))
--> 564 out = df.reset_index(group_by_lvls)
    565 out.index = pd.RangeIndex(df.shape[0])
    567 return out

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/pandas/core/frame.py:6350, in DataFrame.reset_index(self, level, drop, inplace, col_level, col_fill, allow_duplicates, names)
   6344 if lab is not None:
   6345     # if we have the codes, extract the values with a mask
   6346     level_values = algorithms.take(
   6347         level_values, lab, allow_fill=True, fill_value=lev._na_value
   6348     )
-> 6350 new_obj.insert(
   6351     0,
   6352     name,
   6353     level_values,
   6354     allow_duplicates=allow_duplicates,
   6355 )
   6357 new_obj.index = new_index
   6358 if not inplace:

File ~/.virtualenvs/siuba/lib/python3.8/site-packages/pandas/core/frame.py:4806, in DataFrame.insert(self, loc, column, value, allow_duplicates)
   4800     raise ValueError(
   4801         "Cannot specify 'allow_duplicates=True' when "
   4802         "'self.flags.allows_duplicate_labels' is False."
   4803     )
   4804 if not allow_duplicates and column in self.columns:
   4805     # Should this be a different kind of error??
-> 4806     raise ValueError(f"cannot insert {column}, already exists")
   4807 if not isinstance(loc, int):
   4808     raise TypeError("loc must be int")

ValueError: cannot insert cyl, already exists
```
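The final pandas error can be reproduced in isolation. The sketch below is an assumption about the shape of the intermediate frame (a `cyl` column plus an index level also named `cyl`), built by hand for illustration; it does not go through siuba's `_summarize`.

```python
import numpy as np
import pandas as pd

# Hand-built frame where "cyl" is both a column and the name of the index,
# mimicking a result that carries the grouping column in two places.
df = pd.DataFrame(
    {"cyl": [np.nan], "res": [21.0]},
    index=pd.Index([np.nan], name="cyl"),
)

try:
    # reset_index tries to insert the index level as a column named "cyl",
    # which collides with the existing column.
    df.reset_index()
    message = None
except ValueError as err:
    message = str(err)

print(message)
```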

Example: with only 1 non-NA level, the output is a table without the grouping columns

cars5 = cars.copy()
cars5["cyl"] = [1] + [np.nan] * (len(cars) - 1)

cars5 >> group_by(_.cyl, _.hp) >> summarize(res = _.mpg.mean())

Output

Note there's no cyl or hp column on the result

[image: resulting table, with no cyl or hp column]

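As a point of comparison, this is roughly the shape one would expect the result to have, sketched in plain pandas on made-up data with a single non-NA `cyl` value. It is a baseline for the expected columns only, not siuba's implementation or its fix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one non-NA level in "cyl", the rest NA.
df = pd.DataFrame({
    "cyl": [1.0, np.nan, np.nan, np.nan],
    "hp": [110, 110, 93, 93],
    "mpg": [21.0, 22.0, 24.0, 26.0],
})

# Keep NA group keys, summarize, then move the group keys back to columns.
expected = (
    df.groupby(["cyl", "hp"], dropna=False)
    .agg(res=("mpg", "mean"))
    .reset_index()
)

print(expected)
```

Here both grouping columns appear on the result, alongside `res`, and the NA level of `cyl` is retained as a group.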
machow commented 1 year ago

Addressed in v0.4.2