machow / siuba

Python library for using dplyr like syntax with pandas and SQL
https://siuba.org
MIT License

group_by >> summarize on an empty df #467

Open nathanjmcdougall opened 1 year ago

nathanjmcdougall commented 1 year ago

Consider the following:

from pandas import DataFrame
from siuba import _, group_by, summarize
DataFrame.from_dict(dict(x=[], y=[])) >> group_by(_.x) >> summarize(z=_.y.sum())

This doesn't add the column z:

x y

I would have expected

x z
machow commented 1 year ago

Thanks for reporting. Digging a bit into dplyr, it seems like it has some careful handling of this case.

For example:

library(dplyr)

df <- tibble(a = integer(), b = integer())

# in all the examples below, the value is discarded (e.g. 1, 1.2 get thrown away)

# c is an int
df %>% group_by(a) %>% summarize(c = 1)

# c is a dbl
df %>% group_by(a) %>% summarize(c = 1.2)

# c is an int, since sum(a) is 0
df %>% group_by(a) %>% summarize(c = sum(a))
machow commented 1 year ago

Note also that the experimental behavior of summarize being able to return 0 or more than 1 row per group is deprecated (and a new function tentatively called reframe will handle that behavior!).

It seems like the code above still works on the main branch of dplyr, but this case now prints a warning:


df %>% group_by(a) %>% summarize(c = integer())

output:

Warning message:
Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an
  ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
nathanjmcdougall commented 1 year ago

Ah, this is quite an interesting way of looking at it.

"A grouped summarise always return 1 row per group" But what if there are no groups? Does this violate the 1 row per group rule? I would argue that the answer is no rather than yes.

Regarding this process:

It seems to me that there are no groups to group by, so there is no empty data to summarize with a function like sum, and no resulting array to set to a correct type, etc. Rather than passing an empty list of values to sum and returning 0, we don't even need to run any summarization because there are no groups.

It seems that most summarizing methods in pandas, such as sum, all, and mean, accept vacuous/empty inputs and return 0, True, and NaN respectively, i.e. one value, not zero values. This means that in most cases I would need to handle empty dataframes separately to ensure that the result of a group_by operation has the same column structure at the end of the process as it does for non-empty dataframes.
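To make the point about empty inputs concrete, here is a minimal sketch in plain pandas (no siuba involved). The explicit dtypes and the variable names are assumptions chosen for illustration. It shows the three reductions each returning a single value on an empty Series, and a plain pandas groupby on an empty frame that produces zero rows but still carries the output column.

```python
import math

import pandas as pd

# An empty float Series: each reduction still returns exactly one value.
empty = pd.Series([], dtype="float64")

total = empty.sum()        # additive identity
everything = empty.all()   # vacuously true
average = empty.mean()     # undefined, so NaN

# An empty frame has zero groups, so the grouped result has zero rows,
# but the summarized column "z" is still present in the output.
df = pd.DataFrame(
    {
        "x": pd.Series([], dtype="int64"),
        "y": pd.Series([], dtype="float64"),
    }
)
out = df.groupby("x")["y"].sum().reset_index(name="z")

print(total, everything, average)
print(list(out.columns), len(out))
```

The grouped result keeps the x and z columns with no rows, which is the column structure asked for at the top of this issue.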

If siuba needs to match dplyr behaviour on this point, is there the possibility of adding an optional argument to the summarize function, like __fail_empty: bool = True? Or some other workaround? In any case, I feel like an explicit warning would be helpful when this existing functionality kicks in.
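In the meantime, one possible user-side workaround is to handle the empty case explicitly before summarizing. The sketch below uses plain pandas. The helper name grouped_sum and its parameters are hypothetical, not part of siuba or pandas, and it only covers a single grouping column with a sum.

```python
import warnings

import pandas as pd


def grouped_sum(df, by, col, out):
    """Hypothetical helper: sum `col` within each `by` group into column `out`.

    On an empty frame it warns and returns an empty frame that still has
    the `by` and `out` columns, so downstream code sees a stable schema.
    """
    if df.empty:
        warnings.warn(
            "grouped_sum received an empty DataFrame; returning an empty result",
            stacklevel=2,
        )
        # Reuse the (empty) input columns so their dtypes carry over.
        return pd.DataFrame({by: df[by], out: df[col]})
    return df.groupby(by)[col].sum().reset_index(name=out)


empty_result = grouped_sum(pd.DataFrame({"x": [], "y": []}), "x", "y", "z")
full_result = grouped_sum(
    pd.DataFrame({"x": [1, 1, 2], "y": [1.0, 2.0, 3.0]}), "x", "y", "z"
)
print(list(empty_result.columns))
print(full_result)
```

The warning makes the empty path visible, in the spirit of the explicit warning suggested above, while the returned frame has the same columns in both branches.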