group_by not always working

ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics

https://docs.ropensci.org/skimr

1.1k stars 78 forks source link

group_by not always working #736

Open DaZaM82 opened 1 year ago

DaZaM82 commented 1 year ago

For some data frames I get an error when applying group_by before skim(). Doesn’t seem to depend on the grouping variable type.

this issue doesn’t come up for iris though where it works.

I’m not sure how to provide a reprex for this. Is this a known issue? Any way to resolve it?

DaZaM82 commented 1 year ago

Some additional info...

If I run the code up to group_by() and nothing afterwards, then that works fine. Error is thrown by the skimr function

I tried to see if any of the groupings had zero length and found that none do, so not sure what is going on:

elinw commented 1 year ago

I don't think the issue is the number of rows. It's with the classes. Is it possible to isolate which data type is causing the problem by filtering prior to skimming? Also, is the group_by() necessary to cause the error?

DaZaM82 commented 1 year ago

Hi @elinw
Thanks for that. I was able to identify the problem following your suggestion. The error is when there are date columns that are all NA.

So now I can generate a reprex: iris %>% mutate(date = NA_Date_) %>% group_by(Species) %>% skimr::skim()

This seems like a bug to me?

elinw commented 1 year ago

For sure a bug. Thanks for the report!

elinw commented 1 year ago

@michaelquinn32 When all values are NA and na.rm = TRUE (which we use everywhere) the functions like min() return Inf with a warning. . Whereas if na.rm = FALSE is set they return NA. mean() and others, however, return NA without a warning. So I think we (a) should decide what should be returned when skimming such data and (b) need to think about handling the warning. I suspect the Inf is possibly the ultimate cause of the issue reported.

elinw commented 1 year ago

Okay I take it back, because there is no such thing as NA_Date_ as far as I can tell, unless it is defined in a package. So I think that issue is getting absorbed into the pipe chain and then causing the errors. You can see this if you try

iris %>%  mutate(date = NA_date_)

So then the pipe ends up passing nothing along to the final steps.

Update: So lubridate does does define NA_Date and loading it first solves the error, though not the warnings.

library(lubridate)
iris %>%  mutate(date = NA_Date_) |> skimr::skim()

Warning messages: 1: In min.default(c(NAreal, NAreal, NAreal, NAreal, NAreal, : no non-missing arguments to min; returning Inf 2: In max.default(c(NAreal, NAreal, NAreal, NAreal, NAreal, : no non-missing arguments to max; returning -Inf

I guess the question is, do we want to silence the warnings? Or would we want to handle like a numeric since I don't think min and max are defined for the date class. Consistency with Numeric would return NA for both min and max.

DaZaM82 commented 7 months ago

iris %>% mutate(date = NAdate)

@elinw NA_Date_ is a valid object even without lubridate. Your above code throws an error because you have a lower case d where it should be a capital D. If you change it from NA_date_ to NA_Date_ then it works.