ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr
1.1k stars 79 forks source link

Own function use in `sfl()` broken in dplyr 1.0.6 #664

Closed msberends closed 3 years ago

msberends commented 3 years ago

For a class mic in our AMR package, we have this:

get_skimmers.mic <- function(column) {
  skimr::sfl(
    skim_type = "mic",
    min = ~min(., na.rm = TRUE),
    max = ~max(., na.rm = TRUE),
    median = ~stats::median(., na.rm = TRUE),
    n_unique = ~pm_n_distinct(., na.rm = TRUE),
    hist_log2 = ~skimr::inline_hist(log2(stats::na.omit(.)))
  )
}

This worked great until the update to dplyr 1.0.6, raising this error:

Error: Problem with summarise() column skimmed. i skimmed = purrr::map2(...). x Problem with summarise() input ..1. i ..1 = dplyr::across(variable_names, mangled_skimmers$funs). x could not find function "pm_n_distinct" Run rlang::last_error() to see where the error occurred.

The pm_n_distinct() is a function in the AMR package namespace, but not exported for users (it's actually a copy of dplyr::n_distinct() in base R language, but that doesn't matter for this issue).

So it seems that functions from a package namespace cannot be used anymore in sfl() as it seems to use across() internally. I think this might be due to this line of the changelog of dplyr 1.0.6:

across() now inlines lambda-formulas. This is slightly more performant and will allow more optimisations in the future.

msberends commented 3 years ago

As expected, replacing

n_unique = ~pm_n_distinct(., na.rm = TRUE),

with

n_unique = ~length(unique(stats::na.omit(.))),

solves it, but this workaround should not be needed.

michaelquinn32 commented 3 years ago

Thanks for reporting that!

It looks like the dplyr change is stopping you from using an internal function. What would be the consequences of exporting that function? Does it resolve the issue?

Otherwise, I think this is a dplyr issue. I'm not sure if there is anything that we can do with skimr to resolve it. Mind cross linking there?

msberends commented 3 years ago

No, I think this is caused by the way skimr handles functions set in sfl(), namely with dplyr’s across(). And since dplyr 1.0.6, usage of internal functions inside sfl() do not work anymore, and I think this is related to an update of across() as they state in their changelog.

So my guess is that the skimr code needs to be updated to work with this new way of using across(), but I’m also not sure...

Our error log on CRAN since a couple of days: https://cran.r-project.org/web/checks/check_results_AMR.html

msberends commented 3 years ago

👍👍 Thanks for the ref

michaelquinn32 commented 3 years ago

No problem. I'm not sure how to fix this in skimr, so hopefully someone on the dplyr team can suggest.

msberends commented 3 years ago

Can't reproduce it (see below, dplyr 1.0.6, skimr 2.1.3). Very strange! Maybe it was just an AMR error, but I still don't understand why across() would suddenly be unable to find a function in the same package namespace, while in this reprex it all seems to work. And I know for sure it raises an error since dplyr 1.0.6, so I also don't think it's an skimr bug...

Closing this for now.


x <- factor(letters[1:10], ordered = TRUE)
class(x) <- c("newclass", class(x))

length_unique <- function(x) {
  length(unique(x))
}

get_skimmers.newclass <- function(column) {
  skimr::sfl(
    skim_type = "newclass",
    test = ~length_unique(.)
  )
}

skimr::skim(x)
Data summary
Name x
Number of rows 10
Number of columns 1
_______________________
Column type frequency:
newclass 1
________________________
Group variables None

Data summary

Variable type: newclass

skim_variable n_missing complete_rate test
data 0 1 10

Created on 2021-05-11 by the reprex package (v1.0.0)