ropensci / skimr

A frictionless, pipeable approach to dealing with summary statistics
https://docs.ropensci.org/skimr

Error when a group is all infinite #745

Open lazappi opened 4 months ago

lazappi commented 4 months ago

Hi

I just came across this error when grouping leaves one group of a variable with only infinite values. It appears to come from a {glue} call, but I suspect the root cause is in calculating the statistics, with the {glue} failure happening while that error message is processed.

library(dplyr)
library(skimr)

df <- data.frame(group = c("A", "B"), a = c(1, Inf))

# Skimming the whole data frame works (with a warning)
skim(df)
#> Warning: There was 1 warning in `dplyr::summarize()`.
#> ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
#>   mangled_skimmers$funs)`.
#> ℹ In group 0: .
#> Caused by warning:
#> ! There was 1 warning in `dplyr::summarize()`.
#> ℹ In argument: `dplyr::across(tidyselect::any_of(variable_names),
#>   mangled_skimmers$funs)`.
#> Caused by warning in `inline_hist()`:
#> ! Variable contains Inf or -Inf value(s) that were converted to NA.
Name df
Number of rows 2
Number of columns 2
_______________________
Column type frequency:
character 1
numeric 1
________________________
Group variables None

Data summary

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
group 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
a 0 1 Inf NaN 1 Inf Inf Inf Inf ▁▁▇▁▁
# Grouping then skimming gives an error
df |> group_by(group) |> skim()
#> Error:
#> ! Failed to evaluate glue component {label}
#> Caused by error in `vapply()`:
#> ! values must be length 1,
#>  but FUN(X[[1]]) result is length 0

Created on 2024-07-22 with reprex v2.1.1
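
The `vapply()` message at the end of that error is base R's generic complaint when a function that was declared to return one value per element returns none. The base-R sketch below reproduces the same message outside skimr. `keep_finite()` is a made-up stand-in, not skimr's code, so this only illustrates the class of failure and makes no claim about where the bug actually sits.

```r
# Hypothetical stand-in (not part of skimr): keep only the finite values.
# For a group that is entirely Inf this returns a zero-length vector.
keep_finite <- function(x) x[is.finite(x)]

# One element per group, mirroring the reprex data
groups <- list(A = 1, B = Inf)

# vapply() requires exactly one numeric per element, so group B fails.
msg <- tryCatch(
  vapply(groups, keep_finite, numeric(1)),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```

If the real cause is similar, a guard that returns `NA` for zero-length results would be one possible direction, but that is an assumption rather than a confirmed diagnosis.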

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24)
#>  os       macOS Sonoma 14.5
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Berlin
#>  date     2024-07-22
#>  pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package     * version  date (UTC) lib source
#>  P base64enc     0.1-3    2015-07-28 [?] CRAN (R 4.4.0)
#>  P cli           3.6.3    2024-06-21 [?] CRAN (R 4.4.0)
#>  P digest        0.6.36   2024-06-23 [?] CRAN (R 4.4.0)
#>  P dplyr       * 1.1.4    2023-11-17 [?] CRAN (R 4.4.0)
#>  P evaluate      0.24.0   2024-06-10 [?] CRAN (R 4.4.0)
#>  P fansi         1.0.6    2023-12-08 [?] CRAN (R 4.4.0)
#>  P fastmap       1.2.0    2024-05-15 [?] CRAN (R 4.4.0)
#>  P fs            1.6.4    2024-04-25 [?] CRAN (R 4.4.0)
#>  P generics      0.1.3    2022-07-05 [?] CRAN (R 4.4.0)
#>  P glue          1.7.0    2024-01-09 [?] CRAN (R 4.4.0)
#>  P htmltools     0.5.8.1  2024-04-04 [?] CRAN (R 4.4.0)
#>  P jsonlite      1.8.8    2023-12-04 [?] CRAN (R 4.4.0)
#>  P knitr         1.48     2024-07-07 [?] CRAN (R 4.4.0)
#>  P lifecycle     1.0.4    2023-11-07 [?] CRAN (R 4.4.0)
#>  P magrittr      2.0.3    2022-03-30 [?] CRAN (R 4.4.0)
#>  P pillar        1.9.0    2023-03-22 [?] CRAN (R 4.4.0)
#>  P pkgconfig     2.0.3    2019-09-22 [?] CRAN (R 4.4.0)
#>  P purrr         1.0.2    2023-08-10 [?] CRAN (R 4.4.0)
#>  P R6            2.5.1    2021-08-19 [?] CRAN (R 4.4.0)
#>  P repr          1.1.7    2024-03-22 [?] CRAN (R 4.4.0)
#>  P reprex        2.1.1    2024-07-06 [?] CRAN (R 4.4.0)
#>  P rlang         1.1.4    2024-06-04 [?] CRAN (R 4.4.0)
#>  P rmarkdown     2.27     2024-05-17 [?] CRAN (R 4.4.0)
#>  P rstudioapi    0.16.0   2024-03-24 [?] CRAN (R 4.4.0)
#>  P sessioninfo   1.2.2    2021-12-06 [?] CRAN (R 4.4.0)
#>  P skimr       * 2.1.5    2022-12-23 [?] CRAN (R 4.4.0)
#>  P stringi       1.8.4    2024-05-06 [?] CRAN (R 4.4.0)
#>  P stringr       1.5.1    2023-11-14 [?] CRAN (R 4.4.0)
#>  P tibble        3.2.1    2023-03-20 [?] CRAN (R 4.4.0)
#>  P tidyr         1.3.1    2024-01-24 [?] CRAN (R 4.4.0)
#>  P tidyselect    1.2.1    2024-03-11 [?] CRAN (R 4.4.0)
#>  P utf8          1.2.4    2023-10-22 [?] CRAN (R 4.4.0)
#>  P vctrs         0.6.5    2023-12-01 [?] CRAN (R 4.4.0)
#>  P withr         3.0.0    2024-01-16 [?] CRAN (R 4.4.0)
#>  P xfun          0.45     2024-06-16 [?] CRAN (R 4.4.0)
#>  P yaml          2.3.9    2024-07-05 [?] CRAN (R 4.4.0)
#>
#>  [1] /Users/luke.zappia/Documents/Projects/20240710-NewScreenings-MyeloidLuminex/renv/library/macos/R-4.4/aarch64-apple-darwin20
#>  [2] /Users/luke.zappia/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815
#>  [3] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
#>
#>  P ── Loaded and on-disk path mismatch.
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

davi-dmittelstadt commented 2 weeks ago

Hi

Just adding that the same error also occurs for Date variables when groups only have missing observations (NA).

library(dplyr)
library(skimr)

df <- data.frame(
  group = c("A", "B"), 
  b = c(Sys.Date(), NA))

skim(df)
Name df
Number of rows 2
Number of columns 2
_______________________
Column type frequency:
character 1
Date 1
________________________
Group variables None

Data summary

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
group 0 1 1 1 0 2 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
b 1 0.5 2024-11-07 2024-11-07 2024-11-07 1

df |> group_by(group) |> skim()
#> Error in vapply(.x, .f, .mold, ..., USE.NAMES = FALSE): values must be length 1,
#>  but FUN(X[[1]]) result is length 0

Created on 2024-11-07 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31 ucrt)
#>  os       Windows 10 x64 (build 22631)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United Kingdom.utf8
#>  ctype    English_United Kingdom.utf8
#>  tz       America/Chicago
#>  date     2024-11-07
#>  pandoc   3.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  base64enc     0.1-3    2015-07-28 [1] CRAN (R 4.2.2)
#>  cli           3.6.0    2023-01-09 [1] CRAN (R 4.2.2)
#>  digest        0.6.31   2022-12-11 [1] CRAN (R 4.2.2)
#>  dplyr       * 1.1.4    2023-11-17 [1] CRAN (R 4.2.2)
#>  evaluate      0.20     2023-01-17 [1] CRAN (R 4.2.2)
#>  fansi         1.0.3    2022-03-24 [1] CRAN (R 4.2.2)
#>  fastmap       1.1.0    2021-01-25 [1] CRAN (R 4.2.2)
#>  fs            1.5.2    2021-12-08 [1] CRAN (R 4.2.2)
#>  generics      0.1.3    2022-07-05 [1] CRAN (R 4.2.2)
#>  glue          1.6.2    2022-02-24 [1] CRAN (R 4.2.2)
#>  highr         0.10     2022-12-22 [1] CRAN (R 4.2.2)
#>  htmltools     0.5.4    2022-12-07 [1] CRAN (R 4.2.2)
#>  jsonlite      1.8.4    2022-12-06 [1] CRAN (R 4.2.2)
#>  knitr         1.41     2022-11-18 [1] CRAN (R 4.2.2)
#>  lifecycle     1.0.3    2022-10-07 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.2.2)
#>  pillar        1.9.0    2023-03-22 [1] CRAN (R 4.2.2)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.2.2)
#>  purrr         1.0.1    2023-01-10 [1] CRAN (R 4.2.2)
#>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.2.2)
#>  repr          1.1.7    2024-03-22 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2    2022-08-17 [1] CRAN (R 4.2.2)
#>  rlang         1.1.1    2023-04-28 [1] CRAN (R 4.2.2)
#>  rmarkdown     2.20     2023-01-19 [1] CRAN (R 4.2.2)
#>  rstudioapi    0.16.0   2024-03-24 [1] CRAN (R 4.2.2)
#>  sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.2.2)
#>  skimr       * 2.1.5    2022-12-23 [1] CRAN (R 4.2.2)
#>  stringi       1.7.12   2023-01-11 [1] CRAN (R 4.2.2)
#>  stringr       1.5.1    2023-11-14 [1] CRAN (R 4.2.2)
#>  tibble        3.2.1    2023-03-20 [1] CRAN (R 4.2.2)
#>  tidyr         1.3.1    2024-01-24 [1] CRAN (R 4.2.2)
#>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.2.2)
#>  utf8          1.2.2    2021-07-24 [1] CRAN (R 4.2.2)
#>  vctrs         0.6.5    2023-12-01 [1] CRAN (R 4.2.2)
#>  withr         2.5.0    2022-03-03 [1] CRAN (R 4.2.2)
#>  xfun          0.36     2022-12-21 [1] CRAN (R 4.2.2)
#>  yaml          2.3.6    2022-10-18 [1] CRAN (R 4.2.2)
#>
#>  [1] C:/Users/david/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.2/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

elinw commented 39 minutes ago

Sorry for the long delay. I think we should definitely be handling this, similar to how we handle some other specific situations. That is to say, we should fix the function where this is an issue. I also want to look at other situations where `is.nan()` returns `TRUE`.
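
As a starting point for that survey, here is a small base-R sketch of summary statistics that come back `NaN`, together with a per-group check for "no finite values at all". The helper name `no_finite_values()` is hypothetical and not part of skimr; the data frame is the one from the original report.

```r
# Base-R summaries that can return NaN (as opposed to NA)
nan_cases <- c(
  mean_of_empty   = mean(numeric(0)),    # 0 / 0
  sd_with_inf     = sd(c(1, Inf)),       # the NaN sd seen in the ungrouped output
  mean_mixed_inf  = mean(c(Inf, -Inf)),  # Inf + (-Inf)
  sd_single_value = sd(Inf)              # NA, not NaN: only one observation
)
print(is.nan(nan_cases))

# Hypothetical helper (not in skimr): TRUE when a vector has no finite values,
# which covers all-Inf, all-NA, all-NaN and empty input.
no_finite_values <- function(x) !any(is.finite(x))

df <- data.frame(group = c("A", "B"), a = c(1, Inf))

# Flag the groups a skimmer might need to special-case
flags <- tapply(df$a, df$group, no_finite_values)
print(flags)
```

Note that `is.nan()` alone would miss the single-value `sd()` case, which is `NA`, so a check based on `is.finite()` may be the broader guard. Whether that matches what skimr needs is an open question.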