tidyverse / forcats

🐈🐈🐈🐈: tools for working with categorical variables (factors)
https://forcats.tidyverse.org/

Unclear warnings and errors generated when setting levels for a factor generated from a character vector #314

Closed wtimmerman-fitp closed 2 years ago

wtimmerman-fitp commented 2 years ago

When I use fct_relevel() with a levels argument, I receive a warning that does not clearly indicate what is going wrong. Similarly, when I pass a levels argument to forcats::as_factor() (on the assumption that arguments in ... will be passed on to methods), I receive the error "Arguments in ... must be used". Both results are unexpected given my understanding of the function help text and of base::factor().

For background, my intention is to convert a character column into a factor column using a pre-specified list of levels (the pre-specified list is somewhat important as a check and consistency for reasons that I won't get into here). I have reviewed the forcats issues and don't see an exact match for this problem:

My questions are:

Reprex

library(tidyverse)

mtcars2 <-
  mtcars %>% 
  tibble::rownames_to_column(var = "make_model") %>% 
  dplyr::filter(
    dplyr::row_number() <= 5
  )

use_levels <-
  mtcars2 %>% 
  dplyr::pull(make_model) 

# this works as expected, since the provided levels will by definition match the values in the make_model column.
mtcars2_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = use_levels
    )
  )

# I don't understand why this is an error based on the as_factor() help.
mtcars2_as_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::as_factor(
      make_model,
      levels = use_levels
    )
  )
#> Error in `dplyr::mutate()`:
#> ! Problem while computing `make_model = forcats::as_factor(make_model,
#>   levels = use_levels)`.
#> Caused by error:
#> ! Arguments in `...` must be used.
#> x Problematic argument:
#> * levels = use_levels

# I don't understand why this generates this warning since use_levels does not have names
mtcars2_fct_relevel <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::fct_relevel(
      make_model,
      levels = use_levels
    )
  )
#> Warning: Outer names are only allowed for unnamed scalar atomic inputs

# When I modify use_levels to include a value not present in the column, more challenges arise.
use_levels_mod <-
  c(use_levels, "Other Car")

# base::factor() gives no message when some of the supplied levels are not present in the data.
mtcars2_mod_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = use_levels_mod
    )
  )

# as_factor() continues to error
mtcars2_mod_as_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::as_factor(
      make_model,
      levels = use_levels_mod
    )
  )
#> Error in `dplyr::mutate()`:
#> ! Problem while computing `make_model = forcats::as_factor(make_model,
#>   levels = use_levels_mod)`.
#> Caused by error:
#> ! Arguments in `...` must be used.
#> x Problematic argument:
#> * levels = use_levels_mod

# fct_relevel generates an expected warning, but still has the 
# original warning that makes little sense in this case.

mtcars2_mod_fct_relevel <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::fct_relevel(
      make_model,
      levels = use_levels_mod
    )
  )
#> Warning: Outer names are only allowed for unnamed scalar atomic inputs
#> Warning: Unknown levels in `f`: Other Car

Created on 2022-08-09 by the reprex package (v2.0.1)
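
On the as_factor() error: its documentation says that for character input the levels are created in the order in which they appear, and the character method has no levels argument to pass through. Since use_levels in the reprex is simply the column in its original order, a plain as_factor() call already gives those levels. A minimal sketch, assuming forcats is installed (the vector x is made up for illustration):

```r
library(forcats)

# Hypothetical input with one repeated value
x <- c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Mazda RX4")

# as_factor() on a character vector: levels in order of first appearance
f_inorder <- as_factor(x)

# The base R equivalent with explicit levels
f_base <- factor(x, levels = unique(x))

# Compare the level order of the two results
identical(levels(f_inorder), levels(f_base))
```

This only covers the case where the desired levels match the order of appearance; a pre-specified list with extra or reordered levels still needs base::factor() or a constructor that accepts levels.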

Session info

``` r
sessionInfo()
#> R version 4.0.5 (2021-03-31)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19043)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4
#> [5] readr_2.1.2     tidyr_1.2.0     tibble_3.1.8    ggplot2_3.3.6
#> [9] tidyverse_1.3.2
#>
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.2    xfun_0.31           haven_2.5.0
#>  [4] gargle_1.2.0        colorspace_2.0-3    vctrs_0.4.1
#>  [7] generics_0.1.3      htmltools_0.5.3     yaml_2.3.5
#> [10] utf8_1.2.2          rlang_1.0.4         pillar_1.8.0
#> [13] glue_1.6.2          withr_2.5.0         DBI_1.1.3
#> [16] dbplyr_2.2.1        readxl_1.4.0        modelr_0.1.8
#> [19] lifecycle_1.0.1     munsell_0.5.0       gtable_0.3.0
#> [22] cellranger_1.1.0    rvest_1.0.2         evaluate_0.15
#> [25] knitr_1.39          tzdb_0.3.0          fastmap_1.1.0
#> [28] fansi_1.0.3         highr_0.9           broom_1.0.0
#> [31] backports_1.4.1     scales_1.2.0        googlesheets4_1.0.0
#> [34] jsonlite_1.8.0      fs_1.5.2            hms_1.1.1
#> [37] digest_0.6.29       stringi_1.7.8       grid_4.0.5
#> [40] cli_3.3.0           tools_4.0.5         magrittr_2.0.3
#> [43] crayon_1.5.1        pkgconfig_2.0.3     ellipsis_0.3.2
#> [46] xml2_1.3.3          reprex_2.0.1        googledrive_2.0.0
#> [49] lubridate_1.8.0     assertthat_0.2.1    rmarkdown_2.14
#> [52] httr_1.4.3          rstudioapi_0.13     R6_2.5.1
#> [55] compiler_4.0.5
```

jennybc commented 2 years ago

Based on a quick read, I think you might be interested in fct()? More in #299.

wtimmerman-fitp commented 2 years ago

Oh, this is perfect! Thank you for the pointer! I think this will solve my issue: the named levels argument is there, there are no errors or warnings if an additional level is listed but not present in the data, and it errors (unlike base::factor()) if a value in the data is not among the supplied levels.

I'll close the issue and look forward to fct() getting into a future release.

(Example below, if anyone is curious.)

#setup ----
library(tidyverse)

fct <- function(x = character(), levels = NULL, na = character()) {
  if (!is.character(x)) {
    cli::cli_abort("{.arg x} must be a character vector")
  }
  if (!is.character(na)) {
    cli::cli_abort("{.arg na} must be a character vector")
  }

  x[x %in% na] <- NA

  if (is.null(levels)) {
    levels <- unique(x)
  } else if (!is.character(levels)) {
    # cli_abort() for consistency with the other checks; rlang::abort() is not attached here
    cli::cli_abort("{.arg levels} must be a character vector")
  }

  invalid <- setdiff(x, c(levels, NA))

  if (length(invalid) > 0 ) {
    cli::cli_abort(c(
      "Values of {.arg x} must be members of {.arg levels}", 
      i = "Invalid value{?s}: {.str {invalid}}"
    ))
  }
  factor(x, levels = levels, exclude = NULL)
}

mtcars2 <-
  mtcars %>% 
  tibble::rownames_to_column(var = "make_model") %>% 
  dplyr::filter(
    dplyr::row_number() <= 5
  )

# Match levels----
match_levels <-
  mtcars2 %>% 
  dplyr::pull(make_model) 

mtcars2_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = match_levels
    )
  )

mtcars2_fct <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = fct(
      make_model,
      levels = match_levels
    )
  )

# Add Levels ----
add_levels <-
  c(match_levels, "Other Car")

mtcars2_add_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = add_levels
    )
  )

mtcars2_add_fct <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = fct(
      make_model,
      levels = add_levels
    )
  )

levels(mtcars2_add_fct$make_model)
#> [1] "Mazda RX4"         "Mazda RX4 Wag"     "Datsun 710"       
#> [4] "Hornet 4 Drive"    "Hornet Sportabout" "Other Car"

# Miss Levels ----
miss_levels <-
  match_levels[-1]

mtcars2_miss_factor <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = base::factor(
      make_model,
      levels = miss_levels
    )
  )

mtcars2_miss_fct <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = fct(
      make_model,
      levels = miss_levels
    )
  )
#> Error in `dplyr::mutate()`:
#> ! Problem while computing `make_model = fct(make_model, levels =
#>   miss_levels)`.
#> Caused by error in `fct()`:
#> ! Values of `x` must be members of `levels`
#> i Invalid value: "Mazda RX4"

Created on 2022-08-09 by the reprex package (v2.0.1)
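
A note for later readers: as far as I know, fct() is exported by forcats from version 1.0.0 onward, with the same x, levels and na arguments as the copy defined above, so the local definition should no longer be needed. A minimal sketch, assuming such a version is installed (the vectors models and model_levels are made up for illustration):

```r
library(forcats)  # assumption: forcats >= 1.0.0, where fct() is exported

# Hypothetical data and a pre-specified level list with one extra level
models <- c("Mazda RX4", "Datsun 710")
model_levels <- c("Mazda RX4", "Datsun 710", "Other Car")

# An extra level that is absent from the data is kept without any message
f <- fct(models, levels = model_levels)
levels(f)

# A data value that is missing from `levels` is an error, whereas
# base::factor() would silently turn it into NA
result <- tryCatch(
  fct(models, levels = model_levels[-1]),
  error = function(e) "error"
)
result
```

The tryCatch() wrapper is only there so the script keeps running; in a pipeline you would normally let the error surface.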

wtimmerman-fitp commented 2 years ago

Also, if anyone runs into the same warning I got with fct_relevel() (Warning: Outer names are only allowed for unnamed scalar atomic inputs): it occurs because fct_relevel() has no levels argument, so levels = use_levels is captured by ... as a named argument. Instead, pass the vector of level names (in this case, use_levels) into the ellipsis unnamed, like this:

mtcars2_fct_relevel <-
  mtcars2 %>% 
  dplyr::mutate(
    make_model = forcats::fct_relevel(
      make_model,
      use_levels
    )
  )
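
The same point can be checked outside of dplyr. A minimal sketch, assuming forcats is installed (the vector sizes and new_order are made up for illustration); it inspects the formal arguments of fct_relevel() and then passes the level vector unnamed:

```r
library(forcats)

# Hypothetical factor; base::factor() sorts the levels alphabetically
sizes <- factor(c("small", "large", "medium"))
new_order <- c("small", "medium", "large")

# fct_relevel() takes .f, ... and after; there is no `levels` formal,
# so `levels = new_order` would end up inside `...` with an outer name
relevel_args <- names(formals(fct_relevel))
relevel_args

# Passing the vector unnamed moves those levels to the front, in order
releveled <- fct_relevel(sizes, new_order)
levels(releveled)
```

Only the level order changes; the underlying values of the factor stay the same.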