tidyverse / forcats

๐Ÿˆ๐Ÿˆ๐Ÿˆ๐Ÿˆ: tools for working with categorical variables (factors)
https://forcats.tidyverse.org/
Other
555 stars 127 forks source link

Feature proposal: enhanced base::factor() with coherance warnings #299

Closed DanChaltiel closed 2 years ago

DanChaltiel commented 2 years ago

The more I review my code and others', the more I realize that base::factor() causes a lot of problems by not throwing warnings when encountering an unknown level. Instead, it silently generates NA, which can cause heavy misunderstanding later.

Since it makes little sense to me to specify levels that do not exist in the actual input, I would expect some verbosity about it.

Here is an example:

library(glue)
library(rlang)
library(magrittr)

df_labels = read.table(header=TRUE, text="
    level label
    setsa SETO                #typo
    verssicolor VERSICO #typo
    virginica VIRGINI
")

x=as.character(iris$Species)

f1 = factor(x, levels=df_labels$level, labels=df_labels$label)
table(f1)
#> f1
#>    SETO VERSICO VIRGINI 
#>       0       0      50

Before calling table(), the user has no idea that the previous call had "failing" cases.

Here is the function I'm using instead:

fct = function(x=character(), levels, labels=levels, ...){
    miss_x = !x %in% levels
    if(any(miss_x)){
        miss_x_s = unique(x[miss_x]) %>% glue_collapse(", ")
        warn(c("Unknown factor level in `x`, NA generated.", 
               x=glue("Unknown levels: {miss_x_s}")))
    }
    factor(x, levels, labels, ...)
}
f2 = fct(x, levels=df_labels$level, labels=df_labels$label)
#> Warning: Unknown factor level in `x`, NA generated.
#> x Unknown levels: setosa, versicolor
table(f2)
#> f2
#>    SETO VERSICO VIRGINI 
#>       0       0      50

Created on 2022-02-13 by the reprex package (v2.0.1)

As more and more people rely on tidyverse to write cleaner code, I would guess this could belong here.

hadley commented 2 years ago

I think this seems like a good idea in principle, but there are some tricky details, like what should this function be called, and should it warn or error? (warnings are easily missed).

DanChaltiel commented 2 years ago

Indeed, I cannot see any case when this "replace unknown with NA" is a wanted behavior, so an error is definitely more appropriate. Maybe an argument could be added for more flexibility, something like fail_if_unknown=TRUE.

For the name, I liked fct(), as this function would be the very base of forcat API, but otherwise you might like fct_create(), fct_encode(), fct_coerce(), ... A bolder but nice move would be to even override base and bluntly call it factor(). Any code that breaks because of this was most likely rather bad code, don't you think?

huftis commented 2 years ago

I like the name fct(). Itโ€™s easy to remember that using fct() is safer than using factor(). And I donโ€™t thinking overriding the base function is a good idea. It will break tons of code (sometimes in a good way, but breaking code is a bad idea).

DanChaltiel commented 2 years ago

Here is another idea for fct() that I really miss from base::factor(): use the ellipsis to propose a level=label syntax:

fct = function(x=character(), ..., levels, labels=levels)
fct(iris$Species, setosa="S", versicolor="Ve", virginica="Vi")
fct(iris$Species, setosa="S", versicolor="Ve") #warning, level "virginica" not labelled

This also asks the more global question: should a missed level be set missing like base::factor() or should it be kept unchanged?

jennybc commented 2 years ago

By coincidence, we're looking at factor-related issues in vroom/readr right now.

FWIW, in terms of what's going on elsewhere in the tidyverse, parse_factor() and col_factor() throw a warning and report problems if levels are explicitly specified and some element of the input is not found in those levels (and also does not match an na string).

I'm linking to the dev documentation, because I happened to work on it today:

https://readr.tidyverse.org/dev/reference/parse_factor.html


I want to push back a bit on this particular framing:

Since it makes little sense to me to specify levels that do not exist in the actual input ...

Often there is external knowledge that informs explicit factor levels, but there's no reason to believe that every instance of such a factor must exhibit at least one specimen of each level. Examples: birth month, day of week, blood type.

I frame it more like so: if you care and know enough to set factor levels explicitly, you probably want to know if the input data exhibits values outside that set. I don't think we disagree about the goal; this is more a comment about wording.

Also relevant: most of the functions we're talking about here also have a way to explicitly declare values that should become NA: factor(exclude =), read.table(na.strings =), readr::read_*(na =), readr::parse_factor(na =).

hadley commented 2 years ago

Something like this maybe:

fct <- function(x = character(), levels = unique(x), na = character()) {
  x[x %in% na] <- NA

  invalid <- setdiff(x, c(levels, NA))

  if (length(invalid) > 0 ) {
    cli::cli_abort(c(
      "Values of {.arg x} must be members of {.arg levels}", 
      i = "Invalid value{?s}: {.str {invalid}}"
    ))
  }
  factor(x, levels = levels, exclude = NULL)
}

fct(c("x", "y", "z"))
#> [1] x y z
#> Levels: x y z
fct(c("x", "y", "z"), c("x", "y"))
#> Error in `fct()`:
#> ! Values of `x` must be members of `levels`
#> โ„น Invalid value: "z"
fct(c("x", "y", "z"), c("x", "y"), na = "z")
#> [1] x    y    <NA>
#> Levels: x y
fct(c("x", "y", "z"), c("x", "y", NA), na = "z")
#> [1] x    y    <NA>
#> Levels: x y <NA>

Created on 2022-05-04 by the reprex package (v2.0.1)

hadley commented 2 years ago

Note to self: if this implemented before R4DS 2e is published, need to replace readr::parse_factor() with fctr().

hadley commented 2 years ago

@jennybc do you think the idea of a set of valid levels is such a key part of factors that you should be forced to supply it on creation?

jennybc commented 2 years ago

do you think the idea of a set of valid levels is such a key part of factors that you should be forced to supply it on creation?

No, that feels too extreme to me. Level discovery seems super useful most of the time.

Looking at the proposal above I felt compelled to check this example:

fct(c("x", "y", "z"), na = "z")
#> [1] x    y    <NA>
#> Levels: x y <NA>

because I predicted that "z" would be among the levels and that the last element of the return value would be NA. I was wrong. Because the default of unique(x) doesn't get forced until after x[x %in% na] <- NA.

The outcome feels correct, but it does feel a bit magical / subtle. If the user specifies unique(x) for themselves, they get a different outcome:

x <- c("x", "y", "z")
fct(x, levels = unique(x), na = "z")
#> [1] x    y    <NA>
#> Levels: x y z

fct( c("x", "y", "z"), levels = unique(c("x", "y", "z")), na = "z")
#> [1] x    y    <NA>
#> Levels: x y z
hadley commented 2 years ago

So maybe make levels default to NULL and then compute explicitly?

Otherwise, I think if you're supplying both levels and na, it's up to you do so consistently.

jennybc commented 2 years ago

So maybe make levels default to NULL and then compute explicitly?

Yeah I think that does what people expect and has less potential for confusion.

Otherwise, I think if you're supplying both levels and na, it's up to you do so consistently.

Agreed.

hadley commented 2 years ago

Plus a bit more argument checking:

fct <- function(x = character(), levels = NULL, na = character()) {
  if (!is.character(x)) {
    cli::cli_abort("{.arg x} must be a character vector")
  }
  if (!is.character(na)) {
    cli::cli_abort("{.arg na} must be a character vector")
  }

  x[x %in% na] <- NA

  if (is.null(levels)) {
    levels <- unique(x)
  } else if (!is.character(levels)) {
    abort("`{.arg levels} must be a character vector")
  }

  invalid <- setdiff(x, c(levels, NA))

  if (length(invalid) > 0 ) {
    cli::cli_abort(c(
      "Values of {.arg x} must be members of {.arg levels}", 
      i = "Invalid value{?s}: {.str {invalid}}"
    ))
  }
  factor(x, levels = levels, exclude = NULL)
}

fct(c("x", "y", "z"))
fct(c("x", "y", "z"), c("x", "y"))
fct(c("x", "y", "z"), na = "z")
fct(c("x", "y", "z"), c("x", "y"), na = "z")
fct(c("x", "y", "z"), c("x", "y", NA), na = "z")