Closed DanChaltiel closed 2 years ago
I think this seems like a good idea in principle, but there are some tricky details, like what should this function be called, and should it warn or error? (warnings are easily missed).
Indeed, I cannot see any case when this "replace unknown with NA" is a wanted behavior, so an error is definitely more appropriate. Maybe an argument could be added for more flexibility, something like fail_if_unknown=TRUE
.
For the name, I liked fct()
, as this function would be the very base of forcat
API, but otherwise you might like fct_create()
, fct_encode()
, fct_coerce()
, ... A bolder but nice move would be to even override base
and bluntly call it factor()
. Any code that breaks because of this was most likely rather bad code, don't you think?
I like the name fct()
. Itโs easy to remember that using fct()
is safer than using factor()
. And I donโt thinking overriding the base function is a good idea. It will break tons of code (sometimes in a good way, but breaking code is a bad idea).
Here is another idea for fct()
that I really miss from base::factor()
: use the ellipsis to propose a level=label
syntax:
fct = function(x=character(), ..., levels, labels=levels)
fct(iris$Species, setosa="S", versicolor="Ve", virginica="Vi")
fct(iris$Species, setosa="S", versicolor="Ve") #warning, level "virginica" not labelled
This also asks the more global question: should a missed level be set missing like base::factor()
or should it be kept unchanged?
By coincidence, we're looking at factor-related issues in vroom/readr right now.
FWIW, in terms of what's going on elsewhere in the tidyverse, parse_factor()
and col_factor()
throw a warning and report problems if levels
are explicitly specified and some element of the input is not found in those levels (and also does not match an na
string).
I'm linking to the dev documentation, because I happened to work on it today:
https://readr.tidyverse.org/dev/reference/parse_factor.html
I want to push back a bit on this particular framing:
Since it makes little sense to me to specify levels that do not exist in the actual input ...
Often there is external knowledge that informs explicit factor levels, but there's no reason to believe that every instance of such a factor must exhibit at least one specimen of each level. Examples: birth month, day of week, blood type.
I frame it more like so: if you care and know enough to set factor levels explicitly, you probably want to know if the input data exhibits values outside that set. I don't think we disagree about the goal; this is more a comment about wording.
Also relevant: most of the functions we're talking about here also have a way to explicitly declare values that should become NA
: factor(exclude =)
, read.table(na.strings =)
, readr::read_*(na =)
, readr::parse_factor(na =)
.
Something like this maybe:
fct <- function(x = character(), levels = unique(x), na = character()) {
x[x %in% na] <- NA
invalid <- setdiff(x, c(levels, NA))
if (length(invalid) > 0 ) {
cli::cli_abort(c(
"Values of {.arg x} must be members of {.arg levels}",
i = "Invalid value{?s}: {.str {invalid}}"
))
}
factor(x, levels = levels, exclude = NULL)
}
fct(c("x", "y", "z"))
#> [1] x y z
#> Levels: x y z
fct(c("x", "y", "z"), c("x", "y"))
#> Error in `fct()`:
#> ! Values of `x` must be members of `levels`
#> โน Invalid value: "z"
fct(c("x", "y", "z"), c("x", "y"), na = "z")
#> [1] x y <NA>
#> Levels: x y
fct(c("x", "y", "z"), c("x", "y", NA), na = "z")
#> [1] x y <NA>
#> Levels: x y <NA>
Created on 2022-05-04 by the reprex package (v2.0.1)
Note to self: if this implemented before R4DS 2e is published, need to replace readr::parse_factor()
with fctr()
.
@jennybc do you think the idea of a set of valid levels is such a key part of factors that you should be forced to supply it on creation?
do you think the idea of a set of valid levels is such a key part of factors that you should be forced to supply it on creation?
No, that feels too extreme to me. Level discovery seems super useful most of the time.
Looking at the proposal above I felt compelled to check this example:
fct(c("x", "y", "z"), na = "z")
#> [1] x y <NA>
#> Levels: x y <NA>
because I predicted that "z"
would be among the levels and that the last element of the return value would be NA
. I was wrong. Because the default of unique(x)
doesn't get forced until after x[x %in% na] <- NA
.
The outcome feels correct, but it does feel a bit magical / subtle. If the user specifies unique(x)
for themselves, they get a different outcome:
x <- c("x", "y", "z")
fct(x, levels = unique(x), na = "z")
#> [1] x y <NA>
#> Levels: x y z
fct( c("x", "y", "z"), levels = unique(c("x", "y", "z")), na = "z")
#> [1] x y <NA>
#> Levels: x y z
So maybe make levels
default to NULL
and then compute explicitly?
Otherwise, I think if you're supplying both levels
and na
, it's up to you do so consistently.
So maybe make levels default to NULL and then compute explicitly?
Yeah I think that does what people expect and has less potential for confusion.
Otherwise, I think if you're supplying both
levels
andna
, it's up to you do so consistently.
Agreed.
Plus a bit more argument checking:
fct <- function(x = character(), levels = NULL, na = character()) {
if (!is.character(x)) {
cli::cli_abort("{.arg x} must be a character vector")
}
if (!is.character(na)) {
cli::cli_abort("{.arg na} must be a character vector")
}
x[x %in% na] <- NA
if (is.null(levels)) {
levels <- unique(x)
} else if (!is.character(levels)) {
abort("`{.arg levels} must be a character vector")
}
invalid <- setdiff(x, c(levels, NA))
if (length(invalid) > 0 ) {
cli::cli_abort(c(
"Values of {.arg x} must be members of {.arg levels}",
i = "Invalid value{?s}: {.str {invalid}}"
))
}
factor(x, levels = levels, exclude = NULL)
}
fct(c("x", "y", "z"))
fct(c("x", "y", "z"), c("x", "y"))
fct(c("x", "y", "z"), na = "z")
fct(c("x", "y", "z"), c("x", "y"), na = "z")
fct(c("x", "y", "z"), c("x", "y", NA), na = "z")
The more I review my code and others', the more I realize that
base::factor()
causes a lot of problems by not throwing warnings when encountering an unknown level. Instead, it silently generatesNA
, which can cause heavy misunderstanding later.Since it makes little sense to me to specify levels that do not exist in the actual input, I would expect some verbosity about it.
Here is an example:
Before calling
table()
, the user has no idea that the previous call had "failing" cases.Here is the function I'm using instead:
Created on 2022-02-13 by the reprex package (v2.0.1)
As more and more people rely on
tidyverse
to write cleaner code, I would guess this could belong here.