Closed gsimchoni closed 3 years ago
I think that it is related to diet being character instead of factor. I'll take a deeper look.
I am seeing something similar even when variables are factor and not character. Thanks for looking into this!
library(recipes)
#> Warning: package 'recipes' was built under R version 3.6.2
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(tidyverse)
a <- data.frame(y = 1:4, x = factor(c(letters[1:3], NA)))
b <- data.frame(y = 1:5, x = factor(c(letters[1:4], NA)))
recipe(y ~ ., data = a) %>%
step_unknown(all_nominal(), new_level = "missing") %>%
step_novel(all_nominal()) %>%
prep() %>%
bake(b)
#> # A tibble: 5 x 2
#> x y
#> <fct> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 <NA> 4
#> 5 missing 5
recipe(y ~ ., data = a) %>%
step_novel(all_nominal()) %>%
step_unknown(all_nominal(), new_level = "missing") %>%
prep() %>%
bake(b)
#> # A tibble: 5 x 2
#> x y
#> <fct> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 new 4
#> 5 missing 5
I looked into this a bit today, and this is happening because step_unknown() sets the factor levels using the levels already stored in the object, after replacing the NA values.
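A rough base-R sketch of that behaviour, for illustration only: this is not the package's code, and the names train_levels, new_x and baked are made up for the example. It assumes the step remembers the levels seen in the training data a and reapplies them to the new data b after swapping NA for "missing".

```r
# Levels seen at prep() time on training data `a`
# (assumption: this stands in for what the step stores).
train_levels <- c("a", "b", "c")

# Column x of the new data `b`: one unseen level ("d") and one NA.
new_x <- factor(c("a", "b", "c", "d", NA))

# 1. Replace NA with the new level.
x_chr <- as.character(new_x)
x_chr[is.na(x_chr)] <- "missing"

# 2. Rebuild the factor from the stored levels plus the new level.
#    "d" is not among them, so it silently becomes NA.
baked <- factor(x_chr, levels = c(train_levels, "missing"))

baked
# values: a, b, c, NA, "missing" (same pattern as the first output above)
```

A later step that looks for unseen levels only finds an NA at that point, which is consistent with step_novel() having nothing to relabel.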
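In the example above, this means that when step_unknown() comes first, the steps go:
1. Replace NA values with the new level ("missing" in this example).
2. Set the factor levels using object, introducing NA values for any level not stored there (here "d").
3. Go to step_novel(), but there are no novel levels anymore, just an NA value.
What do we think the best option is? 🤔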
- Also replace these NA values in step_unknown(), which would result in two "missing" entries and no "new" entries when folks use step_unknown() before step_novel(). Is this better or less surprising at all? Or maybe worse, actually?
- [...] step_unknown()?
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.
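For reference, the first option raised in the discussion above (having step_unknown() also catch values that would otherwise turn into NA) can be sketched in base R. This is only an illustration under the same assumptions as the example data in this thread; train_levels, new_x and baked_option are hypothetical names, and nothing here reflects how the package was eventually changed.

```r
# Levels seen in training data `a` (assumed to be what the step stores).
train_levels <- c("a", "b", "c")

# Column x of the new data `b`.
new_x <- factor(c("a", "b", "c", "d", NA))

x_chr <- as.character(new_x)

# Treat both NA and any level unseen in training as "missing".
is_unknown <- is.na(x_chr) | !(x_chr %in% train_levels)
x_chr[is_unknown] <- "missing"

baked_option <- factor(x_chr, levels = c(train_levels, "missing"))

baked_option
# values: a, b, c, "missing", "missing" (two "missing" entries, no "new")
```

With this variant a following step_novel() would have nothing left to label, which is the trade-off the question above is asking about.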
Hi,
(Love this package and all the work you guys are doing)
Using the reproducible example from the docs, it works:
Doing step_novel() then step_unknown() also works as expected:
But step_novel() after step_unknown() ...
If this is a bug - OK, if this is a feature and I'm missing something, could you please explain? Thanks.