[Feature Request]: <add factor_level_method argument for df_explicit_na function>

Feature description

Hi @shajoezhu, in data preprocessing we often use the df_explicit_na function to convert character variables into factors. Internally, df_explicit_na calls factor, which by default takes its levels from sort(unique(x)), so the level order follows R's default, locale-dependent sort. We found the following issue:

The default result of R's sort function is not consistent with the result of SAS's PROC SORT. However, sort with method = "radix" (which orders character vectors in the C locale) and the tidyverse arrange function both give the same order as SAS.

> library(tidyverse)
> data <- data.frame(
+   var = c("Cellulitis","COVID-19","Conjunctivitis","_","-","%")
+ )
> 
> sort(data$var)
[1] "-"              "%"              "_"             
[4] "Cellulitis"     "Conjunctivitis" "COVID-19"      
> sort(unique(data$var), method = "radix")
[1] "%"              "-"              "COVID-19"      
[4] "Cellulitis"     "Conjunctivitis" "_"             
> data %>% arrange(var)
             var
1              %
2              -
3       COVID-19
4     Cellulitis
5 Conjunctivitis
6              _

To get the same factor level order as SAS PROC SORT, we need to set the levels explicitly to sort(unique(x), method = "radix").

> factor(data$var)
[1] Cellulitis     COVID-19       Conjunctivitis _             
[5] -              %             
Levels: - % _ Cellulitis Conjunctivitis COVID-19
> factor(data$var, levels = sort(unique(data$var), method = "radix"))
[1] Cellulitis     COVID-19       Conjunctivitis _             
[5] -              %             
Levels: % - COVID-19 Cellulitis Conjunctivitis _

As a result, the factor levels produced by df_explicit_na are also inconsistent with SAS.

> data1 <- df_explicit_na(data)
> data1$var
[1] Cellulitis     COVID-19       Conjunctivitis _             
[5] -              %             
Levels: - % _ Cellulitis Conjunctivitis COVID-19

We would therefore like to add a factor_level_method argument to df_explicit_na:

When factor_level_method = "data", the factor levels follow the order in which each value first appears in the data, that is, unique(x).
When factor_level_method = "sort_auto" or "default", the factor levels are sort(unique(x)).
When factor_level_method = "sort_radix", the factor levels are sort(unique(x), method = "radix").

Furthermore, could the method be set per variable by passing a named vector, for example factor_level_method = c("a" = "data", "b" = "sort_radix")?
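
To make the request concrete, here is a minimal sketch of the behaviour we have in mind. It is not tern code: levels_by_method and chr_to_factor are hypothetical helper names used only for illustration, and the sketch leaves out the NA handling that df_explicit_na performs. It assumes that columns not named in factor_level_method fall back to "default".

```r
# Hypothetical helper (not part of tern): pick the level order for one
# character vector according to the proposed factor_level_method values.
levels_by_method <- function(x,
                             method = c("default", "sort_auto", "sort_radix", "data")) {
  method <- match.arg(method)
  vals <- unique(x[!is.na(x)])
  switch(method,
    data = vals,                                # order of first appearance
    sort_radix = sort(vals, method = "radix"),  # C-locale order
    sort(vals)                                  # "default" / "sort_auto": locale-dependent
  )
}

# Hypothetical wrapper (not part of tern): convert character columns to factors.
# factor_level_method is either a single string applied to every column or a
# named character vector keyed by column name (assumption: columns that are
# not named fall back to "default").
chr_to_factor <- function(df, factor_level_method = "default") {
  for (col in names(df)) {
    if (!is.character(df[[col]])) next
    method <- if (is.null(names(factor_level_method))) {
      factor_level_method[[1]]
    } else if (col %in% names(factor_level_method)) {
      factor_level_method[[col]]
    } else {
      "default"
    }
    df[[col]] <- factor(df[[col]], levels = levels_by_method(df[[col]], method))
  }
  df
}

# Usage with a per-variable specification
data <- data.frame(
  a = c("b", "_", "A"),
  b = c("Cellulitis", "COVID-19", "_"),
  stringsAsFactors = FALSE
)
res <- chr_to_factor(data, factor_level_method = c("a" = "data", "b" = "sort_radix"))
levels(res$a)  # "b" "_" "A"
levels(res$b)  # "COVID-19" "Cellulitis" "_"
```

Keeping "default" identical to the current sort(unique(x)) behaviour would mean existing calls to df_explicit_na are unaffected unless the new argument is set.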

insightsengineering / tern

[Feature Request]: <add factor_level_method argument for df_explicit_na function> #1322
