insightsengineering / tern

Table, Listings, and Graphs (TLG) library for common outputs used in clinical trials
https://insightsengineering.github.io/tern/

[Feature Request]: <add factor_level_method argument for df_explicit_na function> #1322

Open kaipingyang opened 1 month ago

kaipingyang commented 1 month ago

Feature description

Hi @shajoezhu, in data preprocessing we often use the df_explicit_na function to convert character variables into factors. Internally, df_explicit_na calls factor(), which by default sets the levels to sort(unique(x)). We found the following issue:

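The default ordering that R's sort function produces for character vectors is not consistent with the output of SAS proc sort, because the default collates according to the current locale. In contrast, sort with method = "radix" (which uses C-locale ordering for character vectors) and the tidyverse arrange function give the same results as SAS: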

> library(tidyverse)
> data <- data.frame(
+   var = c("Cellulitis","COVID-19","Conjunctivitis","_","-","%")
+ )
> 
> sort(data$var)
[1] "-"              "%"              "_"             
[4] "Cellulitis"     "Conjunctivitis" "COVID-19"      
> sort(unique(data$var), method = "radix")
[1] "%"              "-"              "COVID-19"      
[4] "Cellulitis"     "Conjunctivitis" "_"             
> data %>% arrange(var)
             var
1              %
2              -
3       COVID-19
4     Cellulitis
5 Conjunctivitis
6              _

To get the same factor level order as SAS proc sort, we need to set the levels explicitly to sort(unique(x), method = "radix"):

> factor(data$var)
[1] Cellulitis     COVID-19       Conjunctivitis _             
[5] -              %             
Levels: - % _ Cellulitis Conjunctivitis COVID-19
> factor(data$var, levels = sort(unique(data$var), method = "radix"))
[1] Cellulitis     COVID-19       Conjunctivitis _             
[5] -              %             
Levels: % - COVID-19 Cellulitis Conjunctivitis _
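
As a stopgap until such an option exists, the same idea can be applied to every character column before the data frame is passed on. The following is a minimal sketch in base R; chr_to_factor_radix is a hypothetical helper name, not a tern function, and it assumes that later steps leave the levels of existing factors in place, which is worth verifying.

```r
# Hypothetical helper (not part of tern): convert every character column to a
# factor whose levels follow C-locale (byte) order.
chr_to_factor_radix <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    # sort() drops NA by default, so NA never becomes a level here
    factor(x, levels = sort(unique(x), method = "radix"))
  })
  df
}

data <- data.frame(
  var = c("Cellulitis", "COVID-19", "Conjunctivitis", "_", "-", "%"),
  n = 1:6,
  stringsAsFactors = FALSE
)
data2 <- chr_to_factor_radix(data)
levels(data2$var)
# [1] "%"  "-"  "COVID-19"  "Cellulitis"  "Conjunctivitis"  "_"
```

Non-character columns such as n are returned unchanged.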

As a result, the factor levels produced by df_explicit_na are also inconsistent with the SAS ordering:

> data1 <- df_explicit_na(data)
> data1$var
[1] Cellulitis     COVID-19       Conjunctivitis _             
[5] -              %             
Levels: - % _ Cellulitis Conjunctivitis COVID-19

So we would like to add a factor_level_method argument to the df_explicit_na function.

Furthermore, could the level method be set per variable by passing a named vector, for example factor_level_method = c("a" = "data", "b" = "sort_radix")?
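
To make the request concrete, here is a rough sketch of how such an argument might behave, written in base R and independent of tern. The function names level_order and apply_level_method are hypothetical, and the method semantics are assumptions: "sort" is taken to mean the current default, "sort_radix" means sort(..., method = "radix"), and "data" is assumed to mean order of first appearance in the data.

```r
# Sketch only: function names and method semantics are assumptions,
# not an existing tern API.
level_order <- function(x, method = c("sort", "sort_radix", "data")) {
  method <- match.arg(method)
  vals <- unique(x[!is.na(x)])
  switch(method,
    sort = sort(vals),                         # locale-dependent (current default)
    sort_radix = sort(vals, method = "radix"), # C-locale / byte order
    data = vals                                # assumed: order of first appearance
  )
}

apply_level_method <- function(df, factor_level_method = "sort") {
  chr_vars <- names(df)[vapply(df, is.character, logical(1))]
  for (v in chr_vars) {
    method <- if (is.null(names(factor_level_method))) {
      factor_level_method[[1]]   # unnamed: one method for all variables
    } else if (v %in% names(factor_level_method)) {
      factor_level_method[[v]]   # named: per-variable method
    } else {
      "sort"                     # variables not listed keep the default
    }
    df[[v]] <- factor(df[[v]], levels = level_order(df[[v]], method))
  }
  df
}

df <- data.frame(
  a = c("b", "A", "_"),
  b = c("b", "A", "_"),
  stringsAsFactors = FALSE
)
out <- apply_level_method(df, c("a" = "data", "b" = "sort_radix"))
levels(out$a)  # "b" "A" "_"
levels(out$b)  # "A" "_" "b"
```

Falling back to "sort" for unlisted variables keeps existing behavior unchanged unless a user opts in, which seems like the safest default for a preprocessing function.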

Melkiades commented 1 month ago

@kaipingyang I think this makes perfect sense! Feel free to open a PR with the above new parameter. I can personally review it so we get this in asap!