Hi @shajoezhu, in data preprocessing we often use the df_explicit_na function to convert character variables into factor variables.
The df_explicit_na source code calls factor, which by default sets the levels to sort(unique(x)), that is, the sorted unique values of the data.
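We found the following issue:
R's sort function with its default method does not always give the same order as SAS's proc sort. For character vectors the default method collates according to the current locale, so the order can also differ between environments.
However, sort with method = "radix", which always uses C-locale collation, and the tidyverse arrange function give the same results as SAS.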
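The difference can be seen on a small vector. This is only an illustration with made-up values; the result of the default sort is not shown because it depends on the locale of the session.

```r
x <- c("b", "A", "a", "B")

# Default method: for character vectors the order depends on the locale.
sort(x)

# Radix method: always C-locale (byte) order, so uppercase sorts before lowercase.
sort(x, method = "radix")
# [1] "A" "B" "a" "b"
```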
So we would like to add a factor_level_method argument to df_explicit_na:
When factor_level_method = "data", the levels follow the order in which each value first appears in the data, that is, unique(x).
When factor_level_method = "sort_auto" or "default", the levels are sort(unique(x)).
When factor_level_method = "sort_radix", the levels are sort(unique(x), method = "radix").
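A minimal sketch of how the three options could map to level orders. The helper name resolve_levels is hypothetical and is not part of df_explicit_na or any released package; it only illustrates the proposed behaviour.

```r
# Hypothetical helper (an assumption for illustration, not an existing API).
resolve_levels <- function(x,
                           method = c("default", "sort_auto", "sort_radix", "data")) {
  method <- match.arg(method)
  ux <- unique(x[!is.na(x)])  # drop NA so it never becomes a level here
  switch(method,
    data = ux,                                # order of first appearance
    sort_radix = sort(ux, method = "radix"),  # C-locale order
    default = ,                               # falls through to sort_auto
    sort_auto = sort(ux)                      # locale-dependent default sort
  )
}

x <- c("b", "A", "a", "B", "a")
levels(factor(x, levels = resolve_levels(x, "data")))
# [1] "b" "A" "a" "B"
levels(factor(x, levels = resolve_levels(x, "sort_radix")))
# [1] "A" "B" "a" "b"
```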
Furthermore, could the method be set separately for specific variables by passing a named vector,
for example factor_level_method = c("a" = "data", "b" = "sort_radix")?
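One way the named-vector form could behave is sketched below. The function apply_factor_levels is hypothetical and does not reproduce what df_explicit_na actually does (for example, its handling of missing values); it also assumes that variables not named in the vector fall back to "default", which the request does not specify.

```r
# Hypothetical sketch (an assumption for illustration, not an existing API).
apply_factor_levels <- function(df, factor_level_method = "default") {
  char_vars <- names(df)[vapply(df, is.character, logical(1))]
  for (v in char_vars) {
    # An unnamed value applies to every variable; a named vector applies per
    # variable, and variables it does not name are assumed to use "default".
    method <- if (is.null(names(factor_level_method))) {
      factor_level_method[[1]]
    } else if (v %in% names(factor_level_method)) {
      factor_level_method[[v]]
    } else {
      "default"
    }
    ux <- unique(df[[v]][!is.na(df[[v]])])
    lv <- switch(method,
      data = ux,
      sort_radix = sort(ux, method = "radix"),
      default = ,
      sort_auto = sort(ux),
      stop("Unknown factor_level_method: ", method)
    )
    df[[v]] <- factor(df[[v]], levels = lv)
  }
  df
}

df <- data.frame(a = c("b", "A", "a"), b = c("b", "A", "a"),
                 stringsAsFactors = FALSE)
out <- apply_factor_levels(df, c("a" = "data", "b" = "sort_radix"))
levels(out$a)
# [1] "b" "A" "a"
levels(out$b)
# [1] "A" "a" "b"
```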
Code of Conduct
[X] I agree to follow this project's Code of Conduct.
Contribution Guidelines
[X] I agree to follow this project's Contribution Guidelines.
Security Policy
[X] I agree to follow this project's Security Policy.
Feature description
Because df_explicit_na relies on the default sort, the factor levels it produces can also be inconsistent with SAS.
To get the same factor level order as SAS proc sort, the levels need to be set to sort(unique(x), method = "radix").