insightsengineering / cards

CDISC Analysis Results Data
https://insightsengineering.github.io/cards/

Factors with NA level #255

Open ddsjoberg opened 1 month ago

ddsjoberg commented 1 month ago

I was thinking about a special (and annoying) case where a factor has an explicit level whose name is NA. I think the most common case is when users call forcats::fct_na_value_to_level(). We don't get an error when running ard_continuous(), BUT I think it'll throw a wrench into the shuffle functions (due to all the assumptions we make about NA values).

What do you think we should do, @bzkrouse? I am fine with detecting a level without a name and returning an error.

set.seed(123456)

# Create a version of iris$Species that has missing entries.
sampled_Species <- sample(c(NA, "setosa", "virginica", "versicolor"), size = 150, replace = TRUE)

# By default, forcats::fct_na_value_to_level() turns missing values into an explicit level
# whose name is itself `NA`.
na_Species <- forcats::fct_na_value_to_level(sampled_Species)

my_iris <- iris
my_iris$na_Species <- na_Species
levels(my_iris$na_Species)
#> [1] "setosa"     "versicolor" "virginica"  NA

cards::ard_continuous(
  my_iris,
  by = na_Species,
  variables = Sepal.Length
) |> 
  tail()
#> {cards} data frame: 6 x 10
#>       group1 group1_level     variable stat_name stat_label  stat
#> 1 na_Species           NA Sepal.Length        sd         SD 0.931
#> 2 na_Species           NA Sepal.Length    median     Median   6.4
#> 3 na_Species           NA Sepal.Length       p25  25th Per…   5.5
#> 4 na_Species           NA Sepal.Length       p75  75th Per…   6.7
#> 5 na_Species           NA Sepal.Length       min        Min   4.6
#> 6 na_Species           NA Sepal.Length       max        Max   7.9
#> ℹ 4 more variables: context, fmt_fn, warning, error

Created on 2024-06-02 with reprex v2.1.0
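
For reference, the unnamed level can be detected with base R alone, and giving it an explicit label sidesteps the issue. This is a minimal sketch, not what {cards} does: it assumes only base R, the data are made up, and the label "Missing" is an arbitrary choice.

```r
# Character vector with missing entries (illustrative data).
x <- c("setosa", NA, "virginica", "versicolor", NA)

# exclude = NULL keeps NA as an explicit factor level, comparable to the
# default behaviour of forcats::fct_na_value_to_level().
f_na <- factor(x, exclude = NULL)

# TRUE when the name of any level is itself NA.
has_na_level <- anyNA(levels(f_na))

# One possible workaround: give the missing level an explicit label.
# "Missing" is an arbitrary, illustrative name.
f_named <- f_na
levels(f_named)[is.na(levels(f_named))] <- "Missing"

levels(f_named)
```

With forcats, the same relabelling should be possible through the `level` argument, e.g. `forcats::fct_na_value_to_level(x, level = "Missing")`.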

ddsjoberg commented 2 weeks ago

We just need to add this function:

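# Abort if any of the selected columns is a factor with an NA level.
# `walk()` and `get_cli_abort_call()` are assumed to be available in the package namespace.
check_na_factor_levels <- function(data, variables) {
  walk(
    variables,
    \(variable) {
      # anyNA(levels(x)) is TRUE when the name of a level is itself NA
      if (is.factor(data[[variable]]) && anyNA(levels(data[[variable]]))) {
        cli::cli_abort(
          "Factors with {.val {NA}} levels are not allowed, which are present in column {.val {variable}}.",
          call = get_cli_abort_call()
        )
      }
    }
  )
}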
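
To try the idea outside the package, here is a stand-alone sketch of the same check. It is an approximation under stated assumptions: `check_na_factor_levels_sketch()` is a hypothetical name, base `stop()` stands in for `cli::cli_abort()` and the package-internal `get_cli_abort_call()`, and the data frame is made up.

```r
# Stand-alone sketch of the check (hypothetical name, base R only).
check_na_factor_levels_sketch <- function(data, variables) {
  for (variable in variables) {
    column <- data[[variable]]
    # Only factors can carry an NA level; other column types are skipped.
    if (is.factor(column) && anyNA(levels(column))) {
      stop(
        "Factors with NA levels are not allowed, which are present in column '",
        variable, "'.",
        call. = FALSE
      )
    }
  }
  invisible(data)
}

# Illustrative data: `ok` is a regular factor, `bad` has an explicit NA level.
df <- data.frame(
  ok = factor(c("a", "b", "a")),
  bad = factor(c("a", NA, "b"), exclude = NULL)
)

# Passes silently because `ok` has no NA level.
check_na_factor_levels_sketch(df, "ok")

# Errors on `bad`; the message is captured here instead of halting the script.
result <- tryCatch(
  check_na_factor_levels_sketch(df, c("ok", "bad")),
  error = function(e) conditionMessage(e)
)
```

The check inspects `levels()` rather than the values, so a factor that merely contains missing values, without an NA level, is not flagged.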