`ard_stack()` doesn't work without a `by` argument specified

ddsjoberg commented 1 month ago

It's useful to use ard_stack() without a by argument for the .attributes and .missing arguments.

library(cards)
packageVersion("cards")
#> [1] '0.1.0.9023'

ard_stack(
  data = ADSL, 
  ard_categorical(variables = "AGEGR1"),
  ard_continuous(variables = "AGE")
)
#> Error in `ard_stack()`:
#> ! Error processing `by` argument.
#> ! ℹ In argument: `ard_categorical(variables = "AGEGR1")`. Caused by error in
#>   `ard_stack()`: ! The `data` argument cannot be missing.
#> ℹ Select among columns "STUDYID", "USUBJID", "SUBJID", "SITEID", "SITEGR1",
#>   "ARM", "TRT01P", "TRT01PN", "TRT01A", "TRT01AN", "TRTSDT", "TRTEDT",
#>   "TRTDUR", "AVGDD", "CUMDOSE", "AGE", "AGEGR1", "AGEGR1N", …, "DCREASCD", and
#>   "MMSETOT"

^{Created on 2024-05-08 with reprex v2.1.0}

ddsjoberg commented 1 month ago

I think the issue here is that when by is not specified, the function performs location matching of the argument, and therefore, thinks by = ard_categorical(variables = "AGEGR1").

If we want the by argument to be optional, we should probably move it to after the ... ? The other alternative would be to keep it where it is, but we can't have a default value, e.g. ard_stack(data, by, ..., .attributes = FALSE, .missing = FALSE, .shuffle = FALSE), which would require the user to input something.

@bzkrouse what do you think? Do you have a preference for one of these options? I assume tfrmt users could also want this functionality to shuffle easily?

library(cards)
packageVersion("cards")
#> [1] '0.1.0.9023'

# everything does work if we specify `by=NULL`
ard_stack(
  data = ADSL, 
  by = NULL,
  ard_categorical(variables = "AGEGR1"),
  ard_continuous(variables = "AGE")
)
#> {cards} data frame: 17 x 9
#>    variable variable_level   context stat_name stat_label   stat
#> 1    AGEGR1            <65 categori…         n          n     33
#> 2    AGEGR1            <65 categori…         N          N    254
#> 3    AGEGR1            <65 categori…         p          %   0.13
#> 4    AGEGR1            >80 categori…         n          n     77
#> 5    AGEGR1            >80 categori…         N          N    254
#> 6    AGEGR1            >80 categori…         p          %  0.303
#> 7    AGEGR1          65-80 categori…         n          n    144
#> 8    AGEGR1          65-80 categori…         N          N    254
#> 9    AGEGR1          65-80 categori…         p          %  0.567
#> 10      AGE                continuo…         N          N    254
#> 11      AGE                continuo…      mean       Mean 75.087
#> 12      AGE                continuo…        sd         SD  8.246
#> 13      AGE                continuo…    median     Median     77
#> 14      AGE                continuo…       p25  25th Per…     70
#> 15      AGE                continuo…       p75  75th Per…     81
#> 16      AGE                continuo…       min        Min     51
#> 17      AGE                continuo…       max        Max     89
#> ℹ 3 more variables: fmt_fn, warning, error

^{Created on 2024-05-08 with reprex v2.1.0}

ddsjoberg commented 1 month ago

OPTION 1: Move the argument

ard_stack(data, ..., .attributes  = FALSE, .by = NULL, .missing  = FALSE, .shuffle  = FALSE)

the reason I like this option is because we could easily extend ard_stack() with additional arguments, and it wouuld be clear that they all appear at the end. Otherwise, if we kept the position where it is now, and wanted to add new arguments it could be very disruptive.

We could also make the default .by = dplyr::group_vars(data) like the ard summary functions, so if you do use grouped data frames, you wouldn't need to supply a value.

OPTION 2: Remove the argument default

ard_stack(data, by, ..., .attributes  = FALSE, .missing  = FALSE, .shuffle  = FALSE)

I like that it feels like the natural place for the by argument. But with that said, users can still place the argument there with ard_stack(ADSL, .by = ARM, ...)

I guess I am leaning toward option 1

bzkrouse commented 1 month ago

In support of option #1 and the proposed default of group_vars(data) would be convenient!!

insightsengineering / cards

`ard_stack()` doesn't work without a `by` argument specified #243