Closed vincentarelbundock closed 3 years ago
Especially for interactive work, I find this a bit inconvenient. It also requires changing the original data, which might have unintended side effects later on. One way to implement this would be to have something like `FactorNA()`, which takes care of the recode, or to have a global argument `missing` or so for `datasummary`, which runs a function similar to `fct_explicit_na` on all factor variables (on a copy of the data). From the perspective of `datasummary_crosstab`, the latter would be better, because otherwise one would need some additional mechanism for this function.
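To make the "on a copy of the data" idea concrete, here is a minimal base R sketch. The helper name `explicit_na` and the `"(Missing)"` label are hypothetical choices for illustration, not part of modelsummary or forcats; the sketch only assumes that a recode of this kind is what such an argument would do internally.

```r
# Hypothetical helper: turn NA into an explicit level on every factor column.
# R copies function arguments on modification, so the caller's data frame
# is left untouched.
explicit_na <- function(data, na_label = "(Missing)") {
  for (nm in names(data)) {
    x <- data[[nm]]
    if (is.factor(x) && anyNA(x)) {
      x <- addNA(x)                               # add NA as a level
      levels(x)[is.na(levels(x))] <- na_label     # give that level a visible name
      data[[nm]] <- x
    }
  }
  data
}

dat <- data.frame(g = factor(c("a", NA, "b")), n = c(1, NA, 3))
out <- explicit_na(dat)
levels(out$g)   # the factor in the copy now has an explicit missing level
anyNA(dat$g)    # the original data still contains NA
```

Numeric columns are deliberately left alone, since the recode only makes sense for categorical variables.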
I like the `FactorNA` idea a lot because it is consistent with the spirit of `datasummary` and with the language of the `tables` package (which we use as a dependency).
I have a family event to attend so I can't look into this now, but FWIW, this is the code of the `Factor` pseudo-function from the `tables` package. Perhaps we can modify it to make things work, but I suspect it will be a bit tricky.
Factor <- function(x, name = deparse(expr), levelnames = levels(x),
                   texify = getOption("tables.texify", FALSE),
                   expr = substitute(x), override = TRUE) {
  force(name)
  force(expr)
  x <- as.factor(x)
  if (texify) {
    levs <- sprintf("%s", levels(x)) # convert NA to "NA"
    levels(x) <- texify(levs)
  }
  force(levelnames)
  RowFactor(x, name = name, levelnames = levelnames, spacing = FALSE,
            texify = texify, expr = expr, override = override)
}
I briefly looked into this. The problem with `FactorNA` is that it won't work for maybe the two most common use cases, namely `datasummary_balance` and `datasummary_crosstab`. These functions (as the example you copied from the docs shows) don't involve `Factor()`, so there's no way for the user to switch to `FactorNA`. Maybe having `missing` as a new argument for `datasummary` is a better idea?
What I had in mind was:
datasummary_crosstab(var1 * FactorNA(var2) ~ var3, data)
One additional problem with including an extra argument (beyond the issue already raised) is that `datasummary` has to deal with at least 3 distinct missingness-related problems, including:
- `na.rm=TRUE` for statistics (e.g., `datasummary(x ~ y * mean * Arguments(na.rm=TRUE))`)
- Should `NA` be treated as a factor category in `datasummary_balance`, `datasummary_crosstab`, and `datasummary_skim(type="categorical")`?
This means we can't just have a single argument called `missing` for the factor-category problem (Problem 3), since that would be a confusing user interface for people who are trying to solve the other two.
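Two of these problems, missing values inside a statistic and missing values in a grouping factor, are independent of each other. The base R sketch below shows this with made-up data; it does not call modelsummary at all, and the variable names are illustrative only.

```r
# Made-up data: one NA in the numeric variable, one NA in the factor.
dat <- data.frame(
  y = c(1, 2, NA, 4, 5, 6),
  g = factor(c("a", "b", "a", NA, "b", "a"))
)

# Missing values inside a statistic: controlled by na.rm.
mean_default <- mean(dat$y)                 # NA, because y contains a missing value
mean_narm    <- mean(dat$y, na.rm = TRUE)   # mean of the five observed values

# Missing values in a categorical variable: should NA get its own row?
counts_default <- table(dat$g)                  # NA is silently omitted
counts_with_na <- table(dat$g, useNA = "ifany") # NA is tabulated as a category
```

Setting `na.rm = TRUE` does nothing for the second question, and `useNA` does nothing for the first, which is why a single catch-all argument would be ambiguous.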
One "stricter" alternative would be to display a line for `NA`s by default, and to require explicit intervention by the user to remove that line. To do that, users could pre-process their data with `na.rm`. Alternatively, they could simply convert the variable to a factor with `factor`:
dat$x <- factor(dat$x)
Since `exclude=NA` by default in `factor`, the `NA`s won't be assigned to a distinct level and they will be omitted from the table automatically.
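The `exclude` behavior can be checked directly in base R. This is a small sketch with a made-up vector; it only illustrates how `factor()` treats `NA` and says nothing about modelsummary internals.

```r
x <- c("a", "b", NA, "a")

# Default (exclude = NA): NA is not a level.
f_default <- factor(x)
levels(f_default)

# exclude = NULL keeps NA as a level of its own.
f_explicit <- factor(x, exclude = NULL)
levels(f_explicit)

# Calling factor() again with the default drops the NA level,
# which is the one-liner suggested above.
f_dropped <- factor(f_explicit)
levels(f_dropped)
```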
This would make including `NA` more convenient, and excluding `NA` less convenient. In interactive use, we probably want the safer or "more conservative" choice of keeping `NA`s, but when we prep tables for publication it's OK to invest one more line of code to clear out the `NA`s, so the latter should be the less convenient non-default option.
As of https://github.com/vincentarelbundock/modelsummary/commit/3d6a2bcf570f6ee9aea5a1db688332805d62bcdb the `NA`s should appear by default in `datasummary_balance` and `datasummary_crosstab` output. Users who wish to make them disappear can do:
dat$x <- factor(dat$x)
datasummary_crosstab(x * y ~ z, data = dat)
Looks good!
In PR https://github.com/vincentarelbundock/modelsummary/pull/284 @elbersb writes:
This issue also arises in the other `datasummary_*` family members. Currently, the recommended practice is to pre-process the data to make `NA` an explicit category. I include a screenshot of the vignette below. My current view is that it's good to be explicit, and that a one-liner to clean up the factor beforehand is not much harder to use than an additional argument. However, that view is not too strongly held, so I'm happy to discuss.