`datasummary_*` NA category display

vincentarelbundock commented 3 years ago

In PR https://github.com/vincentarelbundock/modelsummary/pull/284 @elbersb writes:

One thing that is still rather inconvenient is the handling of NA. Especially for more interactive work, having an argument missing or na.rm that adjusts whether NA are shown would be really helpful. Given your stance on new arguments, I wanted to check with you first. I could imagine having this functionality just in datasummary_crosstab, but maybe it's worth thinking about a mechanism that also works for datasummary and the other helper functions.

This issue also arises in the other datasummary_* family members. Currently, the recommended practice is to pre-process the data to make NA an explicit category. I include a screenshot of the vignette below.

My current view is that it's good to be explicit, and that a one-liner to clean up the factor beforehand is not much harder to use than an additional argument. However, that view is not too strongly held, so I'm happy to discuss.
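For reference, the pre-processing step mentioned above can be a single line of base R. This is a minimal sketch on a made-up data frame `dat` with a factor column `x` (both names are assumptions for illustration); `addNA()` is a base R function.

```r
# Toy data (hypothetical): a factor with one missing value
dat <- data.frame(x = factor(c("a", "b", NA, "a")))

# By default, table() leaves the missing value out
table(dat$x)

# The one-liner: promote NA to an explicit factor level
dat$x <- addNA(dat$x)

# NA is now tabulated as its own category
table(dat$x)
```

Note that this overwrites `dat$x`, so any later code that relies on `is.na(dat$x)` still works, but `levels(dat$x)` now contains `NA`.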

elbersb commented 3 years ago

Especially for interactive work, I find this a bit inconvenient. It also requires changing the original data, which might have unintended side effects later on. One way to implement this would be to have something like FactorNA(), which takes care of the recode, or to have a global argument (missing or similar) for datasummary, which runs a function similar to fct_explicit_na on all factor variables (on a copy of the data). From the perspective of datasummary_crosstab, the latter would be better, because otherwise one would need some additional mechanism for this function.
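To make the second option concrete, here is a rough sketch of what "run the recode on all factor variables, on a copy of the data" could look like in base R. The helper name `explicit_na_factors()` is hypothetical and not part of modelsummary; it uses `addNA()` from base R rather than `fct_explicit_na`.

```r
# Hypothetical helper (not part of modelsummary): returns a modified copy of
# `data` in which every factor column with missing values gets NA as a level.
explicit_na_factors <- function(data) {
  is_fct <- vapply(data, is.factor, logical(1))
  # ifany = TRUE adds the NA level only to factors that actually contain NA
  data[is_fct] <- lapply(data[is_fct], addNA, ifany = TRUE)
  data
}

# Toy data for illustration
dat <- data.frame(
  g = factor(c("a", NA, "b")),
  h = factor(c("u", "v", "v")),
  n = c(1, NA, 3)
)

out <- explicit_na_factors(dat)

levels(out$g)  # now includes NA as a level
levels(dat$g)  # the original data frame is unchanged
```

Because R copies on modification, the caller's data frame is left alone, which addresses the side-effect concern without any extra bookkeeping.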

vincentarelbundock commented 3 years ago

I like the FactorNA idea a lot because it is consistent with the spirit of datasummary and with the language of the tables package (which we use as a dependency).

I have a family event to attend so I can't look into this now, but FWIW, this is the code of the Factor pseudo-function from the tables package. Perhaps we can modify it to make things work, but I suspect it will be a bit tricky.

Factor <- function(x, name = deparse(expr), levelnames=levels(x),
                   texify = getOption("tables.texify", FALSE), expr = substitute(x), override = TRUE) {
    force(name)
    force(expr)
    x <- as.factor(x)
    if (texify) {
        levs <- sprintf("%s", levels(x)) # convert NA to "NA"
        levels(x) <- texify(levs)
    }
    force(levelnames)
    RowFactor(x, name = name, levelnames = levelnames, spacing=FALSE, 
              texify = texify, expr = expr, override = override)
}

elbersb commented 3 years ago

I briefly looked into this. The problem with FactorNA is that it won't work for maybe the two most common use cases, namely datasummary_balance and datasummary_crosstab. These functions (as the example you copied from the docs shows) don't involve Factor(), so there's no way for the user to switch to FactorNA. Maybe having missing as a new argument for datasummary is a better idea?

vincentarelbundock commented 3 years ago

What I had in mind was:

datasummary_crosstab(var1 * FactorNA(var2) ~ var3, data = data)

One additional problem with including an extra argument (beyond the issue already raised) is that datasummary has to deal with at least 3 distinct missingness-related problems:

1. na.rm=TRUE for statistics (e.g., datasummary(x ~ y * mean * Arguments(na.rm=TRUE)))
2. Cross-tab cells with no observations (see discussion here)
3. Should NA be treated as a factor category in datasummary_balance and datasummary_crosstab and datasummary_skim(type="categorical")?

This means we can't just have a single argument called "missing" for Problem 3, since that would be a confusing user-interface for people who are trying to solve the other two.
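The three problems are easy to conflate, so here is a small base R illustration of each one on made-up vectors (`x`, `g`, and `h` are assumptions for illustration). It does not use modelsummary; it only shows why a single "missing" switch would be ambiguous.

```r
x <- c(1, NA, 3)
g <- factor(c("a", "b", "b", NA))
h <- factor(c("u", "u", "v", "v"))

# Problem 1: a statistic computed on data that contains NA
mean(x)                # NA
mean(x, na.rm = TRUE)  # 2

# Problem 2: a cross-tab cell with no observations
# (no row combines g == "a" with h == "v", so that cell is 0)
table(g, h)

# Problem 3: whether NA appears as a category at all
table(g)                   # NA is left out
table(g, useNA = "ifany")  # NA is counted as its own category
```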

One "stricter" alternative would be to display a line for NAs by default, and to require explicit intervention by the user to remove that line. To do that, users could pre-process their data with na.rm. Alternatively, they could simply convert the variable to a factor with factor():

dat$x <- factor(dat$x)

Since exclude=NA by default in factor, the NAs won't be assigned to a distinct level and they will be omitted from the table automatically.
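The effect of the `exclude` argument can be checked directly in base R. A short sketch on a made-up character vector:

```r
x <- c("a", NA, "b", "a")

# Default (exclude = NA): NA is not assigned a level
f_default <- factor(x)
levels(f_default)

# exclude = NULL keeps NA as a level of its own
f_keep <- factor(x, exclude = NULL)
levels(f_keep)

# Calling factor() again with the defaults removes the NA level
levels(factor(f_keep))
```

The last call is the relevant one here: even if a variable already carries an explicit NA level, a plain `factor()` call is enough to drop it.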

This would make including NA more convenient, and excluding NA less convenient. In interactive use, we probably want the safer or "more conservative" choice of keeping NAs, but when we prep tables for publication it's OK to invest one more line of code to clear out the NAs, so excluding them should be the less convenient, non-default option.

vincentarelbundock commented 3 years ago

As of https://github.com/vincentarelbundock/modelsummary/commit/3d6a2bcf570f6ee9aea5a1db688332805d62bcdb, NAs should appear by default in datasummary_balance and datasummary_crosstab output. Users who wish to make them disappear can do:

dat$x <- factor(dat$x)
datasummary_crosstab(x * y ~ z, data = dat)

elbersb commented 3 years ago

Looks good!

vincentarelbundock / modelsummary

`datasummary_*` NA category display #286