njtierney / naniar

Tidy data structures, summaries, and visualisations for missing data
http://naniar.njtierney.com/

Is there a way to shade or recode_shadow on the whole df? #249

Open jzadra opened 4 years ago

jzadra commented 4 years ago

Is there any way to do a shade() or a recode_shadow() on the entire df to handle special missings like -99 for every column? Both seem to only operate on vectors currently.
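For illustration only, here is a minimal base R sketch of what a whole-data-frame version of this request might look like. `shadow_special()` is a hypothetical helper, not a {naniar} function, and the `_NA` suffix with `!NA` / `NA` levels is assumed here to imitate {naniar}'s shadow column convention; the `NA_special` label is made up.

```r
# Hypothetical helper (NOT part of naniar): for every column, add a
# "<col>_NA" shadow factor and replace the special code with NA.
shadow_special <- function(df, code = -99, label = "NA_special") {
  # Shadow columns: "!NA" for observed, "NA" for ordinary missing,
  # `label` for the special missing code.
  shadow <- df
  shadow[] <- lapply(df, function(x) {
    state <- ifelse(is.na(x), "NA", ifelse(x == code, label, "!NA"))
    factor(state, levels = c("!NA", "NA", label))
  })
  names(shadow) <- paste0(names(df), "_NA")

  # Cleaned columns: the special code becomes a regular NA.
  cleaned <- df
  cleaned[] <- lapply(df, function(x) replace(x, !is.na(x) & x == code, NA))

  cbind(cleaned, shadow)
}

# Made-up toy data with -99 as the special missing code
toy <- data.frame(a = c(1, -99, NA), b = c(-99, 2, 3))
result <- shadow_special(toy)
```

The shadow columns keep track of why each value is missing, so the `-99` entries can be told apart from ordinary `NA`s after the recode.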

njtierney commented 4 years ago

Not currently, but that is a great suggestion!

hhp94 commented 1 year ago

Thank you for {naniar}! Please excuse me for bumping this feature request.

A common use case I can see for {naniar} is metabolism panel data where, in wide form, each column is a metabolite, metal, or chemical. These values have particular types of missings, flagged as below the "limit of detection (LOD)" or "limit of quantitation (LOQ)". It would be great if we could apply recode_shadow() to all of these columns using across() or the _if / _at variants. A typical panel data set looks like the one below.

n_people <- 5
n_chemicals <- 5
prob_missing <- 0.5
chemical_names <- paste0("chemical_", seq_len(n_chemicals))

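# Simulate n measurements; each is replaced by an LOD/LOQ code
# with probability prob_missing (read from the global environment)
lod_fns <- function(n) {
  flip_coins <- runif(n)
  value <- round(rnorm(n), 3)
  lod_loq <- sample(c("NA_LOD", "NA_LOQ"), size = n, replace = TRUE)
  # Result is character because codes and numbers share one column
  ifelse(flip_coins <= prob_missing, lod_loq, as.character(value))
}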

panel_long <- data.frame(
  id = rep(seq_len(n_people), n_chemicals),
  chemicals = rep(chemical_names, each = n_people),
  value = lod_fns(n_people * n_chemicals)
  )

panel_wide <- panel_long |>
  tidyr::pivot_wider(id_cols = id,
                     names_from = "chemicals",
                     values_from = "value")

panel_wide

#      id chemical_1 chemical_2 chemical_3 chemical_4 chemical_5
#   <int> <chr>      <chr>      <chr>      <chr>      <chr>
# 1     1 NA_LOD     NA_LOQ     NA_LOQ     NA_LOQ     NA_LOQ
# 2     2 NA_LOQ     NA_LOQ     NA_LOQ     NA_LOQ     NA_LOD
# 3     3 -0.843     NA_LOQ     -0.275     NA_LOQ     -0.767
# 4     4 1          NA_LOD     0.244      0.788      -0.532
# 5     5 -1.823     NA_LOD     0.313      1.426      0.196
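
As a rough illustration of the kind of result a column-wise recode could give for a table of this shape, here is a base R sketch. `split_special_missing()` is a hypothetical helper, not a {naniar} function, and the `_NA` suffix and `!NA` level are assumed to mirror {naniar}'s shadow column convention. The small `panel` table is made up in the same layout as `panel_wide`.

```r
# Hypothetical helper (NOT part of naniar): for each listed column, keep the
# numeric values in place and move the special missing codes into a
# "<col>_NA" shadow factor.
split_special_missing <- function(df, cols, codes = c("NA_LOD", "NA_LOQ")) {
  for (col in cols) {
    raw <- df[[col]]
    is_code <- raw %in% codes
    # Shadow state: the code itself, "NA" for ordinary missing, else "!NA"
    state <- ifelse(is_code, raw, ifelse(is.na(raw), "NA", "!NA"))
    df[[paste0(col, "_NA")]] <- factor(state, levels = c("!NA", "NA", codes))
    # Blank out the codes first so as.numeric() converts without warnings
    df[[col]] <- as.numeric(replace(raw, is_code, NA))
  }
  df
}

# Made-up table in the same shape as panel_wide
panel <- data.frame(
  id = 1:3,
  chemical_1 = c("NA_LOD", "-0.843", "1"),
  chemical_2 = c("NA_LOQ", "NA_LOQ", "0.244")
)

# Select every chemical column by name prefix and recode them in one call
chem_cols <- grep("^chemical_", names(panel), value = TRUE)
recoded <- split_special_missing(panel, chem_cols)
```

Selecting the columns by prefix stands in for what `across()` or an `_at` variant would do; the chemical columns end up numeric, and the LOD/LOQ information survives in the shadow columns.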

For this special type of data, the chemical columns usually all hold the same type of data, so I can see two solutions.

I am not too familiar with the {naniar} source code, but I would love to take a crack at this. I would appreciate some pointers on where I should start reading.

Cheers!

EDIT: typos