systematic colData trimming

From @vjcitn on January 14, 2019 10:26

I have found that basic retrievals with curatedTCGAData lead to colData with thousands of fields. The following function removes fields whose fraction of NA values is at or above a given threshold (defaulting to 20%) and optionally drops fields whose names contain any of the given strings. Pitfalls, other approaches?

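> cdTrimmer
function(dfr, maxnaFrac=.2, killstrings=c("portion", "analyte")) {
 stopifnot(inherits(dfr, "DataFrame"))
 nn = names(dfr)
 # fraction of NA per column, computed column by column (no matrix coercion)
 fracNA = vapply(nn, function(n) mean(is.na(dfr[[n]])), numeric(1))
 drop = fracNA >= maxnaFrac # logical, one entry per column
 for (s in killstrings)
    drop = drop | grepl(s, nn, fixed=TRUE) # literal substring match
 # logical indexing stays correct when nothing is dropped;
 # dfr[,-todrop] with an empty todrop would return zero columns
 dfr[, !drop, drop=FALSE]
}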
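
One pitfall to watch for in any trimmer of this kind: negative indexing with an empty index vector selects zero columns, not all of them, so a table in which no field qualifies for removal can come back empty. The sketch below illustrates this with a plain base-R data.frame standing in for colData; the toy table and its column names are made up for illustration, and the same indexing rule is assumed to carry over to DataFrame.

```r
# Toy stand-in for colData: a plain data.frame (assumption; no Bioconductor needed)
toy = data.frame(age = c(61, 58, 70), stage = c("I", "II", "II"))

# No column contains NA, so nothing qualifies for dropping
fracNA = vapply(toy, function(x) mean(is.na(x)), numeric(1))
todrop = which(fracNA >= 0.2)   # integer(0)

bad  = toy[, -todrop, drop = FALSE]            # empty negative index: no columns selected
good = toy[, !(fracNA >= 0.2), drop = FALSE]   # logical index: every column kept

ncol(bad)    # 0
ncol(good)   # 2
```

Building a logical keep/drop vector, or guarding with `if (length(todrop))`, avoids the problem.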

Copied from original issue: waldronlab/curatedMetagenomicData#152

waldronlab / curatedTCGAData

systematic colData trimming #23