I have found that basic retrievals with curatedTCGAData lead to colData with thousands of fields. The following function removes fields whose fraction of NA values is at least a given threshold (defaulting to 20%) and optionally drops fields whose names contain given strings. Pitfalls, other approaches?
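> cdTrimmer
function(dfr, maxnaFrac=.2, killstrings=c("portion", "analyte")) {
  stopifnot(inherits(dfr, "DataFrame"))
  # fraction of NA in each column, computed one column at a time
  meanNA = vapply(seq_len(ncol(dfr)), function(i) mean(is.na(dfr[[i]])), numeric(1))
  todrop = meanNA >= maxnaFrac # logical mask, one entry per column
  nn = names(dfr)
  for (s in killstrings)
    todrop = todrop | grepl(s, nn) # each killstring is treated as a regular expression
  # a logical mask avoids dfr[, -integer(0)], which would select zero columns
  dfr[, !todrop, drop=FALSE]
}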
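A possible pitfall when dropping columns by negative integer index: if no column qualifies for removal, the index is empty and the subset returns zero columns instead of all of them. The sketch below illustrates this on a toy base `data.frame` standing in for colData (the column names and values are made up for illustration, and I would expect a `DataFrame` to behave the same way, though that is an assumption here), then shows a logical keep mask as one alternative.

```r
# Toy stand-in for colData; names and values are hypothetical.
toy <- data.frame(
  age = c(50, 61, 58, 47),
  tumor_grade = c(NA, NA, NA, 2),
  portion_weight = c(1.1, 0.9, 1.3, 1.2)
)

# Pitfall: with nothing to drop, a negative index selects no columns at all.
nothing_to_drop <- integer(0)
ncol(toy[, -nothing_to_drop, drop = FALSE])  # 0

# Alternative: combine the NA criterion and the name criterion in a logical mask.
na_frac <- vapply(toy, function(x) mean(is.na(x)), numeric(1))
keep <- na_frac < 0.2 & !grepl("portion|analyte", names(toy))
trimmed <- toy[, keep, drop = FALSE]
names(trimmed)  # "age"
```

Using `drop = FALSE` keeps the result a table even when only one column survives.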
Copied from original issue: waldronlab/curatedMetagenomicData#152
From @vjcitn on January 14, 2019 10:26