waldronlab / curatedTCGAData

Curated Data From The Cancer Genome Atlas (TCGA) as MultiAssayExperiment Objects
https://bioconductor.org/packages/curatedTCGAData

systematic colData trimming #23

Closed lwaldron closed 5 years ago

lwaldron commented 5 years ago

From @vjcitn on January 14, 2019 10:26

I have found that basic retrievals with curatedTCGAData yield colData with thousands of fields. The following function removes fields whose fraction of NA values is at least a given threshold (defaulting to 20%) and optionally drops fields whose names match given strings. Pitfalls, other approaches?

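cdTrimmer <- function(dfr, maxnaFrac = 0.2, killstrings = c("portion", "analyte")) {
  stopifnot(inherits(dfr, "DataFrame"))
  # fraction of NA values in each column
  meanNA <- vapply(dfr, function(x) mean(is.na(x)), numeric(1))
  todrop <- which(meanNA >= maxnaFrac)  # initial indices
  nn <- names(dfr)
  # also drop columns whose names match any of the killstrings
  for (s in killstrings)
    todrop <- unique(c(todrop, grep(s, nn)))
  # index by the columns to keep: dfr[, -todrop] selects no columns when todrop is empty
  dfr[, setdiff(seq_along(nn), todrop), drop = FALSE]
}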
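
A minimal sketch of the same trimming idea on a toy base-R data.frame, which stands in for colData here as an assumption (the column names and values are made up for illustration). It also shows one pitfall worth guarding against: negative indexing with an empty index vector selects no columns at all, so indexing by the columns to keep is the safer choice.

```r
# Toy stand-in for colData(); column names and values are hypothetical.
toy <- data.frame(
  patientID      = c("A", "B", "C", "D", "E"),
  age            = c(61, 64, 58, 70, 66),        # no NA values
  portion_weight = c(1.2, 1.1, 0.9, 1.0, 1.3),   # name contains "portion"
  rare_field     = c(NA, NA, NA, "x", NA),       # 80% NA
  stringsAsFactors = FALSE
)

# Fraction of NA values per column.
naFrac <- vapply(toy, function(x) mean(is.na(x)), numeric(1))

# Columns to drop: too many NAs, or a name matching a kill string.
todrop <- which(naFrac >= 0.2)
todrop <- union(todrop, grep("portion", names(toy)))

# Index by the columns to keep, so an empty `todrop` keeps everything.
keep <- setdiff(seq_along(toy), todrop)
trimmed <- toy[, keep, drop = FALSE]
names(trimmed)   # "patientID" "age"

# The pitfall: with nothing to drop, negative indexing selects zero columns.
none <- integer(0)
ncol(toy[, -none, drop = FALSE])                           # 0
ncol(toy[, setdiff(seq_along(toy), none), drop = FALSE])   # 4
```

The `drop = FALSE` argument keeps the result a table even when only one column survives the trimming.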

Copied from original issue: waldronlab/curatedMetagenomicData#152

lwaldron commented 5 years ago

This issue was moved to waldronlab/TCGAutils#19