Closed: lwaldron closed this issue 5 years ago
Sorry to move this a second time, but it probably does belong in TCGAutils. How about the following function?
trimColData <-
function(object, maxnaFrac = 0.2, killstrings = c("portion", "analyte")) {
    dfr <- colData(object)
    stopifnot(inherits(dfr, "DataFrame"))
    killstrings <- na.omit(killstrings)
    # Fraction of NA values in each colData column
    meanNA <- vapply(dfr, function(x) mean(is.na(x)), numeric(1))
    manyNA <- meanNA >= maxnaFrac
    # Flag columns whose names match any of the kill strings (regular expressions)
    allkill <- rep(FALSE, length(dfr))
    for (i in seq_along(killstrings)) {
        allkill <- allkill | grepl(killstrings[i], names(dfr))
    }
    todrop <- manyNA | allkill
    # drop = FALSE keeps a DataFrame even when only one column survives
    colData(object) <- dfr[, !todrop, drop = FALSE]
    return(object)
}
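For anyone who wants to see the filtering rule in action without pulling TCGA data, here is a minimal sketch in base R. It is not the TCGAutils code: `trimColumns()` and the `toy` data.frame are hypothetical stand-ins that apply the same two criteria (NA fraction at or above `maxnaFrac`, or a name matching one of `killstrings`) to a plain data.frame instead of `colData(object)`.

```r
# Toy stand-in for colData(): invented values, for illustration only.
toy <- data.frame(
    patientID      = paste0("P", 1:10),
    age            = c(61, NA, 58, 70, 66, 49, 73, 55, 64, 60),   # 10% NA
    mostly_missing = c(NA, NA, NA, 1, NA, NA, 2, NA, NA, NA),     # 80% NA
    portion_weight = seq(10, 100, by = 10),                       # name matches "portion"
    stringsAsFactors = FALSE
)

# trimColumns() is a hypothetical helper, not part of any package.
trimColumns <- function(df, maxnaFrac = 0.2,
                        killstrings = c("portion", "analyte")) {
    # Fraction of NA values per column
    naFrac <- vapply(df, function(x) mean(is.na(x)), numeric(1))
    manyNA <- naFrac >= maxnaFrac
    # Columns whose names match any kill string (patterns are regular expressions)
    nameHit <- rep(FALSE, ncol(df))
    for (pattern in killstrings) {
        nameHit <- nameHit | grepl(pattern, names(df))
    }
    # drop = FALSE keeps a data.frame even if a single column remains
    df[, !(manyNA | nameHit), drop = FALSE]
}

trimmed <- trimColumns(toy)
print(names(trimmed))
```

One pitfall this makes easy to check: the kill strings are matched with `grepl()`, so they behave as substrings or regular expressions rather than exact names. For example, "portion" would also match a column called "proportion".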
Looks fine. Sorry for posting to the wrong repo, two repos ago!
Thanks for the contribution; see commit https://github.com/waldronlab/TCGAutils/commit/45c8cacccc14200721695fe4ab25cc4e398d29b2
From @lwaldron on January 14, 2019 14:30
From @vjcitn on January 14, 2019 10:26
I have found that basic retrievals with curatedTCGAData lead to colData with thousands of fields. The following function removes fields whose fraction of NA values is at or above a given threshold (default 20%) and optionally drops fields whose names contain given strings. Pitfalls, other approaches?
Copied from original issue: waldronlab/curatedMetagenomicData#152
Copied from original issue: waldronlab/curatedTCGAData#23