waldronlab / TCGAutils

Toolbox package for organizing and working with TCGA data
https://bioconductor.org/packages/TCGAutils

systematic colData trimming #19

Closed lwaldron closed 5 years ago

lwaldron commented 5 years ago

From @lwaldron on January 14, 2019 14:30

From @vjcitn on January 14, 2019 10:26

I have found that basic retrievals with curatedTCGAData lead to colData with thousands of fields. The following function removes fields whose fraction of NA values is at least a given threshold (defaulting to 20%) and optionally drops fields whose names contain given strings. Pitfalls, other approaches?

cdTrimmer <- function(dfr, maxnaFrac = 0.2, killstrings = c("portion", "analyte")) {
  stopifnot(inherits(dfr, "DataFrame"))
  # fraction of NA values in each column
  meanNA <- apply(dfr, 2, function(x) mean(is.na(x)))
  todrop <- meanNA >= maxnaFrac # logical mask of columns to drop
  nn <- names(dfr)
  for (s in killstrings)
    todrop <- todrop | grepl(s, nn)
  # a logical mask also works when no column is flagged
  dfr[, !todrop, drop = FALSE]
}
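
On the pitfalls question, one thing to watch for is dropping columns by negative integer index, as in `x[, -todrop]`: when nothing is flagged, the index is empty and the result has no columns at all. Below is a minimal sketch of that behavior using a base `data.frame` with made-up columns (`age`, `sex`); I would expect a `DataFrame` to behave the same way, but that is an assumption and is not checked here.

```r
# Hypothetical toy data: no NAs, so no column should be dropped
df <- data.frame(age = c(50, 61, 47), sex = c("F", "M", "F"))

na_frac <- colMeans(is.na(df))    # fraction of NA per column
todrop <- which(na_frac >= 0.2)   # empty: nothing is flagged

# Negating an empty index still gives an empty index,
# so this selects zero columns instead of all of them
dropped_by_index <- df[, -todrop]
ncol(dropped_by_index)

# A logical mask keeps every column when nothing is flagged
keep <- na_frac < 0.2
kept_by_mask <- df[, keep, drop = FALSE]
ncol(kept_by_mask)
```

Building a logical vector of columns to drop and negating it avoids the special case entirely.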

Copied from original issue: waldronlab/curatedMetagenomicData#152

Copied from original issue: waldronlab/curatedTCGAData#23

lwaldron commented 5 years ago

Sorry to move this a second time, but it probably does belong in TCGAutils. How about the following function?

trimColData <-
  function(object, maxnaFrac = 0.2, killstrings = c("portion", "analyte")) {
    dfr <- colData(object)
    stopifnot(inherits(dfr, "DataFrame"))
    killstrings <- na.omit(killstrings)
    # fraction of NA values in each colData column
    meanNA <- apply(dfr, 2, function(x) mean(is.na(x)))
    manyNA <- meanNA >= maxnaFrac
    # flag columns whose names match any of the killstrings patterns
    allkill <- rep(FALSE, ncol(dfr))
    for (i in seq_along(killstrings)) {
      allkill <- allkill | grepl(killstrings[i], names(dfr))
    }
    todrop <- manyNA | allkill
    # drop = FALSE keeps a DataFrame even if only one column remains
    colData(object) <- dfr[, !todrop, drop = FALSE]
    return(object)
  }

vjcitn commented 5 years ago

Looks fine. Sorry for posting to the wrong repo, two repos ago!

LiNk-NY commented 5 years ago

Thanks for the contribution; see commit https://github.com/waldronlab/TCGAutils/commit/45c8cacccc14200721695fe4ab25cc4e398d29b2