rcastelo / GSVA

Gene set variation analysis
200 stars 40 forks source link

fastmatch::fmatch modifies names in place #39

Closed LTLA closed 3 years ago

LTLA commented 3 years ago

Taken from ?gsva:

     p <- 10 ## number of genes
     n <- 30 ## number of samples
     nGrp1 <- 15 ## number of samples in group 1
     nGrp2 <- n - nGrp1 ## number of samples in group 2

     geneSets <- list(set1=paste("g", 1:3, sep=""),
                      set2=paste("g", 4:6, sep=""),
                      set3=paste("g", 7:10, sep=""))
     y <- matrix(rnorm(n*p), nrow=p, ncol=n,
                 dimnames=list(paste("g", 1:p, sep="") , paste("s", 1:n, sep="")))
     y[geneSets$set1, (nGrp1+1):n] <- y[geneSets$set1, (nGrp1+1):n] + 2

     library(limma)
     design <- cbind(sampleGroup1=1, sampleGroup2vs1=c(rep(0, nGrp1), rep(1, nGrp2)))
     fit <- lmFit(y, design)
     fit <- eBayes(fit)
     topTable(fit, coef="sampleGroup2vs1")

rownames(y) # okay, so far so good:
##  [1] "g1"  "g2"  "g3"  "g4"  "g5"  "g6"  "g7"  "g8"  "g9"  "g10"

     ## estimate GSVA enrichment scores for the three sets
     library(GSVA)
     gsva_es <- gsva(y, geneSets, mx.diff=1)

rownames(y) # What the hey!?
##  [1] "g1"  "g2"  "g3"  "g4"  "g5"  "g6"  "g7"  "g8"  "g9"  "g10"
## attr(,".match.hash")
## <hash table>

You can see how calling gsva modifies rownames(y) in place. This is very unexpected; R functions should not exhibit side effects like this, given that y is not an environment or reference class.

The addition of attributes breaks downstream functions like the SummarizedExperiment constructor:

SummarizedExperiment::SummarizedExperiment(y)
## Error in validObject(.Object) :
##   invalid class “SummarizedExperiment” object:
##     'names(x)' must be NULL or a character vector with no attributes

Why not just use base::match()? Surely the speed improvement, if any, cannot be so fantastic as to tolerate this behavior.

Session information ``` R Under development (unstable) (2020-11-09 r79409) Platform: x86_64-apple-darwin19.5.0 (64-bit) Running under: macOS Catalina 10.15.5 Matrix products: default BLAS: /Users/luna/Software/R/trunk/lib/libRblas.dylib LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GSVA_1.39.11 limma_3.47.3 loaded via a namespace (and not attached): [1] Rcpp_1.0.6 compiler_4.1.0 [3] GenomeInfoDb_1.27.3 XVector_0.31.1 [5] DelayedMatrixStats_1.13.4 rhdf5filters_1.3.3 [7] MatrixGenerics_1.3.0 bitops_1.0-6 [9] tools_4.1.0 zlibbioc_1.37.0 [11] SingleCellExperiment_1.13.4 digest_0.6.27 [13] bit_4.0.4 rhdf5_2.35.0 [15] lattice_0.20-41 annotate_1.69.0 [17] RSQLite_2.2.2 memoise_1.1.0 [19] rlang_0.4.10 fastmatch_1.1-0 [21] Matrix_1.3-2 graph_1.69.0 [23] DelayedArray_0.17.7 DBI_1.1.1 [25] parallel_4.1.0 GenomeInfoDbData_1.2.4 [27] httr_1.4.2 S4Vectors_0.29.6 [29] vctrs_0.3.6 IRanges_2.25.6 [31] stats4_4.1.0 bit64_4.0.5 [33] grid_4.1.0 GSEABase_1.53.1 [35] Biobase_2.51.0 R6_2.5.0 [37] HDF5Array_1.19.0 AnnotationDbi_1.53.0 [39] BiocParallel_1.25.2 XML_3.99-0.5 [41] irlba_2.3.3 BiocSingular_1.7.1 [43] Rhdf5lib_1.13.0 blob_1.2.1 [45] sparseMatrixStats_1.3.2 matrixStats_0.57.0 [47] BiocGenerics_0.37.0 GenomicRanges_1.43.3 [49] rsvd_1.0.3 beachmat_2.7.6 [51] SummarizedExperiment_1.21.1 xtable_1.8-4 [53] ScaledMatrix_0.99.2 RCurl_1.98-1.2 ```
rcastelo commented 3 years ago

Ugh! .. that's ugly. The use of fastmatch::fmatch() was first introduced in a PR by @assaron (copying him here so that he's aware about this side-effect) to speed up the operation of matching features in gene sets to features in an expression data matrix. Such an operation is done by iterating with lapply() through gene sets and at each iteration, matching the features in the gene set to the features (rownames) in the expression data matrix. The trick of fastmatch::fmatch() of building and storing a hash table greatly speeds up this (looping) operation, see the following benchmark:

library(GSVAdata)
library(fastmatch)
library(hgu95a.db)
library(bench)

data(c2BroadSets)
data(leukemia)

gsets <- geneIds(c2BroadSets)
length(gsets)
## [1] 3272
probeids <- featureNames(leukemia_eset)
genes <- select(hgu95a.db, keys=probeids, columns="ENTREZID")
genes <- unique(genes$ENTREZID)
length(genes)
## [1] 9668

mapgsetsmatch <- function(gsets, genes) {
  lapply(gsets, function(x, y) na.omit(match(x ,y)), genes) 
}

mapgsetsfmatch <- function(gsets, genes) {
  lapply(gsets, function(x, y) na.omit(fmatch(x ,y)), genes)
}

res <- bench::mark(mapgsetsmatch(gsets, genes),
                   mapgsetsfmatch(gsets, genes),
                   time_unit="s",
                   iterations=100)
res
# A tibble: 2 x 13
  expression                      min median `itr/sec` mem_alloc `gc/sec` n_itr
  <bch:expr>                    <dbl>  <dbl>     <dbl> <bch:byt>    <dbl> <int>
1 mapgsetsmatch(gsets, genes)  0.593  0.649       1.49  660.67MB    19.2    100
2 mapgsetsfmatch(gsets, genes) 0.0272 0.0288     28.3     7.79MB     5.93   100
# … with 6 more variables: n_gc <dbl>, total_time <dbl>, result <list>,
#   memory <list>, time <list>, gc <list>

As you can see, the improvement is not only in terms of speed, but also in terms of memory consumption. For this reason, i've decided to retain the call to fastmatch::fmatch() and duplicate the feature vector (trickier than I thought!) before doing this call to avoid affecting the original input feature vector. I've also factored out the code and you can find the fix at GSVA/R/utils.R. If you think this fix is sufficiently safe, I'll port it to the release version of GSVA.

cc'ing also @pablo-rodr-bio2 to make him aware about this issue since he's also working on GSVA, to enable it working with SingleCellExperiment objects and the DelayArray framework, thanks by the way for giving a hand to him with BiocSingular.

assaron commented 3 years ago

@rcastelo Thanks. I'm aware of this behavior, although I didn't expect it to break anything. Still, in fgsea we do a copy of the genes, so the original object is not modified.

LTLA commented 3 years ago

To round out the options, one may also consider the joys of a CharacterList:

# Added as.integer to remove attributes for mark().
mapgsetsmatch <- function(gsets, genes) {
  lapply(gsets, function(x, y) as.integer(na.omit(match(x ,y))), genes)
}

mapgsetsfmatch <- function(gsets, genes) {
  lapply(gsets, function(x, y) as.integer(na.omit(fmatch(x ,y))), genes)
}

gsets2 <- CharacterList(gsets)
mapgsetsunlist <- function(gsets2, genes) {
  m <- match(gsets2, genes)
  as.list(m[!is.na(m)])
}

res <- bench::mark(mapgsetsmatch(gsets, genes),
                   mapgsetsfmatch(gsets, genes),
                   mapgsetsunlist(gsets2, genes),
                   time_unit="s",
                   iterations=100)
## # A tibble: 3 x 13
##   expression                       min median `itr/sec` mem_alloc `gc/sec` n_itr
##   <bch:expr>                     <dbl>  <dbl>     <dbl> <bch:byt>    <dbl> <int>
## 1 mapgsetsmatch(gsets, genes)   1.01   1.09       0.898        NA     8.24   100
## 2 mapgsetsfmatch(gsets, genes)  0.0373 0.0398    22.5          NA     3.59   100
## 3 mapgsetsunlist(gsets2, genes) 0.0243 0.0284    29.2          NA     7.87   100
## # … with 6 more variables: n_gc <dbl>, total_time <dbl>, result <list>,
## #   memory <list>, time <list>, gc <list>

Slightly faster even after GC'ing like a madman.

I use this for most of my gene set processing, prints nicer too:

gsets2
## CharacterList of length 3272
## [["NAKAMURA_CANCER_MICROENVIRONMENT_UP"]] 5167 100288400 338328 ... 1278 3046
## [["NAKAMURA_CANCER_MICROENVIRONMENT_DN"]] 55215 9319 81610 ... 586 4141 55706
## [["WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP"]] 5142 6781 580 ... 9055 51514 259266
## [["WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN"]] 125 2619 5919 ... 4929 54361 1634
## [["WINTER_HYPOXIA_UP"]] 7022 404550 5738 9456 5230 ... 9989 51248 5817 55142
## [["WINTER_HYPOXIA_DN"]] 5168 9452 3112 91526 55843 ... 10550 7109 51284 4005
## [["PARENT_MTOR_SIGNALING_UP"]] 23510 23511 942 5230 ... 22820 1356 7009 901
## [["PARENT_MTOR_SIGNALING_DN"]] 1272 64772 10893 3082 ... 3892 7840 10622 5818
## [["PYEON_HPV_POSITIVE_TUMORS_UP"]] 8880 2146 6595 153579 ... 3978 4065 84250
## [["PYEON_HPV_POSITIVE_TUMORS_DN"]] 2017 595 11202 388533 ... 5655 375061 857
rcastelo commented 3 years ago

Nice, i've run it again with the memory profiling and the CharacterList option consumes more RAM than fmatch but GSVA already imports IRanges, so we would save the one dependency on fastmatch:

# A tibble: 3 x 13
  expression                       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>                     <dbl>  <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl>
1 mapgsetsmatch(gsets, genes)   0.570  0.598       1.63  661.61MB    20.2    100  1238      61.4 
2 mapgsetsfmatch(gsets, genes)  0.0285 0.0295     29.8     8.74MB     8.05   100    27       3.36
3 mapgsetsunlist(gsets2, genes) 0.0182 0.0188     42.2    14.26MB    16.0    100    38       2.37
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

I'll probably switch to CharacterList, thanks for the hint!

rcastelo commented 3 years ago

Fixed by using the IRanges::match,CharacterList-method in both devel (1.39.15) and release (1.38.1). Closing this issue.