Closed LTLA closed 3 years ago
Ugh! .. that's ugly. The use of fastmatch::fmatch()
was first introduced in a PR by @assaron (copying him here so that he's aware about this side-effect) to speed up the operation of matching features in gene sets to features in an expression data matrix. Such an operation is done by iterating with lapply()
through gene sets and at each iteration, matching the features in the gene set to the features (rownames) in the expression data matrix. The trick of fastmatch::fmatch()
of building and storing a hash table greatly speeds up this (looping) operation, see the following benchmark:
library(GSVAdata)
library(fastmatch)
library(hgu95a.db)
library(bench)
data(c2BroadSets)
data(leukemia)
gsets <- geneIds(c2BroadSets)
length(gsets)
## [1] 3272
probeids <- featureNames(leukemia_eset)
genes <- select(hgu95a.db, keys=probeids, columns="ENTREZID")
genes <- unique(genes$ENTREZID)
length(genes)
## [1] 9668
mapgsetsmatch <- function(gsets, genes) {
lapply(gsets, function(x, y) na.omit(match(x ,y)), genes)
}
mapgsetsfmatch <- function(gsets, genes) {
lapply(gsets, function(x, y) na.omit(fmatch(x ,y)), genes)
}
res <- bench::mark(mapgsetsmatch(gsets, genes),
mapgsetsfmatch(gsets, genes),
time_unit="s",
iterations=100)
res
# A tibble: 2 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl> <int>
1 mapgsetsmatch(gsets, genes) 0.593 0.649 1.49 660.67MB 19.2 100
2 mapgsetsfmatch(gsets, genes) 0.0272 0.0288 28.3 7.79MB 5.93 100
# … with 6 more variables: n_gc <dbl>, total_time <dbl>, result <list>,
# memory <list>, time <list>, gc <list>
As you can see, the improvement is not only in terms of speed, but also in terms of memory consumption. For this reason, i've decided to retain the call to fastmatch::fmatch()
and duplicate the feature vector (trickier than I thought!) before doing this call to avoid affecting the original input feature vector. I've also factored out the code and you can find the fix at GSVA/R/utils.R. If you think this fix is sufficiently safe, I'll port it to the release version of GSVA.
cc'ing also @pablo-rodr-bio2 to make him aware about this issue since he's also working on GSVA, to enable it working with SingleCellExperiment
objects and the DelayArray
framework, thanks by the way for giving a hand to him with BiocSingular.
@rcastelo Thanks. I'm aware of this behavior, although I didn't expect it to break anything. Still, in fgsea we do a copy of the genes, so the original object is not modified.
To round out the options, one may also consider the joys of a CharacterList
:
# Added as.integer to remove attributes for mark().
mapgsetsmatch <- function(gsets, genes) {
lapply(gsets, function(x, y) as.integer(na.omit(match(x ,y))), genes)
}
mapgsetsfmatch <- function(gsets, genes) {
lapply(gsets, function(x, y) as.integer(na.omit(fmatch(x ,y))), genes)
}
gsets2 <- CharacterList(gsets)
mapgsetsunlist <- function(gsets2, genes) {
m <- match(gsets2, genes)
as.list(m[!is.na(m)])
}
res <- bench::mark(mapgsetsmatch(gsets, genes),
mapgsetsfmatch(gsets, genes),
mapgsetsunlist(gsets2, genes),
time_unit="s",
iterations=100)
## # A tibble: 3 x 13
## expression min median `itr/sec` mem_alloc `gc/sec` n_itr
## <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl> <int>
## 1 mapgsetsmatch(gsets, genes) 1.01 1.09 0.898 NA 8.24 100
## 2 mapgsetsfmatch(gsets, genes) 0.0373 0.0398 22.5 NA 3.59 100
## 3 mapgsetsunlist(gsets2, genes) 0.0243 0.0284 29.2 NA 7.87 100
## # … with 6 more variables: n_gc <dbl>, total_time <dbl>, result <list>,
## # memory <list>, time <list>, gc <list>
Slightly faster even after GC'ing like a madman.
I use this for most of my gene set processing, prints nicer too:
gsets2
## CharacterList of length 3272
## [["NAKAMURA_CANCER_MICROENVIRONMENT_UP"]] 5167 100288400 338328 ... 1278 3046
## [["NAKAMURA_CANCER_MICROENVIRONMENT_DN"]] 55215 9319 81610 ... 586 4141 55706
## [["WEST_ADRENOCORTICAL_TUMOR_MARKERS_UP"]] 5142 6781 580 ... 9055 51514 259266
## [["WEST_ADRENOCORTICAL_TUMOR_MARKERS_DN"]] 125 2619 5919 ... 4929 54361 1634
## [["WINTER_HYPOXIA_UP"]] 7022 404550 5738 9456 5230 ... 9989 51248 5817 55142
## [["WINTER_HYPOXIA_DN"]] 5168 9452 3112 91526 55843 ... 10550 7109 51284 4005
## [["PARENT_MTOR_SIGNALING_UP"]] 23510 23511 942 5230 ... 22820 1356 7009 901
## [["PARENT_MTOR_SIGNALING_DN"]] 1272 64772 10893 3082 ... 3892 7840 10622 5818
## [["PYEON_HPV_POSITIVE_TUMORS_UP"]] 8880 2146 6595 153579 ... 3978 4065 84250
## [["PYEON_HPV_POSITIVE_TUMORS_DN"]] 2017 595 11202 388533 ... 5655 375061 857
Nice, i've run it again with the memory profiling and the CharacterList
option consumes more RAM than fmatch
but GSVA
already imports IRanges
, so we would save the one dependency on fastmatch
:
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl> <int> <dbl> <dbl>
1 mapgsetsmatch(gsets, genes) 0.570 0.598 1.63 661.61MB 20.2 100 1238 61.4
2 mapgsetsfmatch(gsets, genes) 0.0285 0.0295 29.8 8.74MB 8.05 100 27 3.36
3 mapgsetsunlist(gsets2, genes) 0.0182 0.0188 42.2 14.26MB 16.0 100 38 2.37
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
I'll probably switch to CharacterList
, thanks for the hint!
Fixed by using the IRanges::match,CharacterList-method
in both devel (1.39.15) and release (1.38.1). Closing this issue.
Taken from
?gsva
:You can see how calling
gsva
modifiesrownames(y)
in place. This is very unexpected; R functions should not exhibit side effects like this, given thaty
is not an environment or reference class.The addition of attributes breaks downstream functions like the
SummarizedExperiment
constructor:Why not just use
base::match()
? Surely the speed improvement, if any, cannot be so fantastic as to tolerate this behavior.Session information
``` R Under development (unstable) (2020-11-09 r79409) Platform: x86_64-apple-darwin19.5.0 (64-bit) Running under: macOS Catalina 10.15.5 Matrix products: default BLAS: /Users/luna/Software/R/trunk/lib/libRblas.dylib LAPACK: /Users/luna/Software/R/trunk/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GSVA_1.39.11 limma_3.47.3 loaded via a namespace (and not attached): [1] Rcpp_1.0.6 compiler_4.1.0 [3] GenomeInfoDb_1.27.3 XVector_0.31.1 [5] DelayedMatrixStats_1.13.4 rhdf5filters_1.3.3 [7] MatrixGenerics_1.3.0 bitops_1.0-6 [9] tools_4.1.0 zlibbioc_1.37.0 [11] SingleCellExperiment_1.13.4 digest_0.6.27 [13] bit_4.0.4 rhdf5_2.35.0 [15] lattice_0.20-41 annotate_1.69.0 [17] RSQLite_2.2.2 memoise_1.1.0 [19] rlang_0.4.10 fastmatch_1.1-0 [21] Matrix_1.3-2 graph_1.69.0 [23] DelayedArray_0.17.7 DBI_1.1.1 [25] parallel_4.1.0 GenomeInfoDbData_1.2.4 [27] httr_1.4.2 S4Vectors_0.29.6 [29] vctrs_0.3.6 IRanges_2.25.6 [31] stats4_4.1.0 bit64_4.0.5 [33] grid_4.1.0 GSEABase_1.53.1 [35] Biobase_2.51.0 R6_2.5.0 [37] HDF5Array_1.19.0 AnnotationDbi_1.53.0 [39] BiocParallel_1.25.2 XML_3.99-0.5 [41] irlba_2.3.3 BiocSingular_1.7.1 [43] Rhdf5lib_1.13.0 blob_1.2.1 [45] sparseMatrixStats_1.3.2 matrixStats_0.57.0 [47] BiocGenerics_0.37.0 GenomicRanges_1.43.3 [49] rsvd_1.0.3 beachmat_2.7.6 [51] SummarizedExperiment_1.21.1 xtable_1.8-4 [53] ScaledMatrix_0.99.2 RCurl_1.98-1.2 ```