leekgroup / recount

R package for the recount2 project. Documentation website: http://leekgroup.github.io/recount/
https://jhubiostatistics.shinyapps.io/recount/
40 stars 9 forks source link

[BUG] Severe Slowdown due to rowRanges 'symbol' column in RSE datasets #22

Closed millerh1 closed 3 years ago

millerh1 commented 3 years ago

Hello,

I am usually a happy camper with recount2 and very thankful for all the work you all have put into this tool! That being said, I have noticed a bug recently in which using the RangedSummarizedExperiment objects from recount in R causes severe slowdowns to occur. I have tested this on multiple machines now and find this only happens with the RSE objects from recount and not with typical seq datasets that I analyze.

After a little while of digging into the objects, I noticed that the slowdown appears to be caused by the rowRanges of the recount2 objects, specifically the symbol column. Please see my example here:

library(recount)
library(DESeq2)
library(SummarizedExperiment)
rse_gene <- rse_gene_SRP009615

# Timer without ranges
system.time({
  rse <- SummarizedExperiment(
    assays =  assays(rse_gene),
    colData = colData(rse_gene), 
    # rowRanges = rowRanges(rse_gene)
  )
  rse$condition <- gsub(res$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
  dds <- DESeqDataSet(rse, design = ~condition)
  dds <- DESeq(dds)
})

# Timer with ranges
system.time({
  rse <- SummarizedExperiment(
    assays =  assays(rse_gene),
    colData = colData(rse_gene), 
    rowRanges = rowRanges(rse_gene)
  )
  rse$condition <- gsub(res$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
  dds <- DESeqDataSet(rse, design = ~condition)
  dds <- DESeq(dds)
})

# Timer with ranges but NULL symbol col
system.time({
  rse <- SummarizedExperiment(
    assays =  assays(rse_gene),
    colData = colData(rse_gene), 
    rowRanges = rowRanges(rse_gene)
  )
  rse$condition <- gsub(res$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
  rse@rowRanges$symbol <- NULL
  dds <- DESeqDataSet(rse, design = ~condition)
  dds <- DESeq(dds)
})

And here is the output of running this in the console:

> system.time({
+   rse <- SummarizedExperiment(
+     assays =  assays(rse_gene),
+     colData = colData(rse_gene), 
+     # rowRanges = rowRanges(rse_gene)
+   )
+   rse$condition <- gsub(res$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
+   dds <- DESeqDataSet(rse, design = ~condition)
+   dds <- DESeq(dds)
+ })
converting counts to integer mode
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
   user  system elapsed 
 30.018   0.040  30.070 
Warning message:
In DESeqDataSet(rse, design = ~condition) :
  some variables in design formula are characters, converting to factors
> system.time({
+   rse <- SummarizedExperiment(
+     assays =  assays(rse_gene),
+     colData = colData(rse_gene), 
+     rowRanges = rowRanges(rse_gene)
+   )
+   rse$condition <- gsub(res$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
+   dds <- DESeqDataSet(rse, design = ~condition)
+   dds <- DESeq(dds)
+ })
converting counts to integer mode
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
   user  system elapsed 
745.284   0.810 746.438 
Warning message:
In DESeqDataSet(rse, design = ~condition) :
  some variables in design formula are characters, converting to factors
> # Timer with ranges but NULL symbol col
> system.time({
+   rse <- SummarizedExperiment(
+     assays =  assays(rse_gene),
+     colData = colData(rse_gene), 
+     rowRanges = rowRanges(rse_gene)
+   )
+   rse$condition <- gsub(res$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
+   rse@rowRanges$symbol <- NULL
+   dds <- DESeqDataSet(rse, design = ~condition)
+   dds <- DESeq(dds)
+ })
converting counts to integer mode
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
   user  system elapsed 
 29.840   0.028  29.890 
Warning message:
In DESeqDataSet(rse, design = ~condition) :
  some variables in design formula are characters, converting to factors

As you can see, with the rowRanges included, the code took ~25x longer to finish running. However, simply NULLing the symbol column of the rowRanges was able to prevent this slowdown. I think the issue is that the symbol column was a CharacterList which may perform inefficiently for these application. Anyways, hopefully this helps -- thanks again for all the work you and your team does!!

Session info:

> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] recount_1.18.0              forcats_0.5.1               stringr_1.4.0               dplyr_1.0.7                
 [5] purrr_0.3.4                 readr_1.4.0                 tidyr_1.1.3                 tibble_3.1.2               
 [9] ggplot2_3.3.5               tidyverse_1.3.1             DESeq2_1.32.0               SummarizedExperiment_1.22.0
[13] MatrixGenerics_1.4.0        matrixStats_0.59.0          EnsDb.Hsapiens.v86_2.99.0   ensembldb_2.16.2           
[17] AnnotationFilter_1.16.0     GenomicFeatures_1.44.0      AnnotationDbi_1.54.1        Biobase_2.52.0             
[21] GenomicRanges_1.44.0        GenomeInfoDb_1.28.1         IRanges_2.26.0              S4Vectors_0.30.0           
[25] BiocGenerics_0.38.0         tximport_1.20.0            

loaded via a namespace (and not attached):
  [1] readxl_1.3.1             backports_1.2.1          Hmisc_4.5-0              BiocFileCache_2.0.0     
  [5] plyr_1.8.6               lazyeval_0.2.2           splines_4.1.0            BiocParallel_1.26.1     
  [9] digest_0.6.27            foreach_1.5.1            htmltools_0.5.1.1        fansi_0.5.0             
 [13] magrittr_2.0.1           checkmate_2.0.0          memoise_2.0.0            BSgenome_1.60.0         
 [17] cluster_2.1.2            limma_3.48.1             Biostrings_2.60.1        annotate_1.70.0         
 [21] modelr_0.1.8             prettyunits_1.1.1        jpeg_0.1-8.1             colorspace_2.0-2        
 [25] blob_1.2.1               rvest_1.0.0              rappdirs_0.3.3           haven_2.4.1             
 [29] xfun_0.24                crayon_1.4.1             RCurl_1.98-1.3           jsonlite_1.7.2          
 [33] genefilter_1.74.0        GEOquery_2.60.0          iterators_1.0.13         survival_3.2-11         
 [37] VariantAnnotation_1.38.0 glue_1.4.2               gtable_0.3.0             zlibbioc_1.38.0         
 [41] XVector_0.32.0           DelayedArray_0.18.0      rentrez_1.2.3            scales_1.1.1            
 [45] rngtools_1.5             DBI_1.1.1                derfinderHelper_1.26.0   derfinder_1.26.0        
 [49] Rcpp_1.0.7               xtable_1.8-4             progress_1.2.2           htmlTable_2.2.1         
 [53] bumphunter_1.34.0        foreign_0.8-81           bit_4.0.4                Formula_1.2-4           
 [57] htmlwidgets_1.5.3        httr_1.4.2               RColorBrewer_1.1-2       ellipsis_0.3.2          
 [61] pkgconfig_2.0.3          XML_3.99-0.6             farver_2.1.0             nnet_7.3-16             
 [65] dbplyr_2.1.1             locfit_1.5-9.4           utf8_1.2.1               reshape2_1.4.4          
 [69] tidyselect_1.1.1         labeling_0.4.2           rlang_0.4.11             munsell_0.5.0           
 [73] cellranger_1.1.0         tools_4.1.0              cachem_1.0.5             downloader_0.4          
 [77] cli_3.0.0                generics_0.1.0           RSQLite_2.2.7            broom_0.7.8             
 [81] fastmap_1.1.0            yaml_2.2.1               knitr_1.33               bit64_4.0.5             
 [85] fs_1.5.0                 KEGGREST_1.32.0          doRNG_1.8.2              xml2_1.3.2              
 [89] biomaRt_2.48.2           compiler_4.1.0           rstudioapi_0.13          filelock_1.0.2          
 [93] curl_4.3.2               png_0.1-7                reprex_2.0.0             geneplotter_1.70.0      
 [97] stringi_1.7.3            GenomicFiles_1.28.0      lattice_0.20-44          ProtGenerics_1.24.0     
[101] Matrix_1.3-4             vctrs_0.3.8              pillar_1.6.1             lifecycle_1.0.0         
[105] data.table_1.14.0        bitops_1.0-7             qvalue_2.24.0            rtracklayer_1.52.0      
[109] R6_2.5.0                 BiocIO_1.2.0             latticeExtra_0.6-29      gridExtra_2.3           
[113] codetools_0.2-18         assertthat_0.2.1         rjson_0.2.20             withr_2.4.2             
[117] GenomicAlignments_1.28.0 Rsamtools_2.8.0          GenomeInfoDbData_1.2.6   hms_1.1.0               
[121] grid_4.1.0               rpart_4.1-15             lubridate_1.7.10         base64enc_0.1-3         
[125] restfulr_0.0.13         
lcolladotor commented 3 years ago

Hi,

Thanks for the message. We provided all the symbol info from AnnotationDbi::mapIds() as shown at https://github.com/leekgroup/recount/blob/master/R/reproduce_ranges.R#L111-L114, which then enables users to choose which symbol they want to use for each gene. You can coerce it back to a regular character vector by for example, choosing the first symbol for each gene with rowRanges(rse)$symbol <- sapply(rowRanges(rse)$symbol, "[", 1), which is what AnnotationDbi::mapIds() will do if you want just 1 symbol per gene by default (it has several options).

> head(sapply(rowRanges(rse_gene)$symbol, "[", 1))
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
       "TSPAN6"          "TNMD"          "DPM1"         "SCYL3"      "C1orf112"
ENSG00000000938
          "FGR"

I adapted your code a bit (note the res$title vs rse$title difference too) and it works at comparable speeds on my slower laptop than your Ubuntu machine (based on 2 timed tests for each, which for benchmarking is not enough but well, we just wanted a general idea).

library(recount)
library(DESeq2)
library(SummarizedExperiment)
rse_gene <- rse_gene_SRP009615

## Coerce symbol to character
system.time({
  rse <- SummarizedExperiment(
    assays =  assays(rse_gene),
    colData = colData(rse_gene), 
    rowRanges = rowRanges(rse_gene)
  )
  rse$condition <- gsub(rse$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
  rowRanges(rse)$symbol <- sapply(rowRanges(rse)$symbol, "[", 1)
  dds <- DESeqDataSet(rse, design = ~condition)
  dds <- DESeq(dds)
})

# Timer without ranges
system.time({
  rse <- SummarizedExperiment(
    assays =  assays(rse_gene),
    colData = colData(rse_gene), 
    # rowRanges = rowRanges(rse_gene)
  )
  rse$condition <- gsub(rse$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1")
  dds <- DESeqDataSet(rse, design = ~condition)
  dds <- DESeq(dds)
})
```R > library(recount) > library(DESeq2) > library(SummarizedExperiment) > rse_gene <- rse_gene_SRP009615 > > system.time({ + rse <- SummarizedExperiment( + assays = assays(rse_gene), + colData = colData(rse_gene), + rowRanges = rowRanges(rse_gene) + ) + rse$condition <- gsub(rse$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1") + rowRanges(rse)$symbol <- sapply(rowRanges(rse)$symbol, "[", 1) + dds <- DESeqDataSet(rse, design = ~condition) + dds <- DESeq(dds) + }) converting counts to integer mode estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing user system elapsed 69.214 3.396 85.231 Warning message: In DESeqDataSet(rse, design = ~condition) : some variables in design formula are characters, converting to factors > library(recount) > library(DESeq2) > library(SummarizedExperiment) > rse_gene <- rse_gene_SRP009615 > > system.time({ + rse <- SummarizedExperiment( + assays = assays(rse_gene), + colData = colData(rse_gene), + rowRanges = rowRanges(rse_gene) + ) + rse$condition <- gsub(rse$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1") + rowRanges(rse)$symbol <- sapply(rowRanges(rse)$symbol, "[", 1) + dds <- DESeqDataSet(rse, design = ~condition) + dds <- DESeq(dds) + }) converting counts to integer mode estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing user system elapsed 67.985 5.780 88.274 Warning message: In DESeqDataSet(rse, design = ~condition) : some variables in design formula are characters, converting to factors > # Timer without ranges > system.time({ + rse <- SummarizedExperiment( + assays = assays(rse_gene), + colData = colData(rse_gene), + # rowRanges = rowRanges(rse_gene) + ) + rse$condition <- gsub(rse$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1") + dds <- DESeqDataSet(rse, design = ~condition) + dds <- DESeq(dds) + }) converting counts to integer mode estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing user system elapsed 65.179 4.396 79.016 Warning message: In DESeqDataSet(rse, design = ~condition) : some variables in design formula are characters, converting to factors > # Timer without ranges > system.time({ + rse <- SummarizedExperiment( + assays = assays(rse_gene), + colData = colData(rse_gene), + # rowRanges = rowRanges(rse_gene) + ) + rse$condition <- gsub(rse$title, pattern = ".+ targeting ([a-zA-Z0-9]+) gene.+", replacement = "sh\\1") + dds <- DESeqDataSet(rse, design = ~condition) + dds <- DESeq(dds) + }) converting counts to integer mode estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing rowRanges(rse) user system elapsed 68.221 5.267 86.310 Warning message: In DESeqDataSet(rse, design = ~condition) : some variables in design formula are characters, converting to factors > rowRanges(rse) NULL > packageVersion("recount") [1] ‘1.18.0’ > sessionInfo() R version 4.1.0 (2021-05-18) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur 10.16 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] DESeq2_1.32.0 recount_1.18.0 [3] SummarizedExperiment_1.22.0 Biobase_2.52.0 [5] GenomicRanges_1.44.0 GenomeInfoDb_1.28.1 [7] IRanges_2.26.0 S4Vectors_0.30.0 [9] BiocGenerics_0.38.0 MatrixGenerics_1.4.0 [11] matrixStats_0.60.0 testthat_3.0.4 [13] devtools_2.4.2 usethis_2.0.1 loaded via a namespace (and not attached): [1] backports_1.2.1 Hmisc_4.5-0 BiocFileCache_2.0.0 [4] plyr_1.8.6 splines_4.1.0 BiocParallel_1.26.1 [7] ggplot2_3.3.5 digest_0.6.27 foreach_1.5.1 [10] htmltools_0.5.1.1 prompt_1.0.1 fansi_0.5.0 [13] magrittr_2.0.1 checkmate_2.0.0 memoise_2.0.0 [16] BSgenome_1.60.0 cluster_2.1.2 tzdb_0.1.2 [19] limma_3.48.1 remotes_2.4.0 annotate_1.70.0 [22] Biostrings_2.60.1 readr_2.0.0 prettyunits_1.1.1 [25] jpeg_0.1-9 colorspace_2.0-2 blob_1.2.2 [28] rappdirs_0.3.3 xfun_0.24 dplyr_1.0.7 [31] jsonlite_1.7.2 callr_3.7.0 crayon_1.4.1 [34] RCurl_1.98-1.3 genefilter_1.74.0 GEOquery_2.60.0 [37] survival_3.2-11 VariantAnnotation_1.38.0 iterators_1.0.13 [40] glue_1.4.2 gtable_0.3.0 zlibbioc_1.38.0 [43] XVector_0.32.0 DelayedArray_0.18.0 pkgbuild_1.2.0 [46] colorout_1.2-2 rentrez_1.2.3 scales_1.1.1 [49] DBI_1.1.1 rngtools_1.5 derfinderHelper_1.26.0 [52] derfinder_1.26.0 Rcpp_1.0.7 xtable_1.8-4 [55] progress_1.2.2 htmlTable_2.2.1 rsthemes_0.2.1.9000 [58] bumphunter_1.34.0 foreign_0.8-81 bit_4.0.4 [61] Formula_1.2-4 htmlwidgets_1.5.3 httr_1.4.2 [64] RColorBrewer_1.1-2 ellipsis_0.3.2 pkgconfig_2.0.3 [67] XML_3.99-0.6 nnet_7.3-16 dbplyr_2.1.1 [70] locfit_1.5-9.4 utf8_1.2.2 reshape2_1.4.4 [73] tidyselect_1.1.1 rlang_0.4.11 AnnotationDbi_1.54.1 [76] munsell_0.5.0 tools_4.1.0 cachem_1.0.5 [79] downloader_0.4 cli_3.0.1 generics_0.1.0 [82] RSQLite_2.2.7 stringr_1.4.0 fastmap_1.1.0 [85] yaml_2.2.1 processx_3.5.2 knitr_1.33 [88] bit64_4.0.5 fs_1.5.0 purrr_0.3.4 [91] KEGGREST_1.32.0 doRNG_1.8.2 xml2_1.3.2 [94] biomaRt_2.48.2 compiler_4.1.0 rstudioapi_0.13 [97] filelock_1.0.2 curl_4.3.2 png_0.1-7 [100] geneplotter_1.70.0 tibble_3.1.3 stringi_1.7.3 [103] ps_1.6.0 GenomicFeatures_1.44.0 desc_1.3.0 [106] GenomicFiles_1.28.0 lattice_0.20-44 Matrix_1.3-4 [109] vctrs_0.3.8 pillar_1.6.2 lifecycle_1.0.0 [112] data.table_1.14.0 bitops_1.0-7 qvalue_2.24.0 [115] rtracklayer_1.52.0 R6_2.5.0 BiocIO_1.2.0 [118] latticeExtra_0.6-29 gridExtra_2.3 sessioninfo_1.1.1 [121] codetools_0.2-18 assertthat_0.2.1 pkgload_1.2.1 [124] rprojroot_2.0.2 rjson_0.2.20 withr_2.4.2 [127] GenomicAlignments_1.28.0 Rsamtools_2.8.0 GenomeInfoDbData_1.2.6 [130] hms_1.1.0 grid_4.1.0 rpart_4.1-15 [133] tidyr_1.1.3 base64enc_0.1-3 restfulr_0.0.13 > ```

So well, here it's not really a recount bug. It's just a matter of how you use the information provided and how we provide more than you might need.

Best, Leo

millerh1 commented 3 years ago

Thank you for the update! I will do this in the future.