Closed giovanniquinones closed 1 year ago
Hi giovanniquinones,
Thank you for your interest in our paper.
The readGAlignments
function is from the GenomicAlignments
package. I downloaded the wtc11-data_manipulation-aln_s.bam
file from our Zenodo repository and read it into R
using readGAlignments
and it takes less than 2 minutes. I do not know why it is taking so long for you run the function, but maybe the problem can be fixed by reinstalling the package GenomicAlignments
. I used R
version 4.0.5 and GenomicAlignments
version 1.24.0. But I would recommend you install the newest version.
Thanks for your response.
I am using R 4.1.0 and GenomicAlignments
version 1.28.0
I believe the issue may be in the parallel processing of these functions. Could you please share your R sessionInfo() and/or BiocParallel registered() info or other relevant info to the parallel execution?
Thank you!
Hi giovanniquinones,
I don't think this is due to the parallel processing. In that script to generate a master table, multiple cores are used only internally in the function get_splice_sites_info, but it is used after using the readGAlignments
function.
Here is my sessionInfo:
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS
Matrix products: default
BLAS: /usr/local/R/R-4.0.5/lib/libRblas.so
LAPACK: /usr/local/R/R-4.0.5/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] sarlacc_1.0.0 dplyr_1.0.8 GenomicAlignments_1.24.0
[4] Rsamtools_2.4.0 Biostrings_2.56.0 XVector_0.28.0
[7] SummarizedExperiment_1.18.2 DelayedArray_0.14.1 matrixStats_0.56.0
[10] Biobase_2.48.0 GenomicRanges_1.40.0 GenomeInfoDb_1.24.2
[13] IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.34.0
[16] lrRNAseqBenchmark_0.1.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.8.3 ape_5.6-2 lattice_0.20-41
[4] tidyr_1.2.0 png_0.1-7 assertthat_0.2.1
[7] utf8_1.2.2 R6_2.5.1 ShortRead_1.46.0
[10] ggplot2_3.3.5 pillar_1.7.0 zlibbioc_1.34.0
[13] rlang_1.0.2 vegan_2.6-2 Matrix_1.4-1
[16] splines_4.0.5 pinfsc50_1.2.0 BiocParallel_1.22.0
[19] stringr_1.4.0 RCurl_1.98-1.2 munsell_0.5.0
[22] compiler_4.0.5 pkgconfig_2.0.3 mgcv_1.8-32
[25] tidyselect_1.1.2 tibble_3.1.6 GenomeInfoDbData_1.2.3
[28] memuse_4.2-1 fansi_1.0.3 permute_0.9-7
[31] viridisLite_0.4.0 crayon_1.5.1 MASS_7.3-52
[34] bitops_1.0-6 grid_4.0.5 nlme_3.1-149
[37] gtable_0.3.0 lifecycle_1.0.1 DBI_1.1.1
[40] magrittr_2.0.3 scales_1.2.0 cli_3.2.0
[43] stringi_1.7.6 hwriter_1.3.2 vcfR_1.12.0
[46] latticeExtra_0.6-29 snakecase_0.11.0 ellipsis_0.3.2
[49] generics_0.1.2 vctrs_0.4.1 RColorBrewer_1.1-3
[52] tools_4.0.5 glue_1.6.2 purrr_0.3.4
[55] jpeg_0.1-8.1 colorspace_2.0-3 cluster_2.1.0
I am sorry, that step does not take too long to run, it was the subsequent step get_splice_sites_info
that takes very long, and it says so in the script.
I apologize for the inconvenience.
Hi giovanniquinones,
Sorry, but the get_splice_sites_info
function really takes a long time to run. Could you run it until it finishes?
If you want, I can make a file with its output and make it available.
> splice_sites <- get_splice_sites_info(method_bam, THREADS)
> splice_sites
# A tibble: 601,795 × 4
# Groups: chrm, pos [601,536]
chrm pos is_acceptor_site n
<chr> <int> <int> <int>
1 chr1 14829 0 93
2 chr1 14930 1 1
3 chr1 14970 1 91
4 chr1 15012 0 1
5 chr1 15038 0 91
6 chr1 15311 1 1
7 chr1 15539 0 1
8 chr1 15796 1 90
9 chr1 15947 0 57
10 chr1 16027 0 2
# … with 601,785 more rows
I was able to run the entire script and obtain Fig 1 (so far). But yes that step takes a while. This is the code I have:
print(paste(Sys.time(), "LOG: readGAlignments", sep = ' '))
method_bam <- readGAlignments(METHOD_BAM_FILE)
print(paste(Sys.time(), "LOG: get_splice_sites", sep = ' '))
splice_sites <- get_splice_sites_info(method_bam, THREADS)
print(paste(Sys.time(), "LOG: add_splice_sites", sep = ' '))
This is the output:
[1] "2022-06-14 21:06:07 LOG: readGAlignments"
[1] "2022-06-14 21:07:29 LOG: get_splice_sites"
[1] "2022-06-15 18:49:34 LOG: add_splice_sites"
As you can see it took almost 22 hrs to run, but it did finish running and I get the same output.
I also get these warnings in the log file, maybe they have something to do:
Warning messages:
1: replacing previous import 'IRanges::collapse' by 'dplyr::collapse' when loading 'lrRNAseqBenchmark'
2: replacing previous import 'IRanges::union' by 'dplyr::union' when loading 'lrRNAseqBenchmark'
3: replacing previous import 'IRanges::slice' by 'dplyr::slice' when loading 'lrRNAseqBenchmark'
4: replacing previous import 'IRanges::intersect' by 'dplyr::intersect' when loading 'lrRNAseqBenchmark'
5: replacing previous import 'IRanges::setdiff' by 'dplyr::setdiff' when loading 'lrRNAseqBenchmark'
6: replacing previous import 'IRanges::desc' by 'dplyr::desc' when loading 'lrRNAseqBenchmark'
My rough guess is that e.g. the efficient collapse
function from IRanges is getting masked by a the less efficient collapse
function form dplyr
Hi giovanniquinones,
Thank you very much for your feedback. I made a change in the function get_splice_sites_info
and now it is almost 30 times faster. I have updated it in the repo lrRNAseqBenchmark.
The problem was the use of dplyr
pipes (%>%
) inside the function.
Hi, I have been trying to reproduce the variant-calling benchmark analysis from your preprint (https://doi.org/10.1101/2022.02.08.479579 ), but when I get to step 4, more specifically the command "ReadGAlignments" it seems to be a very computationally-intensive step. Running on 8-cores and 24 Gb of memory, it still takes does not complete after 16 hrs.
I was just wondering if this sounds right, or if I missed something. I got all the VCF files and bam files from the Zenodo repository https://zenodo.org/record/6332914
These are the first lines of the R script
mt_wtc11_allMethods_v7.R
and this is the line that takes a long time:
method_bam <- readGAlignments(METHOD_BAM_FILE)
Thank you!