command readGAlignments (step 4 create_master_table wtc11) takes very long

giovanniquinones commented 2 years ago

Hi, I have been trying to reproduce the variant-calling benchmark analysis from your preprint (https://doi.org/10.1101/2022.02.08.479579 ), but when I get to step 4, more specifically the command "ReadGAlignments" it seems to be a very computationally-intensive step. Running on 8-cores and 24 Gb of memory, it still takes does not complete after 16 hrs.

I was just wondering if this sounds right, or if I missed something. I got all the VCF files and bam files from the Zenodo repository https://zenodo.org/record/6332914

These are the first lines of the R script mt_wtc11_allMethods_v7.R

METHOD_NAMES <- c("dv", "dv_s", "dv_s_fc",
                  "c3", "c3_s", "c3_s_fc", "c3_mix",
                  "gatk_s", "nc_s")

# name of the dataset used with the methods to validate
METHOD_DATASET_NAME <- "isoSeq"
# VCF files

METHOD_VCF_FILES <- c(
  "3_call_variants/wtc11/wtc11-variant_calling_from_isoseq-deepvariant-dv-deepvariant_calls_pass.vcf.gz",
  "3_call_variants/wtc11/wtc11-variant_calling_from_isoseq-deepvariant-dv_sncr-deepvariant_calls_pass.vcf.gz",
  "3_call_variants/wtc11/wtc11-variant_calling_from_isoseq-deepvariant-dv_sncr_fc-deepvariant_calls_pass.vcf.gz",
  "3_call_variants/wtc11/wtc11-variant_calling_from_isoseq-clair3-c3-pileup_pass.vcf.gz",
  "3_call_variants/wtc11/wtc11-variant_calling_from_isoseq-clair3-c3_sncr-pileup_pass.vcf.gz",
  "3_call_variants/wtc11/wtc11-variant_calling_from_isoseq-clair3-c3_sncr_fc-pileup_pass.vcf.gz",
  "3_call_variants/wtc11/wtc11-variant_calling_from_isoseq-clair3-mix-pileup_pass_mix_norep.recode.vcf.gz",
  "3_call_variants/wtc11/wtc11-variant_calling_from_isoseq-gatk-isoSeq_wtc11.recal_pass.vcf.gz",
  "3_call_variants/wtc11/wtc11-variant_calling_from_isoseq-nanocaller-nc_sncr-variant_calls.final.vcf.gz"
)
# BAM of the data
METHOD_BAM_FILE <- "3_call_variants/wtc11/1_before_variant_calling/wtc11-data_manipulation-aln_s.bam"

### ground-truth
# name
TRUTH_NAME <- "allen"
# name of the dataset used to generate the ground-truth
TRUTH_DATASET_NAME <- "shortRead"
# VCF file
TRUTH_VCF_FILE <- "2_generate_ground_truth/wtc11/ground_truth/3546dc62_AH77TTBBXX_DS-229105_GCCAAT_recalibrated_subsetChromosomes_pass.vcf.gz"

# BAM file
# downloaded by script /home/vbarbo/project_2021/projects/lrRNAseqVariantCalling/1_download_data/wtc11/not_filtered_ground_truth_vcf/wtc11_vcf.sh
TRUTH_BAM_FILE <- "1_download_data/wtc11/84773251_trimmed.AH77TTBBXX_DS-229105_GCCAAT.sorted.rg.final.bam"

### variables
MAX_DIST_FROM_SPLICE_SITE <- 20
THREADS <- 30
GENOME_REF_FILE <- "1_download_data/reference/GRCh38.p13_all_chr.fasta"

and this is the line that takes a long time: method_bam <- readGAlignments(METHOD_BAM_FILE)

Thank you!

vladimirsouza commented 2 years ago

Hi giovanniquinones,

Thank you for your interest in our paper.

The readGAlignments function is from the GenomicAlignments package. I downloaded the wtc11-data_manipulation-aln_s.bam file from our Zenodo repository and read it into R using readGAlignments and it takes less than 2 minutes. I do not know why it is taking so long for you run the function, but maybe the problem can be fixed by reinstalling the package GenomicAlignments. I used R version 4.0.5 and GenomicAlignments version 1.24.0. But I would recommend you install the newest version.

giovanniquinones commented 2 years ago

Thanks for your response. I am using R 4.1.0 and GenomicAlignments version 1.28.0 I believe the issue may be in the parallel processing of these functions. Could you please share your R sessionInfo() and/or BiocParallel registered() info or other relevant info to the parallel execution?

Thank you!

vladimirsouza commented 2 years ago

Hi giovanniquinones, I don't think this is due to the parallel processing. In that script to generate a master table, multiple cores are used only internally in the function get_splice_sites_info, but it is used after using the readGAlignments function. Here is my sessionInfo:

> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/local/R/R-4.0.5/lib/libRblas.so
LAPACK: /usr/local/R/R-4.0.5/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8
 [4] LC_COLLATE=en_CA.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] sarlacc_1.0.0               dplyr_1.0.8                 GenomicAlignments_1.24.0
 [4] Rsamtools_2.4.0             Biostrings_2.56.0           XVector_0.28.0
 [7] SummarizedExperiment_1.18.2 DelayedArray_0.14.1         matrixStats_0.56.0
[10] Biobase_2.48.0              GenomicRanges_1.40.0        GenomeInfoDb_1.24.2
[13] IRanges_2.22.2              S4Vectors_0.26.1            BiocGenerics_0.34.0
[16] lrRNAseqBenchmark_0.1.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3           ape_5.6-2              lattice_0.20-41
 [4] tidyr_1.2.0            png_0.1-7              assertthat_0.2.1
 [7] utf8_1.2.2             R6_2.5.1               ShortRead_1.46.0
[10] ggplot2_3.3.5          pillar_1.7.0           zlibbioc_1.34.0
[13] rlang_1.0.2            vegan_2.6-2            Matrix_1.4-1
[16] splines_4.0.5          pinfsc50_1.2.0         BiocParallel_1.22.0
[19] stringr_1.4.0          RCurl_1.98-1.2         munsell_0.5.0
[22] compiler_4.0.5         pkgconfig_2.0.3        mgcv_1.8-32
[25] tidyselect_1.1.2       tibble_3.1.6           GenomeInfoDbData_1.2.3
[28] memuse_4.2-1           fansi_1.0.3            permute_0.9-7
[31] viridisLite_0.4.0      crayon_1.5.1           MASS_7.3-52
[34] bitops_1.0-6           grid_4.0.5             nlme_3.1-149
[37] gtable_0.3.0           lifecycle_1.0.1        DBI_1.1.1
[40] magrittr_2.0.3         scales_1.2.0           cli_3.2.0
[43] stringi_1.7.6          hwriter_1.3.2          vcfR_1.12.0
[46] latticeExtra_0.6-29    snakecase_0.11.0       ellipsis_0.3.2
[49] generics_0.1.2         vctrs_0.4.1            RColorBrewer_1.1-3
[52] tools_4.0.5            glue_1.6.2             purrr_0.3.4
[55] jpeg_0.1-8.1           colorspace_2.0-3       cluster_2.1.0

giovanniquinones commented 2 years ago

I am sorry, that step does not take too long to run, it was the subsequent step get_splice_sites_info that takes very long, and it says so in the script. I apologize for the inconvenience.

vladimirsouza commented 2 years ago

Hi giovanniquinones, Sorry, but the get_splice_sites_info function really takes a long time to run. Could you run it until it finishes? If you want, I can make a file with its output and make it available.

> splice_sites <- get_splice_sites_info(method_bam, THREADS)
> splice_sites
# A tibble: 601,795 × 4
# Groups:   chrm, pos [601,536]
   chrm    pos is_acceptor_site     n
   <chr> <int>            <int> <int>
 1 chr1  14829                0    93
 2 chr1  14930                1     1
 3 chr1  14970                1    91
 4 chr1  15012                0     1
 5 chr1  15038                0    91
 6 chr1  15311                1     1
 7 chr1  15539                0     1
 8 chr1  15796                1    90
 9 chr1  15947                0    57
10 chr1  16027                0     2
# … with 601,785 more rows

giovanniquinones commented 2 years ago

I was able to run the entire script and obtain Fig 1 (so far). But yes that step takes a while. This is the code I have:

print(paste(Sys.time(), "LOG: readGAlignments", sep = ' '))
method_bam   <- readGAlignments(METHOD_BAM_FILE)
print(paste(Sys.time(), "LOG: get_splice_sites", sep = ' '))
splice_sites <- get_splice_sites_info(method_bam, THREADS)
print(paste(Sys.time(), "LOG: add_splice_sites", sep = ' '))

This is the output:

[1] "2022-06-14 21:06:07 LOG: readGAlignments"
[1] "2022-06-14 21:07:29 LOG: get_splice_sites"
[1] "2022-06-15 18:49:34 LOG: add_splice_sites"

As you can see it took almost 22 hrs to run, but it did finish running and I get the same output.

I also get these warnings in the log file, maybe they have something to do:

Warning messages:
1: replacing previous import 'IRanges::collapse' by 'dplyr::collapse' when loading 'lrRNAseqBenchmark' 
2: replacing previous import 'IRanges::union' by 'dplyr::union' when loading 'lrRNAseqBenchmark' 
3: replacing previous import 'IRanges::slice' by 'dplyr::slice' when loading 'lrRNAseqBenchmark' 
4: replacing previous import 'IRanges::intersect' by 'dplyr::intersect' when loading 'lrRNAseqBenchmark' 
5: replacing previous import 'IRanges::setdiff' by 'dplyr::setdiff' when loading 'lrRNAseqBenchmark' 
6: replacing previous import 'IRanges::desc' by 'dplyr::desc' when loading 'lrRNAseqBenchmark'

My rough guess is that e.g. the efficient collapse function from IRanges is getting masked by a the less efficient collapse function form dplyr

vladimirsouza commented 2 years ago

Hi giovanniquinones, Thank you very much for your feedback. I made a change in the function get_splice_sites_info and now it is almost 30 times faster. I have updated it in the repo lrRNAseqBenchmark. The problem was the use of dplyr pipes (%>%) inside the function.

vladimirsouza / lrRNAseqVariantCalling

command readGAlignments (step 4 create_master_table wtc11) takes very long #3