Closed kashiff007 closed 3 years ago
I solve the this problem by just assigning the genome to reference_file variable:
reference_file <- ("Genome_CC7_new/CC7_R6_locked.softmasked.scaffolds.fa")
But while running:
snp.list <- extract_bams(bam_files, vcf_files, sample_names, cores=10, reference_file, coverage = 2)
I got the following error:
Reading reference file
Running sample CC7_25_1
Reading VCF file
Scanning file to determine attributes.
File attributes:
meta lines: 289
header_line: 290
variant count: 36668556
column count: 10
Meta line 289 read in.
All meta lines processed.
gt matrix initialized.
Character matrix gt created.
Character matrix gt rows: 36668556
Character matrix gt cols: 10
skip: 0
nrows: 36668556
row_num: 0
Processed variant 30978000
*** caught segfault ***
address 0x110, cause 'memory not mapped'
Traceback:
1: .read_body_gz(file, stats = stats, nrows = nrows, skip = skip, cols = cols, convertNA = as.numeric(convertNA), verbose = as.numeric(verbose))
2: vcfR::read.vcfR(vcfFiles[samp], verbose = verbose)
3: vcfR::getFIX(vcfR::read.vcfR(vcfFiles[samp], verbose = verbose))
4: FUN(X[[i]], ...)
5: lapply(sample_list, function(samp) { message(sprintf("Running sample %s", sampleNames[samp])) if (verbose) message("Reading VCF file", appendLF = TRUE) vcf <- vcfR::getFIX(vcfR::read.vcfR(vcfFiles[samp], verbose = verbose)) bam.file <- bamFiles[samp] if (verbose) message("Extracting methylation per SNP", appendLF = TRUE) snp.table <- parallel::mcmapply(function(t, u, v) { snp <- GRanges(seqnames = t, IRanges(start = as.integer(u), width = 1)) if (length(grep(t, c(as.character(1:22), "X", "Y"))) == 0) { if (verbose) message("Bad chrom", appendLF = TRUE) return(NULL) } chrom <- t params <- Rsamtools::ScanBamParam(tag = c("MD", "XM", "XR", "XG"), which = snp) alns.pairs <- readGAlignmentPairs(bam.file, use.names = TRUE, param = params) alns <- unlist(alns.pairs) split <- splitReads(alns, v, snp) alt.reads <- split$alt.reads ref.reads <- split$ref.reads if (length(ref.reads) <= 1 | length(alt.reads) <= 1) { if (verbose) message("0 or 1 read in one or both alleles", appendLF = TRUE) return(NULL) } left <- min(start(alns)) right <- max(end(alns)) window <- GRanges(GenomeInfoDb::seqnames(snp), IRanges(left, right)) dna <- Rsamtools::scanFa(fa, param = window) cgsite <- stringr::str_locate_all(dna, "CG")[[1]][, 1] if (length(cgsite) < 1) { if (verbose) message("No CpG sites associated to this SNP", appendLF = TRUE) return(NULL) } mepos <- cgsite/Biostrings::nchar(dna) conversion <- vapply(alns.pairs, function(x) { XGcondition <- mcols(x)$XG[1] if (XGcondition == "GA") { cgsite <- cgsite + 1 } conversion <- matrix(NA, 1, length(cgsite)) read.start <- start(x) - left + 1 read.end <- end(x) - left + 1 for (pair in 1:2) { something <- mcols(x[pair])$MD tag <- getMD(something) MDtag <- tag$MDtag nucl.num <- tag$nucl.num count <- 0 this.mepos <- cgsite - read.start[pair] + 1 for (p in seq_along(this.mepos)) { if (is.na(conversion[1, p])) { if (cgsite[p] > read.end[pair] || cgsite[p] < read.start[pair]) { conversion[1, p] <- NA next } else { for (i in seq_along(nucl.num)) { count <- count + nucl.num[i] if (count >= this.mepos[p]) { count <- 0 break } } if (MDtag[i] %in% c("C", "G")) { conversion[1, p] <- 1 } else { conversion[1, p] <- 2 } } } else { next } } } return(conversion) }, double(length(cgsite))) if (is.null(dim(conversion))) { ref.conv <- conversion[names(conversion) %in% ref.reads] alt.conv <- conversion[names(conversion) %in% alt.reads] ref.cov <- sum(!is.na(ref.conv)) alt.cov <- sum(!is.na(alt.conv)) ref.meth <- sum(!is.na(ref.conv) & ref.conv == 2) alt.meth <- sum(!is.na(alt.conv) & alt.conv == 2) } else { ref.conv <- conversion[, colnames(conversion) %in% ref.reads] alt.conv <- conversion[, colnames(conversion) %in% alt.reads] ref.cov <- BiocGenerics::rowSums(!is.na(ref.conv)) alt.cov <- BiocGenerics::rowSums(!is.na(alt.conv)) ref.meth <- BiocGenerics::rowSums(!is.na(ref.conv) & ref.conv == 2) alt.meth <- BiocGenerics::rowSums(!is.na(alt.conv) & alt.conv == 2) } GR <- GRanges(gsub("chr", "", chrom), IRanges(start = left + cgsite, width = 1)) mcols(GR)$cov.ref <- ref.cov mcols(GR)$cov.alt <- alt.cov mcols(GR)$meth.ref <- ref.meth mcols(GR)$meth.alt <- alt.meth mcols(GR)$snp <- paste0(GenomeInfoDb::seqnames(snp), ".", start(snp)) filt <- BiocGenerics::rowSums(cbind(GR$cov.ref, GR$cov.alt) >= coverage) >= 2 if (sum(filt) < 2) { if (verbose) message("No CpG sites sufficiently covered at this SNP", appendLF = TRUE) return(NULL) } gr <- GR[filt] return(gr) }, t = vcf[, 1], u = vcf[, 2], v = vcf[, 4], SIMPLIFY = F, USE.NAMES = F, mc.cores = cores) message(sprintf("Done with sample %s", sampleNames[samp])) return(snp.table)})
6: extract_bams(bam_files, vcf_files, sample_names, reference_file, coverage = 2)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:
What is the possible mistake I am doing?
Thanks in advance, Kashif
Hi @kashiff007, sorry for the late response, I wasn't getting any emails for some reason.
Yes for the first error, you basically have to provide the location of the reference file you used to do the alignment.
From the second error, it seems that you are using an older version of the package. Could you please install directly from github: BiocManager::install("markrobinsonuzh/DAMEfinder")
? or wait until the release version on bioconductor is updated to 1.0.3 https://bioconductor.org/packages/release/bioc/html/DAMEfinder.html, which should happen today or tomorrow.
Thank for the reply @sorjuela I have installed 1.0.3 version directly from github. For this, I have to update my R to 4.0.0 and bioconductor to 3.11.
All steps for SNP-based methods went well and after running:
snp.list <- extract_bams(bam_files, vcf_files, sample_names, reference_file, coverage = 2, cores=30)
following verbose printed:
Reading reference file
Running sample CC7_25_1
Reading VCF file
Extracting methylation per SNP
No CpG sites associated to this SNP
No CpG sites associated to this SNP
No CpG sites sufficiently covered at this SNP
No CpG sites sufficiently covered at this SNP
Bad chrom
Bad chrom
continues like this...
I am not sure whats the meaning of 'No CpG sites associated to this SNP', 'No CpG sites sufficiently covered at this SNP' and 'Bad chrom'. Because after this step completes snp.list$CC7_25_1[[1]] shows NULL (in every case its NULL). I am wondering whether I am directing towards right path. Could you comment on this?
I have made changes to attain same chromosome name style in my BAM, BAM index, vcf and genome.
Hi @kashiff007
Can you post the output from sessionInfo()
please?
Here it is.
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.2
And for DAMEfinder
> sessionInfo(package = "DAMEfinder")
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.7 LTS
Matrix products: default
BLAS: /usr/local/lib/R/lib/libRblas.so
LAPACK: /usr/local/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
character(0)
other attached packages:
[1] DAMEfinder_1.1.3
loaded via a namespace (and not attached):
[1] pillar_1.4.6 compiler_4.0.2
[3] GenomeInfoDb_1.24.2 XVector_0.28.0
[5] methods_4.0.2 bitops_1.0-6
[7] utils_4.0.2 tools_4.0.2
[9] grDevices_4.0.2 zlibbioc_1.34.0
[11] lifecycle_0.2.0 tibble_3.0.3
[13] gtable_0.3.0 lattice_0.20-41
[15] pkgconfig_2.0.3 rlang_0.4.7
[17] Matrix_1.2-18 DelayedArray_0.14.1
[19] parallel_4.0.2 GenomeInfoDbData_1.2.3
[21] stringr_1.4.0 dplyr_1.0.2
[23] generics_0.0.2 Biostrings_2.56.0
[25] vctrs_0.3.4 graphics_4.0.2
[27] S4Vectors_0.26.1 datasets_4.0.2
[29] stats_4.0.2 IRanges_2.22.2
[31] stats4_4.0.2 grid_4.0.2
[33] tidyselect_1.1.0 glue_1.4.2
[35] base_4.0.2 Biobase_2.48.0
[37] R6_2.4.1 BiocParallel_1.22.0
[39] ggplot2_3.3.2 purrr_0.3.4
[41] magrittr_1.5 Rsamtools_2.4.0
[43] scales_1.1.1 ellipsis_0.3.1
[45] matrixStats_0.57.0 BiocGenerics_0.34.0
[47] GenomicAlignments_1.24.0 GenomicRanges_1.40.0
[49] SummarizedExperiment_1.18.2 colorspace_1.4-1
[51] stringi_1.5.3 RCurl_1.98-1.2
[53] munsell_0.5.0 crayon_1.3.4
Reading reference file Running sample CC7_25_1 Reading VCF file Extracting methylation per SNP No CpG sites associated to this SNP No CpG sites associated to this SNP No CpG sites sufficiently covered at this SNP No CpG sites sufficiently covered at this SNP Bad chrom Bad chrom continues like this...
I think you are still running and old version of the package, this Bad chrom
warning was removed in the latest version, so I'm a bit confused. If you load the package, and then run sessionInfo()
is it the same?
I have again updated the DAMEfinder and now the 'Bad chrom' is gone. And now only 'No CpG sites associated to this SNP' and 'No CpG sites sufficiently covered at this SNP' are appearing on the screen. Does it mean no error and I am moving in right direction?
yes, you can ignore the messages, it's basically going through each SNP and checking if that SNP has a CpG in the same read pair. Might take some time to finish depending on the number of SNPs in your VCF file.
Hi,
I know from vignette how to upload one chr/scaffold into reference_file
reference_file <- DNAStringSet(genome[[19]], use.names = TRUE)
But having problem in uploading whole genome fasta file containing multiple chr/scaffolds. My genome looks like:
I also looked into manual of extract_bams, but couldn't found the solution.
Can you provide a solution/command by which I can perform this operation.
Thank you, Kashif