vibansal / crisp

Code for multi-sample variant calling from sequence data of pooled or unpooled DNA samples
MIT License
19 stars 8 forks source link

Segfault and wrong chromosome offset while using --region argument #16

Closed THYeh44 closed 4 years ago

THYeh44 commented 4 years ago
  1. The first problem is Segmentation fault which has been reported in 2015. It seems that Segfault happened stochastically in my case since sometimes it appears right after the command is entered, and sometimes it showed up after ~20-30 mins run.

  2. Because Segfault doesn't always happen right after the run, so I'm trying to resolve this by using --region argument and that brings up the second problem: before chromosome four everything works fine, but when entering "--region V", instead of reading read chromosome five but it jumps to chromosome X (in the .vcf and .log file); and "--region X" actually reads mitochondria DNA (C. elegans genome, I-V, X, +MtDNA). Here's the command line and result when I use --region IV, --region V, --region X ,and --MtDNA.

--region IV:
./CRISP-genotypes --bams bamlist.txt  --ref Caenorhabditis_elegans.WBcel235.dna.toplevel.fa --VCF VariantsCall.vcf --leftalign 1 --EM 1 **--regions IV** > variantcalls.log

insertions and deletions in reads will be left justified w.r.t. reference sequence (limited evaluation) 

EM algorithm will be used for estimating genotypes and calling variants

CRISP options: QVoffset 33 min_base_quality 13 min_mapping_score 20 max_permutations 20000 poolsize 2 CT-pvalue-thresh -3.5 QV-pvalue-thresh -5.0

reading fasta index file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.fai ... fasta file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa has 7 chromosomes/contigs

processing 6 bamfiles: eba3609e4ffc4b38839905c3701e70ac.sorted.bam ..... a4b213a41072415db4e36c7c929a5124.sorted.bam 

poolsizes are variable and read from file: 36...440
reading bam file index: eba3609e4ffc4b38839905c3701e70ac.sorted.bam.bai **region 3:0-536870912**
reading bam file index: a4b213a41072415db4e36c7c929a5124.sorted.bam.bai **region 3:0-536870912**
finished reading bam file indexes 
reading **chromosome IV offset 44871489** 
.....processed 1000000 reads QSIZE:2600 IV:103394:103545 variants called 26
.....processed 2000000 reads QSIZE:3041 IV:212012:212163 variants called 81
.....processed 3000000 reads QSIZE:2540 IV:320242:320393 variants called 190
--region V:
./CRISP-genotypes --bams bamlist.txt  --ref Caenorhabditis_elegans.WBcel235.dna.toplevel.fa --VCF VariantsCall.vcf --leftalign 1 --EM 1 **--regions V** > variantcalls.log

insertions and deletions in reads will be left justified w.r.t. reference sequence (limited evaluation) 

EM algorithm will be used for estimating genotypes and calling variants

CRISP options: QVoffset 33 min_base_quality 13 min_mapping_score 20 max_permutations 20000 poolsize 2 CT-pvalue-thresh -3.5 QV-pvalue-thresh -5.0

reading fasta index file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.fai ... fasta file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa has 7 chromosomes/contigs

processing 6 bamfiles: eba3609e4ffc4b38839905c3701e70ac.sorted.bam ..... a4b213a41072415db4e36c7c929a5124.sorted.bam 

poolsizes are variable and read from file: 36...440
reading bam file index: eba3609e4ffc4b38839905c3701e70ac.sorted.bam.bai **region 5:0-536870912**
reading bam file index: a4b213a41072415db4e36c7c929a5124.sorted.bam.bai **region 5:0-536870912**
finished reading bam file indexes 
reading **chromosome X offset 83929913** 
.....processed 1000000 reads QSIZE:4273 V:103732:103883 variants called 13
.....processed 2000000 reads QSIZE:2603 V:210786:210937 variants called 13
.....processed 3000000 reads QSIZE:68388 V:266051:266202 variants called 13
--region X:
./CRISP-genotypes --bams bamlist.txt  --ref Caenorhabditis_elegans.WBcel235.dna.toplevel.fa --VCF VariantsCall.vcf --leftalign 1 --EM 1 **--regions X** > variantcalls.log

insertions and deletions in reads will be left justified w.r.t. reference sequence (limited evaluation) 

EM algorithm will be used for estimating genotypes and calling variants

CRISP options: QVoffset 33 min_base_quality 13 min_mapping_score 20 max_permutations 20000 poolsize 2 CT-pvalue-thresh -3.5 QV-pvalue-thresh -5.0

reading fasta index file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.fai ... fasta file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa has 7 chromosomes/contigs

processing 6 bamfiles: eba3609e4ffc4b38839905c3701e70ac.sorted.bam ..... a4b213a41072415db4e36c7c929a5124.sorted.bam 

poolsizes are variable and read from file: 36...440
reading bam file index: eba3609e4ffc4b38839905c3701e70ac.sorted.bam.bai **region 6:0-536870912**
reading bam file index: a4b213a41072415db4e36c7c929a5124.sorted.bam.bai **region 6:0-536870912**
finished reading bam file indexes 
reading **chromosome MtDNA offset 101944233** 
reference base  at position MtDNA:13794 is ambiguous, site will be ignored, provide reference fasta file with bases in [ACTGN] to call variants at these sites
reference base  at position MtDNA:13795 is ambiguous, site will be ignored, provide reference fasta file with bases in [ACTGN] to call variants at these sites
reference base  at position MtDNA:13796 is ambiguous, site will be ignored, provide reference fasta file with bases in [ACTGN] to call variants at these sites
--regions MtDNA
 ./CRISP-genotypes --bams bamlist.txt  --ref Caenorhabditis_elegans.WBcel235.dna.toplevel.fa --VCF VariantsCall.vcf --leftalign 1 --EM 1 **--regions XMtDNA** > variantcalls.log

insertions and deletions in reads will be left justified w.r.t. reference sequence (limited evaluation) 

EM algorithm will be used for estimating genotypes and calling variants

CRISP options: QVoffset 33 min_base_quality 13 min_mapping_score 20 max_permutations 20000 poolsize 2 CT-pvalue-thresh -3.5 QV-pvalue-thresh -5.0

reading fasta index file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.fai ... fasta file Caenorhabditis_elegans.WBcel235.dna.toplevel.fa has 7 chromosomes/contigs

processing 6 bamfiles: eba3609e4ffc4b38839905c3701e70ac.sorted.bam ..... a4b213a41072415db4e36c7c929a5124.sorted.bam 

poolsizes are variable and read from file: 36...440
reading bam file index: eba3609e4ffc4b38839905c3701e70ac.sorted.bam.bai region 4:0-536870912
reading bam file index: a4b213a41072415db4e36c7c929a5124.sorted.bam.bai region 4:0-536870912
finished reading bam file indexes 
reading chromosome V offset 62656939 
.....processed 1000000 reads QSIZE:48301 MtDNA:5292:5444 variants called 0
.....processed 2000000 reads QSIZE:44379 MtDNA:10965:11116 variants called 0
processing 10883 reads left in queue for chrom V.....cleaning read queue from prev chrom
finished processing reads for chrom V 

CRISP has finished processing bam files: total reads processed 2331662 total variants called 0 
samtools view -H sorted.bam 
@HD VN:1.3  SO:coordinate
@SQ SN:I    LN:15072434
@SQ SN:II   LN:15279421
@SQ SN:III  LN:13783801
@SQ SN:IV   LN:17493829
@SQ SN:MtDNA    LN:13794
@SQ SN:V    LN:20924180
@SQ SN:X    LN:17718942
@RG ID:2898842525   LB:CEAG-EMS-JPY-02-lib1 PL:ILLUMINA PU:HF5NTCCXY.5  SM:CEAG-EMS-JPY-02
@RG ID:2898995713   LB:CEAG-EMS-JPY-02-lib1 PL:ILLUMINA PU:HFKHHCCXY.2  SM:CEAG-EMS-JPY-02
THYeh44 commented 4 years ago

The reason I have the wrong chromosome offset is because of the .bam files and the reference fasta file has a different chromosome order. Everything works just fine again after changing the chromosome order of the reference file to the same as .bam file.

vibansal commented 4 years ago

That makes sense. It is important to make sure that the same fasta file is used for alignment and variant calling.

THYeh44 commented 4 years ago

I've been using the same fast file for alignment and several different variants calling scripts, it's my first time having this problem.

TH Sent from my iPhone

On Dec 6, 2019, at 6:03 PM, Bansal Lab notifications@github.com wrote:

 That makes sense. It is important to make sure that the same fasta file is used for alignment and variant calling.

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub, or unsubscribe.