hsinnan75 / MapCaller

MapCaller – An efficient and versatile approach for short-read alignment and variant detection in high-throughput sequenced genomes
MIT License
29 stars 5 forks source link

Include monomorphic sites in VCF #24

Open carlosmag opened 4 years ago

carlosmag commented 4 years ago

Hi!

I miss an option to report all sites in VCF outupt, even the ones without genomic variants. That would allow me to flag potentially non-informative sites, e.g. with low read depth. Applications in the comparison of multiple bacterial isolates, phylogeny, etc.

P.S. MapCaller is rocking on my 6-core i7-9750H 2.60GHz CPU. ~ 2 min for 2x5GB fastq. Congratulations for your work!

hsinnan75 commented 4 years ago

Hi, what do you mean by "all sites without genomic variants". Could you please explain with more detail? Thank you.

carlosmag commented 4 years ago

Perhaps best explained in GATK documentation

The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps. The records in a GVCF include an accurate estimation of how confident we are in the determination that the sites are homozygous-reference or not.

hsinnan75 commented 4 years ago

Thank you for the explanation. I'll study the details of GVCF and try to add this feature.

hsinnan75 commented 4 years ago

Hi, I've added basic support for gVCF (v0.9.9.b). To output gVCF, Run MapCaller with "-gvcf". I read the format specification and follow Freebayes's tags to output gVCF file. Let me know if you find any problems with gVCF output. Thank you!

carlosmag commented 4 years ago

Hi! Thanks for the update!

I did some tests and noticed the tool is just reporting the genomic position immediately after a called variant.

-gvcf output:

MTB_anc 4394210 . C G 30 PASS DP=90;AD=90;AF=1.000;GT=1|1;TYPE=SUBSTITUTE MTB_anc 4394211 . A <*> 0 . END=4395386;DP=90;MIN_DP=73

However, I needed a VCF file reporting from position 1 to the last genomic coordinate of the reference genome.

To clarify, here follows the corresponding FreeBayes option (same definition as GATK I believe):

--report-monomorphic Report even loci which appear to be monomorphic, and report all considered alleles, even those which are not in called genotypes. Loci which do not have any potential alternates have '.' for ALT.

hsinnan75 commented 4 years ago

I've uploaded a new version (v0.9.9.c) to report all loci which do not have any potential alternates. Please run MapCaller with "-monomorphic" instead of "-gvcf". In fact, MapCaller outputs the similar results with Freebayes when putting "-gvcf." I compared MapCaller's output (-monomorphic) to Freebayes's output (--report-monomorphic) and they looked similarly, though the QUAL of all monomorphic loci are set to 0. I am not sure how to calculate the QUAL for those loci.

carlosmag commented 4 years ago

Thanks for the implementation!

I believe I spot some inconsistencies in the outputs, for which I might create separate issues if relevant.

ERR2653232 Reference genome

MapCaller -ploidy 1 -i MTB_anc.fasta -f ERR2653232_1.fastq -f2 ERR2653232_1.fastq -monomorphic

CHROM POS ID REF ALT QUAL FILTER INFO

MTB_anc 1 . ^@ . 0 . DP=1;NS=1;DUP=0;GT=0|0 MTB_anc 2 . T . 0 . DP=7;NS=7;DUP=4;GT=0|0

grep -w '79548' -B1 -A10 output.vcf

MTB_anc 79547 . G . 0 . DP=343;NS=343;DUP=5;GT=0|0 MTB_anc 79548 . CCCGGTGGAC C 14 PASS DP=385;AD=181;DUP=5;AF=0.470;GT=0|1;TYPE=DELETE MTB_anc 79548 . A . 0 . DP=348;NS=347;DUP=5;GT=0|0 MTB_anc 79550 . C . 0 . DP=167;NS=167;DUP=1;GT=0|0 MTB_anc 79551 . C . 0 . DP=170;NS=170;DUP=3;GT=0|0 MTB_anc 79552 . G . 0 . DP=172;NS=172;DUP=2;GT=0|0 MTB_anc 79553 . G . 0 . DP=176;NS=176;DUP=2;GT=0|0 MTB_anc 79554 . T . 0 . DP=178;NS=178;DUP=1;GT=0|0 MTB_anc 79555 . G . 0 . DP=186;NS=186;DUP=5;GT=0|0 MTB_anc 79556 . G . 0 . DP=194;NS=194;DUP=4;GT=0|0 MTB_anc 79557 . A . 0 . DP=198;NS=198;DUP=2;GT=0|0 MTB_anc 79558 . C . 0 . DP=383;NS=383;DUP=3;GT=0|0 MTB_anc 79559 . G . 0 . DP=391;NS=391;DUP=5;GT=0|0

grep -w '103871' -C1 output.vcf

MTB_anc 103870 . G . 0 . DP=7;NS=7;DUP=0;GT=0|0 MTB_anc 103871 . T . 0 . DP=3;NS=3;DUP=0;GT=0|0 MTB_anc 104617 . A . 0 . DP=2;NS=2;DUP=1;GT=0|0

IGV output secondayalignments_igv

This might help with QUAL. From VCFv4.3 specification

QUAL — quality: Phred-scaled quality score for the assertion made in ALT. i.e.−10log10prob(call in ALT iswrong). If ALT is ‘.’ (no variant) then this is−10log10prob(variant), and if ALT is not ‘.’ this is−10log10prob(no variant). If unknown, the MISSING value must be specified. (Float)

hsinnan75 commented 4 years ago

Hello,

Thank you for reporting the errors. I've found those bugs and corrected them. Please update MapCaller to v.0.9.9.d. Now issue#1 and #2 have been fixed. Those errors were due to wrong position calculation. However, issue#3 involves ambiguous alignments. The algorithm of MapCaller ignores all read alignments with multiple mapping coordinates. In such cases, it is difficult to tell which coordinate is correct, therefore, we just ignore all such alignments. In fact, MapCaller will report them as CNV events in the future version. Please let me know if you have any suggestions about how to deal with those regions.

You may specify output name by using "-vcf FileName.vcf". Thank you again.

carlosmag commented 4 years ago

Thanks for the fixes!

hsinnan75 commented 4 years ago

Hello,

I'd like to explain why some positions are not reported when you applying "-monomorphic". It is because the read depths at those positions are "0". That is no any read alignment mapped to those regions. I also checked Freebayes's result when applying "report-monomorphic", it also produces nothing for those regions. I was wondering if you still want MapCaller to output all positions regardless of read depths.

hsinnan75 commented 4 years ago

Hi, I'd like to know how to calculate the mean base quality. Is it the average base quality of all read bases at a specific position or it is the average base quality of the ALT allele?

tseemann commented 4 years ago

I was wondering if you still want MapCaller to output all positions regardless of read depths.

yes, i would like to see every site reported in -monomorphic mode, including DP=0 sites.

Is it the average base quality of all read bases at a specific position or it is the average base quality of the ALT allele?

Good question. What specific VCF tag/field are you wanting to put this result into?

hsinnan75 commented 4 years ago

Since MapCaller does not output read alignments to a SAM/BAM file, it would consume memory space to store information for statistic output. I was planing to output a simplified read alignment and gather required information from that file after variants are called. The VCF tag for average base quality can be Qual, and it represent the average quality of four bases, such as (30,20,20,28).