mitoNGS / MToolBox

A bioinformatics pipeline to analyze mtDNA from NGS data
http://sourceforge.net/projects/mtoolbox/?source=navbar
GNU General Public License v3.0
90 stars 38 forks source link

vcf-file format #51

Closed sboer closed 6 years ago

sboer commented 6 years ago

Hi!

I get a strange format of my vcf-file, which does not seem to correspond to the correct vcf-format (http://www.internationalgenome.org/wiki/Analysis/vcf4.0/). According to the description of the vcf 4.0 format, all samples should have the same format. In my file, most samples have only the GT-format, while some have the rest specified as well (DP:HF:CILOW:CIUP). In addition, the genotype format seems to be wrong for most of these individuals, as haploid markers should have one allele only, while in my file, most have two (0/1). Others does have only one.

Is this how the vcf-file is supposed to look like, or is it something wrong with my file? I am trying to convert the vcf-file to haps/legend/sample-file to use it as a reference panel, but I get totally wrong output for the heteroplasmic observations, which I guess is because the format is wrong.

This is a subsample of my file (ID's excluded)

fileformat=VCFv4.0

reference=chrRCRS

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

INFO=

INFO=

chrMT 146 . C T . PASS AC=1266;AN=1948 GT:DP:HF:CILOW:CIUP 1:248:1.0:0.982:1.0 1:381:1.0:0.988:1.0 1:403:1.0:0.989:1.0 chrMT 150 . CCCA TCCA,C . PASS AC=199,1;AN=1594 GT:DP:HF:CILOW:CIUP 0 0/1:353:0.983:0.963:0.993 0 chrMT 151 . C T . PASS AC=59;AN=1455 GT:DP:HF:CILOW:CIUP 0 0 0 chrMT 152 . C T . PASS AC=1157;AN=1954 GT:DP:HF:CILOW:CIUP 0/1:215:0.995:0.971:1.0 0 0/1:382:0.997:0.984:1.0 chrMT 153 . A G . PASS AC=17;AN=1416 GT:DP:HF:CILOW:CIUP 0/1:210:0.99:0.964:1.0 0 0

Would be very thankful for any help!

clody23 commented 6 years ago

Hi,

may I ask what is the tool you are trying to use for generating the haps?

I confirm that MToolBox reports the homoplasmy as 1:DP:CILOW:CIUP and the heteroplasmy as 0/1:DP:CILOW:CIUP because we reasoned that mtDNA is polyploid (it has multiple copies of mtDNA) so we chose to report the number of alleles observed over that mtDNA position analysed, i.e. if that position is fully homoplasmic for alelle T then we assign genotype of 1 otherwise 0/1 or 0/1/2 if we see one or more alternative to reference alleles. We agree though that this GT definition deviates from the 'official' v.4.0 VCF file format definition. However, we welcome suggestions from users on how to deal with this polyploidy in the MToolBox VCF files.

In my case, when I need to use MToolBox vcfs for other downstream analyses (e.g. use vcfs for Haplogrep2 haplogroup predictions and so on...) I use a custom python script I uploaded on github to filter the MToolBox vcf and convert the homoplasmic genotypes to 1/1 and split variant positions with multiple alleles into one alternative allele per row (e.g. a position with 0/1/2 will be converted into 2 consecutive rows with 0/1 and 0/1 genotypes, respectively).

I am not sure this can be a solution for your analysis but you could try to run this auxiliary script: https://github.com/mitoNGS/MToolBox/blob/master/aux/filter_HF.py

as in this example:

python filter_HF.py HG00119 ../test/HG00119_example/HG00119/HG00119.vcf 0.5 50 vcf  HG00119_filt.vcf Yes

and get a filtered VCF file with official GT https://github.com/mitoNGS/MToolBox/blob/master/test/HG00119_filt.vcf

Hope this helps

sboer commented 6 years ago

Hi!

Thank you so much for your answer!

I am using bcftools (convert --haplegendsample) to convert to haps/legend/sample.

We did eventually manage to make a script that converted the polyploid genotypes to haploid genotypes, with the possibility to later split multiallelic variants into several biallelic ones. After this conversion, bcftools ran without problems.

I guess that the need for heteroplasmic calls varies according to the use of the data, but I am very impressed that it is actually possible. In my case, using the data as a reference panel, it was a little pain to convert it to "clean" haploid calls, but we managed eventually.

Maybe it could be an option to choose polyploid calls or haploid calls in the vcf-file? Or link to your script at the MToolBox webpage?

Thanks again!

clody23 commented 6 years ago

Oh, ok...sorry for the late reply then.

Yes, maybe we could think about a further enhancement of the pipeline where we integrate the filter_HF script into the pipeline to give the user the option of getting "diploid" genotypes (0/1 and 1/1) but keeping HF as VCF annotation. I think we need some thinking around this issue but thanks for raising it and for using the tool.

I will close this issue now and open a new enhancement to keep a note for us.

best, Claudia