samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here: http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

[mpileup | call] Output ALT <NON_REF> or <*> instead of a dot into gVCF reference blocks #2015

Closed PlatonB closed 11 months ago

PlatonB commented 11 months ago

BCFtools 1.18 mpileup | call writes . in the ALT column of gVCF reference blocks:

chr1    10001   .       T       .       .       .       END=10254;MinDP=0       GT:DP   0/0:0

For GATK CombineGVCFs and possibly other merging tools, <NON_REF> is required.

chr1    1       .       N       <NON_REF>       .       .       END=10002       GT:DP:GQ:MIN_DP:PL      0/0:0:0:0:0,0,0

<*> would also be acceptable, if only because it is easy to replace with <NON_REF> using sed :).

chr1    10001   .       T       <*>     0       .       END=10002       GT:GQ:MIN_DP:PL 0/0:4:1:0,3,29
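The sed-style replacement mentioned above can be sketched in Python. This is a minimal illustration, not part of BCFtools: the function name to_non_ref is hypothetical, and it assumes tab-separated VCF records in which a reference block is recognisable by an END= key in INFO. It only touches records whose ALT is exactly . or <*>; it does not add an ##ALT=<ID=NON_REF,...> header line and does not rewrite <*> at variant sites, both of which a downstream tool may also expect.

```python
def to_non_ref(line: str) -> str:
    """Set ALT to <NON_REF> on gVCF reference-block records (sketch)."""
    if line.startswith("#"):
        return line  # header lines pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return line  # not a complete VCF record; leave it alone
    alt, info = fields[4], fields[7]
    # Assumption: a reference block is any record carrying END= in INFO.
    is_block = any(kv.startswith("END=") for kv in info.split(";"))
    if is_block and alt in (".", "<*>"):
        fields[4] = "<NON_REF>"
    return "\t".join(fields) + newline
```

Applied line by line to an uncompressed gVCF stream, this leaves ordinary variant records untouched because they carry no END= key.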

My feature request is to either implement <NON_REF> output, or provide an argument (associated with --gvcf) to select the value written when ALT is empty.

pd3 commented 11 months ago

A quick workaround is to run call -A to keep the star allele. However, that would also leave all the other ALT alleles untouched. We can consider adding an option to always preserve the <*> allele when no alternate allele was observed.

I am a bit surprised GATK does not support the star allele <*>, as that is the representation recommended by the VCF specification. By the way, bcftools merge can be used to merge gVCFs as well.

PlatonB commented 11 months ago

An additional small note: HaplotypeCaller, FreeBayes and DeepVariant emit MIN_DP, not MinDP. To reduce the scripting users need before creating a cohort gVCF, I suggest fixing this in mpileup | call.
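The kind of user scripting referred to here might look like the following sketch. The function name rename_min_dp is hypothetical, and the code assumes MinDP appears as an INFO key, as in the BCFtools record quoted earlier in this issue, and is declared by a header line starting with ##INFO=<ID=MinDP,. It only renames the tag; it does not move it into FORMAT, where the GATK-style record above carries MIN_DP.

```python
import re

# Match MinDP only as a whole INFO key (at the start or after ';').
_MINDP_KEY = re.compile(r"(^|;)MinDP=")

def rename_min_dp(line: str) -> str:
    """Rename the INFO tag MinDP to MIN_DP in one VCF line (sketch)."""
    if line.startswith("##INFO=<ID=MinDP,"):
        # Keep the header declaration consistent with the renamed tag.
        return line.replace("ID=MinDP,", "ID=MIN_DP,", 1)
    if line.startswith("#"):
        return line  # other header lines pass through unchanged
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return line  # not a complete VCF record
    fields[7] = _MINDP_KEY.sub(r"\g<1>MIN_DP=", fields[7])
    return "\t".join(fields) + newline
```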

jkbonfield commented 11 months ago

As an aside, this feels like something that ought to be considered for hts-specs. MIN_DP turns up in an example in the VCF 4.3 specification (but not 4.4). It's not a documented field, just an example.

VCF permits any tag to be used and doesn't have the notion of a controlled vocabulary or of private and official name spaces (like upper and lower case in SAM tags). However, if we see commonality across a broad range of tools, then those tags ought to be defined in the specification. If not as official tags, then at least as a recommendation for common terms, so people know what to expect and don't reinvent the wheel under another name.

Our MinDP turned up in gvcf.c on 24th Sep 2015. GATK added MIN_DP on 5th Sep 2015. I assume both copied ideas from the original gVCF from Illumina's gvcftools, but I don't see min-dp in either form present there. However, gvcftools' first release was April 2015 and it already had a dedicated gatk_to_gvcf tool (and no mention of bcftools), so it was obviously built with GATK in mind.

pd3 commented 11 months ago

I just added a new option, -*, --keep-unseen-allele, which will preserve the symbolic allele <*> in gVCF blocks. I also renamed MinDP to MIN_DP to stay compatible with other tools.
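For readers who drive the pipeline from a script, here is a rough sketch of how the new option might be used. It assumes a BCFtools build that includes the change, that --keep-unseen-allele is accepted by the call step, and that both mpileup and call are run with --gvcf set to a minimum depth. The helper names gvcf_commands and run_gvcf and the file paths are hypothetical.

```python
import subprocess

def gvcf_commands(ref_fasta, bam, min_dp=0):
    """Build argument lists for a mpileup | call gVCF run (sketch)."""
    mpileup = ["bcftools", "mpileup", "-Ou", "-f", ref_fasta,
               "--gvcf", str(min_dp), bam]
    # Assumption: the option added in this issue belongs to `call`.
    call = ["bcftools", "call", "-m", "--gvcf", str(min_dp),
            "--keep-unseen-allele", "-Ov"]
    return mpileup, call

def run_gvcf(ref_fasta, bam, out_path, min_dp=0):
    """Pipe mpileup into call and write an uncompressed gVCF."""
    mpileup_cmd, call_cmd = gvcf_commands(ref_fasta, bam, min_dp)
    with open(out_path, "wb") as out:
        p1 = subprocess.Popen(mpileup_cmd, stdout=subprocess.PIPE)
        p2 = subprocess.Popen(call_cmd, stdin=p1.stdout, stdout=out)
        p1.stdout.close()  # let mpileup see a closed pipe if call exits early
        rc2 = p2.wait()
        rc1 = p1.wait()
    if rc1 != 0 or rc2 != 0:
        raise RuntimeError(f"bcftools failed: mpileup={rc1}, call={rc2}")
```

With such a build, reference blocks should carry <*> in ALT, which can then be rewritten to <NON_REF> if a downstream merger requires it.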