samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
276 stars 244 forks source link

VariantContext has inconsistent behavior on getting allele type with spanning deletions #1686

Open rickymagner opened 9 months ago

rickymagner commented 9 months ago

Description of the issue:

For "multiallelic" sites containing a spanning deletion alleles (represented by *), the getType method in VariantContext behaves somewhat unexpectedly. There have been numerous old discussions surrounding spanning deletion handling in htsjdk, like here and here, but this particular issue is a little more focused. Regardless of what might be the "best" way to handle these special sites, the current situation is quite confusing.

If the * allele is next to a SNP or a deletion, the site is considered an INDEL. But if it's next to an insertion, it's not considered an INDEL. See below for concrete examples.

Your environment:

Steps to reproduce

Set up the following files for a complete example.

example.fasta:

>chr1
AAAAAAAAAA

Run samtools faidx example.fasta and samtools dict example.fasta > example.dict. Then make the following vcf as example.vcf:

##fileformat=VCFv4.3
##contig=<ID=chr1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample
chr1    3   variant1    A   AAA,*   30  .   .   GT  1/2
chr1    4   variant2    A   C,* 30  .   .   GT  1/2
chr1    5   variant3    AAA A,* 30  .   .   GT  1/2

Run bcftools view example.vcf -o example.vcf.gz to compress it and bcftools index -t example.vcf.gz. Now it's ready to run in GATK to see what happens. Run gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-snp.vcf.gz -select 'vc.isSNP()' and gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-indel.vcf.gz -select 'vc.isIndel()'

As expected, gatk-snp.vcf.gz looks like (with bcftools view -H):

chr1    4   variant2    A   C,* 30  .   .   GT  1/2

but gatk-indel.vcf.gz looks like:

chr1    5   variant3    AAA A,* 30  .   .   GT  1/2

In other words, the site with <INS>,* is excluded while <DEL>,* is included when looking for INDEL sites.

Expected behaviour

I don't know what's the best way once and for all to classify VariantContexts with spanning deletions, but to me this seems like a "bug" either way. Whatever is decided in classifying the <INS>,* type should equally apply to <DEL>,*. For what it's worth, my instinct is to expect these sites to both be classified as "Indel" in this way since the spanning deletion is a deletion anyway. This is consistent with the behavior that SelectVariants will also pull out a multiallelic DEL+INS site as "Indel."

Actual behaviour

See above.