For "multiallelic" sites containing a spanning deletion alleles (represented by *), the getType method in VariantContext behaves somewhat unexpectedly. There have been numerous old discussions surrounding spanning deletion handling in htsjdk, like here and here, but this particular issue is a little more focused. Regardless of what might be the "best" way to handle these special sites, the current situation is quite confusing.
If the * allele is next to a SNP or a deletion, the site is considered an INDEL. But if it's next to an insertion, it's not considered an INDEL. See below for concrete examples.
Your environment:
version of htsjdk: 3.0.1
version of java: Temurin-17.0.7+7
which OS: MacOS Ventura 13.6
Steps to reproduce
Set up the following files for a complete example.
example.fasta:
>chr1
AAAAAAAAAA
Run samtools faidx example.fasta and samtools dict example.fasta > example.dict. Then make the following vcf as example.vcf:
##fileformat=VCFv4.3
##contig=<ID=chr1,length=10>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
chr1 3 variant1 A AAA,* 30 . . GT 1/2
chr1 4 variant2 A C,* 30 . . GT 1/2
chr1 5 variant3 AAA A,* 30 . . GT 1/2
Run bcftools view example.vcf -o example.vcf.gz to compress it and bcftools index -t example.vcf.gz. Now it's ready to run in GATK to see what happens. Run
gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-snp.vcf.gz -select 'vc.isSNP()'
and
gatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-indel.vcf.gz -select 'vc.isIndel()'
As expected, gatk-snp.vcf.gz looks like (with bcftools view -H):
chr1 4 variant2 A C,* 30 . . GT 1/2
but gatk-indel.vcf.gz looks like:
chr1 5 variant3 AAA A,* 30 . . GT 1/2
In other words, the site with <INS>,* is excluded while <DEL>,* is included when looking for INDEL sites.
Expected behaviour
I don't know what's the best way once and for all to classify VariantContexts with spanning deletions, but to me this seems like a "bug" either way. Whatever is decided in classifying the <INS>,* type should equally apply to <DEL>,*. For what it's worth, my instinct is to expect these sites to both be classified as "Indel" in this way since the spanning deletion is a deletion anyway. This is consistent with the behavior that SelectVariants will also pull out a multiallelic DEL+INS site as "Indel."
Description of the issue:
For "multiallelic" sites containing a spanning deletion alleles (represented by
*
), thegetType
method inVariantContext
behaves somewhat unexpectedly. There have been numerous old discussions surrounding spanning deletion handling in htsjdk, like here and here, but this particular issue is a little more focused. Regardless of what might be the "best" way to handle these special sites, the current situation is quite confusing.If the
*
allele is next to a SNP or a deletion, the site is considered an INDEL. But if it's next to an insertion, it's not considered an INDEL. See below for concrete examples.Your environment:
Steps to reproduce
Set up the following files for a complete example.
example.fasta:
Run
samtools faidx example.fasta
andsamtools dict example.fasta > example.dict
. Then make the following vcf as example.vcf:Run
bcftools view example.vcf -o example.vcf.gz
to compress it andbcftools index -t example.vcf.gz
. Now it's ready to run in GATK to see what happens. Rungatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-snp.vcf.gz -select 'vc.isSNP()'
andgatk SelectVariants -V example.vcf.gz -R example.fasta -O gatk-indel.vcf.gz -select 'vc.isIndel()'
As expected,
gatk-snp.vcf.gz
looks like (withbcftools view -H
):but
gatk-indel.vcf.gz
looks like:In other words, the site with
<INS>,*
is excluded while<DEL>,*
is included when looking for INDEL sites.Expected behaviour
I don't know what's the best way once and for all to classify VariantContexts with spanning deletions, but to me this seems like a "bug" either way. Whatever is decided in classifying the
<INS>,*
type should equally apply to<DEL>,*
. For what it's worth, my instinct is to expect these sites to both be classified as "Indel" in this way since the spanning deletion is a deletion anyway. This is consistent with the behavior that SelectVariants will also pull out a multiallelic DEL+INS site as "Indel."Actual behaviour
See above.