samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here: http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/

norm remove duplicates doesn't handle SVLEN, removes non-duplicate symbolic variants #2182

Closed davmlaw closed 4 months ago

davmlaw commented 4 months ago

The following VCF contains 3 deletions of length 1kb, 2kb and 3kb:

##fileformat=VCFv4.1
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##contig=<ID=NC_000012.11,length=141213431>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
NC_000012.11    88520131    23651   C   <DEL>   .   .   SVLEN=-1000;SVTYPE=DEL
NC_000012.11    88520131    24042   C   <DEL>   .   .   SVLEN=-2000;SVTYPE=DEL
NC_000012.11    88520131    24043   C   <DEL>   .   .   SVLEN=-3000;SVTYPE=DEL

Running the command below (even with "exact") removes records that share the same CHROM/POS/REF/ALT, even though their SVLEN values differ and they are therefore separate variants:

bcftools norm --remove-duplicates --rm-dup=exact symbolic_uniq.vcf
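
To make the collapse concrete, here is a minimal Python sketch of duplicate removal keyed only on CHROM/POS/REF/ALT, compared with a key that also includes SVLEN. It is an illustration under stated assumptions, not bcftools' actual implementation: the `dedup` helper and the keep-the-first-record policy are hypothetical.

```python
# Illustration only: mimics keying on CHROM/POS/REF/ALT; NOT bcftools' code.
# Each record is (chrom, pos, id, ref, alt, svlen), taken from the VCF above.
RECORDS = [
    ("NC_000012.11", 88520131, "23651", "C", "<DEL>", -1000),
    ("NC_000012.11", 88520131, "24042", "C", "<DEL>", -2000),
    ("NC_000012.11", 88520131, "24043", "C", "<DEL>", -3000),
]


def dedup(records, use_svlen):
    """Keep the first record seen for each key (hypothetical helper).

    Keeping the first occurrence is an assumption made for this sketch.
    """
    seen = set()
    kept_ids = []
    for chrom, pos, vid, ref, alt, svlen in records:
        key = (chrom, pos, ref, alt)
        if use_svlen:
            # Extend the key so deletions of different length stay distinct.
            key = key + (svlen,)
        if key not in seen:
            seen.add(key)
            kept_ids.append(vid)
    return kept_ids


print(dedup(RECORDS, use_svlen=False))  # ['23651']
print(dedup(RECORDS, use_svlen=True))   # ['23651', '24042', '24043']
```

With the four-field key only one of the three deletions survives; adding SVLEN to the key keeps all three.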

If this is difficult to fix, it would be good to at least raise a warning, as the current behavior is silent data loss. Thanks
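
A warning of this kind could be approximated outside bcftools with a pre-check run before `norm`. The sketch below is a rough illustration, not part of bcftools: `find_at_risk` is a hypothetical helper, and it splits on any whitespace so that it also works on VCF text pasted with spaces instead of tabs.

```python
from collections import defaultdict


def find_at_risk(vcf_lines):
    """Find symbolic-ALT records that share CHROM/POS/REF/ALT but differ in SVLEN.

    Hypothetical helper. Returns {(chrom, pos, ref, alt): [SVLEN strings]}.
    """
    svlens = defaultdict(set)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        # Assumption: no field contains whitespace, so a plain split is safe.
        fields = line.split()
        chrom, pos, _vid, ref, alt = fields[:5]
        info = fields[7] if len(fields) > 7 else ""
        if not alt.startswith("<"):
            continue  # only symbolic alleles such as <DEL> are of interest
        svlen = None
        for item in info.split(";"):
            if item.startswith("SVLEN="):
                svlen = item[len("SVLEN="):]
        svlens[(chrom, pos, ref, alt)].add(svlen)
    # Report only keys that carry more than one distinct SVLEN value.
    return {k: sorted(v, key=str) for k, v in svlens.items() if len(v) > 1}


EXAMPLE = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_000012.11\t88520131\t23651\tC\t<DEL>\t.\t.\tSVLEN=-1000;SVTYPE=DEL",
    "NC_000012.11\t88520131\t24042\tC\t<DEL>\t.\t.\tSVLEN=-2000;SVTYPE=DEL",
    "NC_000012.11\t88520131\t24043\tC\t<DEL>\t.\t.\tSVLEN=-3000;SVTYPE=DEL",
]

for key, lengths in find_at_risk(EXAMPLE).items():
    print("WARNING: same CHROM/POS/REF/ALT, different SVLEN:", key, lengths)
```

Any key reported here identifies records that would be treated as duplicates if SVLEN is ignored.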

bcftools --version
bcftools 1.20
Using htslib 1.20

pd3 commented 4 months ago

This is now supported.