samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

VCF 4.4: clarification on CNVs with position 0 #792

Status: Closed (ctsa closed 1 month ago)

ctsa commented 1 month ago

The VCF 4.4 spec states:

"Note that the position of symbolic structural variant alleles is the position of the base immediately preceding the variant."

Does this imply that any kind of CNV detected from the beginning of a chromosome/contig will have position 0, and a REF value of N? If so, is this considered valid or best practice? I see that htslib/bcftools will generate a warning for VCFs using positions less than 1, so it's not clear that this CNV representation is okay.

jkbonfield commented 1 month ago

Also see https://github.com/samtools/htslib/pull/1573 for htslib handling of POS=0. It's somewhat of a low priority, as you can see by the speed of review. Personally I think this is a badly thought-out feature, compounded by BCF and BAI indices which store pos-1, so position 0 becomes negative. That arbitrarily limits the choice of various data types and also breaks the index binning calculations. It sounds like picard doesn't query this data correctly either, so I suspect the POS 0 is a feature that's pretty much unsupported in the wild.
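As a rough illustration of the binning problem, here is a Python transliteration of the `reg2bin` function given in the SAM specification for the BAI binning scheme. It is a sketch, not htslib's actual code path: it assumes the indexer simply passes the stored 0-based value `POS - 1` as `beg`, and it relies on Python's arithmetic right shift for negative integers (in C, right-shifting a negative value is implementation-defined).

```python
def reg2bin(beg: int, end: int) -> int:
    """Bin number for a 0-based, half-open interval [beg, end).

    Transliterated from the reg2bin C function in the SAM specification.
    """
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins start at 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins start at 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins start at 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins start at 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins start at 1
    return 0


# A one-base record at POS=1 is stored as the 0-based interval [0, 1).
bin_pos1 = reg2bin(0, 1)

# Assumption: a one-base record at POS=0 is passed through as [-1, 0).
bin_pos0 = reg2bin(-1, 0)

print(bin_pos1, bin_pos0)
```

Under these assumptions the POS=1 record lands in bin 4681, the first 16 kbp bin, while the POS=0 record lands in bin 4680, which is the last bin of the 128 kbp level and so corresponds to the far end of the coordinate space rather than the start of the contig.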

Not sure what that says about the spec and this question though. Over to you Daniel :)

ctsa commented 1 month ago

Thanks for the additional context James. Regardless of future design decisions, it sounds like in the short term we could special-case these CNVs to start at position 1 to avoid indexing headaches. These are all imprecise variant types anyway so this doesn't meaningfully change the call.
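A minimal sketch of that special case, for illustration only. The function name `symbolic_allele_pos` and the `clamp_to_one` flag are hypothetical and not taken from any library; the only rule encoded is the spec's statement that POS for a symbolic allele is the base immediately preceding the variant.

```python
def symbolic_allele_pos(first_affected_base: int, clamp_to_one: bool = False) -> int:
    """Return the VCF POS for a symbolic SV/CNV allele.

    first_affected_base is the 1-based coordinate of the first base covered
    by the variant. By the letter of the spec, POS is the base before it,
    which is 0 for a variant starting at the first base of a contig.
    With clamp_to_one=True (a hypothetical workaround flag), POS 0 is
    reported as 1 to stay clear of tools that mishandle POS < 1.
    """
    if first_affected_base < 1:
        raise ValueError("first_affected_base must be a 1-based coordinate >= 1")
    pos = first_affected_base - 1
    if clamp_to_one and pos < 1:
        pos = 1
    return pos


# CNV starting at the first base of a contig.
print(symbolic_allele_pos(1))                     # spec-literal POS
print(symbolic_allele_pos(1, clamp_to_one=True))  # workaround POS
```

The clamped form shifts the reported start by one base, which only seems acceptable here because the calls in question are imprecise anyway.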

ctsa commented 1 month ago

Further looking into indexing complications, it appears that IGV isn't rendering the symbolic allele range in a way that's consistent with the spec, but rather interpreting POS as included in the range.
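To make the difference concrete, here is a small sketch of the two interpretations. Both function names are hypothetical, the END value is made up for illustration, and the "POS-inclusive" version only models the behaviour as described above, not IGV's actual code. It assumes END is the 1-based, inclusive last base of the variant.

```python
def affected_range_spec(pos: int, end: int) -> tuple[int, int]:
    """1-based inclusive range per the spec: POS is the base before the variant."""
    return (pos + 1, end)


def affected_range_pos_inclusive(pos: int, end: int) -> tuple[int, int]:
    """1-based inclusive range if a viewer treats POS itself as part of the variant."""
    return (pos, end)


# Hypothetical CNV covering bases 1..5000 of a contig.
spec_literal = (0, 5000)  # POS=0, END=5000, following the letter of the spec
clamped = (1, 5000)       # POS=1, END=5000, the workaround with POS moved to 1

for pos, end in (spec_literal, clamped):
    print(pos, end, affected_range_spec(pos, end), affected_range_pos_inclusive(pos, end))
```

In this toy case the clamped record read POS-inclusively gives the same 1..5000 range as the spec-literal record read per the spec, while the clamped record read per the spec loses the first base.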

d-cameron commented 1 month ago

"I suspect the POS 0 is a feature that's pretty much unsupported in the wild. Not sure what that says about the spec and this question though. Over to you Daniel :)"

POS 0 has been part of the specs since at least VCFv4.1 and there's been no change in this regard: the Section 5.4.5 POS=0 telomeric example was there back in 4.1. The only change in more recent versions has been additional reminder text about the semantics of symbolic SV interpretation.

It's not something that I particularly like, but it's always been part of the specs, and changing it now just penalises libraries and tools that actually do follow the specs, so it's not something that's currently on the agenda.

If VCFv5 ever comes about then it's something I'd like to revisit but that would be as part of a complete design of the specifications to properly support all types of genomic rearrangements.

jkbonfield commented 1 month ago

Agreed it's not something we can remove. However in SAM we've sometimes made recommendations which restrict the specification, in order to avoid problematic places.

Personally I'd at least be tempted to add a footnote acknowledging reality. While POS 0 is legal within the specification, it is highly likely that a lot of tooling will break as historically it's simply not been well supported.

ctsa commented 1 month ago

Thank you both, very helpful. I think I clearly understand now that I'm interpreting the letter of the spec correctly for this type of CNV, but should consider modifications for practical library support. Will close as answered.