Closed by ctsa 1 month ago
Also see https://github.com/samtools/htslib/pull/1573 for htslib handling of POS=0. It's somewhat of a low priority, as you can see from the speed of review. Personally I think this is a badly thought out feature, compounded by BCF and BAI indices which store pos-1, so position 0 becomes negative, arbitrarily limiting the choice of various data types and also breaking the index binning calculations. It sounds like picard doesn't query this data correctly either, so I suspect the POS 0 is a feature that's pretty much unsupported in the wild.
Not sure what that says about the spec and this question though. Over to you Daniel :)
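To make the binning concern concrete, here is a minimal sketch of the `reg2bin` calculation as given in the SAM specification (Section 5.3), ported to Python. It is not htslib's actual code path; the helper `bin_for_vcf_pos` is a hypothetical name introduced only for illustration, and it assumes the index stores a 1-based POS as the 0-based interval [POS-1, POS).

```python
def reg2bin(beg: int, end: int) -> int:
    """Bin number for the 0-based half-open interval [beg, end),
    following the reg2bin pseudocode in the SAM specification."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)
    return 0


def bin_for_vcf_pos(pos: int) -> int:
    """Hypothetical helper: bin of a single-base record at 1-based POS,
    assuming the index stores it as the 0-based interval [pos-1, pos)."""
    return reg2bin(pos - 1, pos)


print(bin_for_vcf_pos(1))  # 4681: the first 16 kb leaf bin
print(bin_for_vcf_pos(0))  # 4680: beg is -1, so the result lands in the level above
```

With POS=0 the stored start is -1, and the arithmetic shift of a negative number yields -1, so the record is assigned bin 4680, which belongs to the 128 kb level rather than to the first 16 kb leaf. A region query near the start of the contig that follows the same scheme would not be expected to visit that bin.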
Thanks for the additional context James. Regardless of future design decisions, it sounds like in the short term we could special-case these CNVs to start at position 1 to avoid indexing headaches. These are all imprecise variant types anyway so this doesn't meaningfully change the call.
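A minimal sketch of that short-term special case, assuming tab-separated VCF data lines and that only records whose ALT begins with a symbolic allele should be touched. The function name `shift_pos0_symbolic` is hypothetical, and this is not how any particular caller or library implements it.

```python
def shift_pos0_symbolic(line: str) -> str:
    """Move a symbolic-allele record from POS=0 to POS=1.

    Assumptions (not taken from any tool): the line is a tab-separated VCF
    data line, and an ALT starting with '<' marks a symbolic allele.
    REF and INFO (END, SVLEN, ...) are left untouched, so a caller that
    needs them to stay strictly consistent would have to adjust them too.
    """
    fields = line.rstrip("\n").split("\t")
    pos, alt = int(fields[1]), fields[4]
    if pos == 0 and alt.startswith("<"):
        fields[1] = "1"
    return "\t".join(fields)


record = "chr1\t0\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=CNV;END=5000"
print(shift_pos0_symbolic(record))
```

Records at any other position, and records with sequence alleles, pass through unchanged.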
Looking further into the indexing complications, it appears that IGV isn't rendering the symbolic allele range in a way that's consistent with the spec; instead it interprets POS as included in the range.
"I suspect the POS 0 is a feature that's pretty much unsupported in the wild. Not sure what that says about the spec and this question though. Over to you Daniel :)"
POS 0 has been part of the specs since at least VCFv4.1 and there's been no change in this regard - the Section 5.4.5 POS=0 telomeric example was there back in 4.1. The only change in more recent versions has been additional reminder text about the semantics of symbolic SV interpretation.
It's not something that I particularly like, but it's always been part of the specs, and changing it now just penalises libraries and tools that actually do follow the specs, so it's not something that's currently on the agenda.
If VCFv5 ever comes about then it's something I'd like to revisit but that would be as part of a complete design of the specifications to properly support all types of genomic rearrangements.
Agreed it's not something we can remove. However in SAM we've sometimes made recommendations which restrict the specification, in order to avoid problematic places.
Personally I'd at least be tempted to add a footnote acknowledging reality. While POS 0 is legal within the specification, it is highly likely that a lot of tooling will break as historically it's simply not been well supported.
Thank you both, very helpful. I think I clearly understand now that I'm interpreting the letter of the spec correctly for this type of CNV, but should consider modifications for practical library support. Will close as answered.
The VCF 4.4 spec states:
"Note that the position of symbolic structural variant alleles is the position of the base immediately preceding the variant."
Does this imply that any kind of CNV detected from the beginning of a chromosome/contig will have position 0, and a REF value of N? If so, is this considered valid or best practice? I see that htslib/bcftools will generate a warning for VCFs using positions less than 1, so it's not clear that this CNV representation is okay.
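For illustration, the arithmetic behind the question can be sketched as follows. The function name `symbolic_cnv_coords` is hypothetical, and the sketch assumes a CNV described by its first and last affected 1-based bases, with POS taken as the base immediately preceding the variant, as in the quoted sentence.

```python
def symbolic_cnv_coords(first_base: int, last_base: int) -> tuple[int, int]:
    """Return (POS, SVLEN) for a symbolic CNV covering the 1-based,
    inclusive range first_base..last_base.

    POS is the base immediately preceding the variant, so a CNV that
    starts at base 1 of the contig ends up with POS = 0.
    """
    pos = first_base - 1
    svlen = last_base - first_base + 1
    return pos, svlen


print(symbolic_cnv_coords(1001, 5000))  # (1000, 4000)
print(symbolic_cnv_coords(1, 5000))     # (0, 5000)
```

The second call is the case being asked about: the variant begins at the first base of the contig, so the preceding-base rule leaves no real reference base for POS to point at.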