VCF 4.5 Future of VCF scaling improvements

d-cameron commented 4 months ago

Please review and provide feedback.

Rebased #434 onto VCFv4.5
Made LAA 0-based inclusive of REF
Added FORMAT END instead of #435

Logic behind going with FORMAT END:

Significant overlap between #435 RBS PR and existing <*> functionality
Allowing per-sample <*> via FORMAT END is a minimal change to the specs.
- If there's capabilities in #435 missing in this alternative implementation please raise your concerns here. I might have missed something.
END used instead of RBS for naming convention consistency.
- Happy to change to LEN if someone produces data showing a non-trival difference in file size
##REFERENCE_BLOCK_CHECKPOINT adds a new form of header parsing. We want to keep it simple
INFO END is already used by VCF indexing tools to find overlapping SVs. The checkpoint use case of determining whether a position is missing or REF is handled by an indexed lookup on that position. No additional new complexity required - the index handles that complexity!
- Implementations are free to use checkpoint records (i.e. split up large ref blocks) to ensure no reference block is greater than a given size or even ##REFERENCE_BLOCK_CHECKPOINT style spliting every POS % N bases. The spec gets to stay agnostic on these details.

github-actions[bot] commented 4 months ago

Changed PDFs as of 4ebc5aa0501172031974a2be37a5b6bc69f4bfe0: VCFv4.5.draft (diff).

d-cameron commented 4 months ago

I'd prefer three new numbers (all of which are distinct from the ploidy):

LOCAL-A. One element for each allele present in LA minus one (to exclude the reference). For example: LEC. LOCAL-R. One element for each allele present in LA (including the reference). For example: LAD. LOCAL-G. One element for each possible genotype.

Options:

Leave as .
Have R A G take mean the local allele count if LAA is not missing.
Use lowercase r, a, g for local allele Number=a vs Number=A

@danking thoughts?

(1) is the most backwards compatible, (2) has an elegant simplicity to it, (3) makes local alleles explicit and allows custom fields to be defined.

I personally favour (2) as non-local to local is a lossless transition but the consensus for LAD et al seem to be that having explicitly different fields was better because causeing errors/crashes in non-local-aware VCF consumers was preferable to the possibility of (probably silent) data misinterpretation.

(3) has the advantage of having a generalising principle for all R A G fields that we could encode in the specs: if you have a R/A/G field, the local allele equivalent has an L prefix and a lowercase Number= field. We could then change this PR a standardisation of this convention for all spec-defined fields and a recommendation for non-spec fields to also follow this convention. It would enable tooling to translate arbitrary VCFs into local-allele encoded versions.

github-actions[bot] commented 4 months ago

Changed PDFs as of 099fe1953e149c8c1ab38f9d16266622408c9624: VCFv4.5.draft (diff).

jkbonfield commented 4 months ago

Following on from discussions on the call yesterday, here is the Hail description of SVCR:

https://hail.is/docs/0.2/vds/index.html#the-scalable-variant-call-representation-svcr

Specifically they have this example

Before:    Ref=G Alt=T GT=0/1 AD=5,6 PL=102,0,150
After:     LA=0,2 LGT=0/1 LAD=5,6 LPL=102,0,150

They're using LA instead of LAA as Locale Allele vs Local Alternate Allele. The key difference being it includes REF in the list, and LGT 0/1 is 0th element of LA and 1st element of LA. I believe this is in keeping with the changes you made here vs #434, which I believe most are in agreement with (given the inherent impossibilities of storing zero-length arrays in BCF).

However, as discussed, we have tools already implementing LAA using #434 semantics, and so we need a clear separation so people that get LAA in a file know what the data means. Adopting Hail's "LA" name seems like the ideal approach.

(I know you're already in agreement Daniel, but this is here for others who weren't on the call and as a formal place to note this proposed edit.)

Pinging @pd3 for comment as you're one of the existing implementers and can far better understand the nuances than I.

d-cameron commented 4 months ago

Adopting Hail's "LA" name seems like the ideal approach.

Yes, given that it's now 0-based and we need to explicitly include REF, LA is a more appropriate name than LAA.

danking commented 4 months ago

Numbers

I'd prefer three new numbers (all of which are distinct from the ploidy): LOCAL-A. One element for each allele present in LA minus one (to exclude the reference). For example: LEC. LOCAL-R. One element for each allele present in LA (including the reference). For example: LAD. LOCAL-G. One element for each possible genotype.

Options:

Leave as .

Have R A G take mean the local allele count if LAA is not missing.

Use lowercase r, a, g for local allele Number=a vs Number=A

@danking thoughts?

(1) is the most backwards compatible, (2) has an elegant simplicity to it, (3) makes local alleles explicit and allows custom fields to be defined.

I personally favour (2) as non-local to local is a lossless transition but the consensus for LAD et al seem to be that having explicitly different fields was better because causeing errors/crashes in non-local-aware VCF consumers was preferable to the possibility of (probably silent) data misinterpretation.

(3) has the advantage of having a generalising principle for all R A G fields that we could encode in the specs: if you have a R/A/G field, the local allele equivalent has an L prefix and a lowercase Number= field. We could then change this PR a standardisation of this convention for all spec-defined fields and a recommendation for non-spec fields to also follow this convention. It would enable tooling to translate arbitrary VCFs into local-allele encoded versions.

(1) is fine but feels like a cop out for a standard. I oppose (2) for the reason you mentioned (I prefer errors/crashes to bad data). (3) feels somehow too clever, but I prefer it to (1) and (2). I prefer the verbose "LOCAL-A", etc. to (3), but I would grumpily implement (3) and then never think about it again.

LAA vs LA

Yes, given that it's now 0-based and we need to explicitly include REF, LA is a more appropriate name than LAA.

Agreed.

LEN vs END

I only have anecdotes from Tim Poterba, not hard data, but: we saw ~10% reduction in size of the Hail VDS when we switched from END to LEN. I assume all the benefit comes from feeding Zstd a lower entropy bytestream.

I'd leave LEN vs END as a possible future enhancement. A VCF with local alleles and ref blocks scales linearly in samples with or without LEN.

Etc.

No objections to your other choices. Checkpointing or an END-aware index is key for analysis but not for interchange. I view VCF as primarily an interchange format.

cc'ing @chrisvittal and @patrick-schultz as I'm no longer Broad/Hail affiliated.

FWIW, I think the SVCR paper is a little more detailed than the Hail docs currently. https://www.biorxiv.org/content/10.1101/2024.01.09.574205v1.full.pdf

d-cameron commented 4 months ago

Checkpointing or an END-aware index is key for analysis but not for interchange.

The reason for defining INFO END as I did was to avoid breaking VCF indexing. VCF indexing already uses INFO END to define the end of SV and gVCF records so reusing this means that checkpointing is unnecessary as the information needed for determining maximal record overlap is already contained within the index.

The INFO END, as a single value, will always be misinterpreted by tools written for GVCF.

If we're erring on the side of crashing instead of data corruption then deprecating INFO END for gVCF would be the way to go. Unfortunately, taking that approach will silently break VCF indexing (since it relies on INFO END). There's no good option here as keeping INFO END results in silent gVCF misinterpretation and dropping it results in silent record loss on indexed lookups.

I would grumpily implement (3) and then never think about it again.

How much value is there in being able to define caller-specific local allele fields? (3) generalises local alleles not only for scaling, but it allows lossless merging. What do implementation do with the GL when merging two VCF with a different set of ALT alleles? From what I can tell, the only viable current option is to drop it since the merging tool doesn't have the information/model that the caller has.

we saw ~10% reduction in size of the Hail VDS when we switched from END to LEN

That seems significant enough to prefer LEN. We've already made SVLEN preferable to END for v4.4 SVs so doing it for symbolic star alleles for 4.5 is consistent with that.

On a related note, what is everyone's thoughts about recommending tools writing VCFv4.5 use local allele notation for all VCFs? Useful? Unnecessary?

danking commented 4 months ago

The reason for defining INFO END as I did was to avoid breaking VCF indexing. [...]

If we're erring on the side of crashing instead of data corruption then deprecating INFO END for gVCF would be the way to go. [..]

Mmm. Good points. Defining INFO END as the max seems fine then.

How much value is there in being able to define caller-specific local allele fields? (3) generalises local alleles not only for scaling, but it allows lossless merging. What do implementation do with the GL when merging two VCF with a different set of ALT alleles? From what I can tell, the only viable current option is to drop it since the merging tool doesn't have the information/model that the caller has.

I don't have experience with variant callers so I can't comment on the first question.

Maybe I'm misreading but I think any of the three options for encoding local allele indexed fields facilitates lossless merging.

I agree that (global allele indexed) GL, PL, and AD are hard to sensibly define when merging two VCFs. In particular, I'm unclear on when DP does not equal sum(AD) and the implications thereof when merging two VCF rows with global allele indexed fields.

we saw ~10% reduction in size of the Hail VDS when we switched from END to LEN

That seems significant enough to prefer LEN. We've already made SVLEN preferable to END for v4.4 SVs so doing it for symbolic star alleles for 4.5 is consistent with that.

Would we change the GVCF section as well? So we'd have INFO LEN, FORMAT LEN, and (no longer preferred? deprecated?) INFO END?

On a related note, what is everyone's thoughts about recommending tools writing VCFv4.5 use local allele notation for all VCFs? Useful? Unnecessary?

Soft against. It's more complex and not necessary for plenty of useful cases. For example, seems silly to local allele index fields in a VCF that has been fully split into biallelic variants.

jkbonfield commented 4 months ago

I agree that (global allele indexed) GL, PL, and AD are hard to sensibly define when merging two VCFs. In particular, I'm unclear on when DP does not equal sum(AD) and the implications thereof when merging two VCF rows with global allele indexed fields.

I don't have enough experience either, particularly with multi-sample calling and gVCF, but just as some background I do understand why AD and DP can be very different as I recently fixed over counting of AD in bcftools. Imagine this scenario:

REF   GATCGATATATATGTCGTC                                                       

SEQs  GATCGATA                                                                  
      GATCGATATATA                                                              
      GATCGATATATATG                                                            
      GATCGATATATATGTC                                                          
      GATCGATATATATGTCGTC                                                       
       ATCGA--TATATGTCGTC                                                       
        TCGA--TATATGTCGTC                                                       
          GATATATATGTCGTC                                                       
              TATATGTCGTC                                                       
            ^                                                                   
            2bp deletion

The depth at that point is 8 sequences. It's an AT STR, with 6 of them spanning it so the copy number has been confirmed. Of those we have 4 with 4 copies and 2 with 3 copies, so AD 4,2 and DP 8. The missing two have been aligned against the reference and exhibit "reference bias" as they naively confirm REF but there is no evidence of that confirmation being true and they shouldn't be added to the REF AD score. (There is an option however to distribute the uncontributed alignments to the observered REF/ALT in proportion to their observed frequencies, in order to boost AD, but it's not enabled by default as it can bring other issues and potentially over confidence in results.)

How you marry this together when merging I've no idea. Sorry! :)

Edit: you could make an argument for DP=6, but that's misleading as we do have 8 alignments spanning this variant and if we were attempting to do depth analysis it's wrong to down-play DP this way as that may lead to other incorrect copy-number assessments.

d-cameron commented 3 months ago

Maybe I'm misreading but I think any of the three options for encoding local allele indexed fields facilitates lossless merging.

A VCF with custom R/A/G FORMAT fields cannot be losslessly converted to local allele notation unless there is the ability to translate arbitrary R/A/G fields to an equivalent r/a/g version. (3) achieves this by generalisating the local allele convention to all fields - even those not reserved by the specs.

d-cameron commented 3 months ago

Would we change the GVCF section as well? So we'd have INFO LEN, FORMAT LEN, and (no longer preferred? deprecated?) INFO END?

FORMAT LEN and retain INFO END as a purely computed field to prevent breaking VCF indexing. We already have INFO SVLEN causing recalculation of INFO END so this simplifies the spec as INFO END becomes an entirely computed field that a validator can verify than INFO END=POS + max(FORMAT LEN & INFO SVLEN).

Similar to the SV changes in v4.4, just having INFO END would be deprecated but a 4.5 compliant implementation should infer the value of a single missing FORMAT LEN from INFO END.

pd3 commented 3 months ago

Hi, sorry I am a bit late to this thread, or maybe I was on it too early, depending on how you look at it

The ideas proposed in the original pull request (#434) in July 2019 were implemented in bcftools in May 2020 (https://github.com/samtools/bcftools/commit/e645749ab36dfa7b259d786bed7da0c6dc3a40fd). Currently there are big files out there that follow the original proposal, see for example UK Biobank which generated VCFs with 490,541 samples using DRAGEN. (Note that DRAGEN implementation has some bugs, please contact me directly if you are interested to learn more.)

I have several comments related to the current proposal:

1) The addition of LGT argued in this comment https://github.com/samtools/hts-specs/pull/434#issuecomment-1918141082 is not a good idea. As pointed out by https://github.com/samtools/hts-specs/pull/434#issuecomment-1381761185, it doesn't save any space and just adds the indirection of parsing through LAA.

The motivation given in that thread cites the need to subset samples and remove unused ALT alleles. This would lead to recoding of local alleles for all samples. However, that argument is not quite right:

First, localized alleles were introduced to reduce space taken by the quadratic Number=G format fields, to a lesser degree by the Number=A,R tags. Removing a few extra unused alleles from the ALT column will never be an issue - why not just leave them, or recode only rarely after many alleles were removed??

Second, and more importantly, the LAA indices refer to alleles in the ALT column. Therefore if any change is made to ALT, the LAA tag must be changed as well. Unless I grossly misunderstood the example, using LGT only helps with not having to recode LGT, but LAA itself needs to be recoded anyway. Timewise there will be no measurable benefit, the complexity of the operation will remain at O(N_SAMPLES_KEPT), and LGT adds no benefit.

Third problem, and perhaps yet even more important, all existing programs relying on GT will stop functioning on such VCFs.

I don't find the addition of LGT into the specification sufficiently justified.

2) In the original draft, and in several existing implementations of the draft (bcftools, DRAGEN, maybe others), the LAA tag includes only alternate alleles and does not include the REF allele. This pull request proposes a new tag LA which includes the REF allele. This has already resulted in two competing implementations and their use in large datasets.

I don't understand the motivation for introducing this split. I was told that it was argued offline that it had something to do with the missing allele, but I was not able to understand why. With LAA, a REF-only site can be encoded as e.g.

GT=0/0      LAA=.   LAD=20      LPL=0

Missing genotype can be encoded simply as

GT=./.      LAA=.   LAD=.      LPL=.

What is the benefit of introducing LA other than using more space and adding uninformative "0"?? The localized alleles are meant only to narrow the context of the alleles so that Number=G,R,A tags can be expressed in a more parsimonious way. They were not intended to make statements about sequencing or biological observations made for a sample.

danking commented 3 months ago

Hi @pd3 ! I try to explain the decisions below but see also my final paragraph.

I agree with all your comments on 1 and, in fact, this was our perspective when we first built tools using local alleles and ref blocks. The issue is not of complexity or speed; it’s of practical user experience. IIRC (we should just ask them) gnomAD team had a mistake arise from neglecting to update a GT field when an allele was filtered. They reasonably asked: why is GT special? There’s less cognitive overhead for them as method developers if all the fields use local indices.

AFAIU, this change just allows users who prefer to work with LGT to do so. I suppose the practical concern is that people will start generating LGT containing VCFs so tools will need to support that?

IIRC, on your second issue, folks wanted to distinguish “only the reference allele is local” from “LAA is missing”. I suggested LA as a way to appease this concern.

—

All that said, the only thing I seek (and I think Hail team, with which I’m no longer affiliated, also seeks) is an interchange format that scales linearly in samples. Sparse columns (ref blocks) and sparse allele-indexed arrays (local alleles) achieve this. Whatever details make it into the standard we’ll just deal with on import/export.

danking commented 3 months ago

Minor nit:

This pull request proposes a new tag LA which includes the REF allele. This has already resulted in two competing implementations and their use in large datasets.

Hail always used LA, so it’s not really this PR that created the problem.

jkbonfield commented 3 months ago

Minor nit:

This pull request proposes a new tag LA which includes the REF allele. This has already resulted in two competing implementations and their use in large datasets.

Hail always used LA, so it’s not really this PR that created the problem.

Agreed. We were chatting about this a few days ago.

Basically there are already two competing implementations: LA and LAA. Both have existed for a considerable length of time. If I had to weigh up the usage I'd say it's Hail and it's users (Broad and AllOfUs being the biggest maybe) vs Illumina DRAGEN and Bcftools with BioBank being the biggest user there. There are many more users of each ecosystem too, including probably very large ones I'm not aware of. So making a decision purely on usage doesn't really help, and it needs to be evaluated on complexity and fitting to the data modelling.

My initial thought was that we needed LA because we can't have an empty (non .) LAA in BCF (as was used in an example in the original PR). We concluded that just broke things and we didn't like the disparity between unknown (.) vs absent. However Petr does have a valid point that dot is already used in other scenarios to mean not-called rather than simply unknown.

I do have confusion over LGT vs GT though. If we're using LAA instead of LA, then LGT doesn't make sense because a GT 0/1 would refer to what in LGT? ./0 (0 being 1st alt in LAA)? Would tools even accept that? It's not clear to me how it's meant to work with REF/ALT heterozygous calls if REF is absent. GT I assume is still going on the main REF and ALT columns (ie totally unchanged) so it's simply some slightly larger numbers. That's still a loss in compressability (more so than LAA to LA overhead with a good backend compression codec), however in the overall scheme of themes both are probably an irrelevance for sizing as it's the LPL which is making the huge difference. I also note that the 434 example had LAA:LGT:LAD:LPL& :0/0:30:0 which means no sense at all. An empty LAA means only REF is permitted, but then we have LGT referring to the first (unspecified) ALT allele. I assume the intention was LGT had an explicit 0 for REF, but then it's offset from LAA. Muddled thinking (theirs or mine?).

pd3 commented 2 months ago

I do have confusion over LGT vs GT though. If we're using LAA instead of LA, then LGT doesn't make sense because a GT 0/1 would refer to what in LGT? ./0 (0 being 1st alt in LAA)?

The definition of LGT in the context of LAA is consistent. LAA lists locally relevant alternate alleles, their indexing is 1-based, the same as in GT. Heterozygous genotypes would be expressed as LAA=1 GT=0/1 LGT=0/1.

Note that despite defending its logic here, I am strongly against codifying LGT for the reasons I explained above (https://github.com/samtools/hts-specs/pull/758#issuecomment-2036713925).

jkbonfield commented 2 months ago

I do have confusion over LGT vs GT though. If we're using LAA instead of LA, then LGT doesn't make sense because a GT 0/1 would refer to what in LGT? ./0 (0 being 1st alt in LAA)?

The definition of LGT in the context of LAA is consistent. LAA lists locally relevant alternate alleles, their indexing is 1-based, the same as in GT. Heterozygous genotypes would be expressed as LAA=1 GT=0/1 LGT=0/1.

Agreed on there being little benefit to LGT over GT. It's smaller, but the big win is LPL and everything else is icing on the cake.

Interesting different interpretation to existing GT being 1 based. I'd sort of mentally just pictured a zero based list of REF & ALTs all squished together. It amounts to the same thing, but breaks my mental model when we start mixing LA vs LAA together, and the "local" here means "ref+local" which isn't an obvious first thought. Obvious in hindsight clearly!

a9ill commented 2 months ago

Apologies for being late to join this discussion. I am developer at Illumina working on DRAGEN gVCF Genotyper. Incorporating a description of a local representation into the standard is important and the discussion here is worthwhile.

As has been pointed out (https://github.com/samtools/hts-specs/pull/758#discussion_r1555424235), the proposed text seems to confuse LA (local alleles - including the reference) and LAA (local alternative alleles - excluding the reference) in both the definitions and examples.

My preference is for LAA as it is the most parsimonious, since the “0” in LA does not convey information (as pointed out https://github.com/samtools/hts-specs/pull/758#issuecomment-2036713925). LAA should only be empty for reference homozygous genotypes without other alleles present and should only be missing for missing genotypes. As these can be distinguished through examination of GT, I don't understand the utility of the ability to distinguish between them based on LA alone. Please correct my understanding if I have missed something.

It might be of interest that the recently released UK Biobank WGS data (https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank/news/world-s-largest-genetic-project-opens-the-door-to-new-era-for-treatments-and-cures-uk-biobank-s-major-milestone) was joint called using DRAGEN and LAA was used to specify the local alleles.

For merging and splitting local representations it will be necessary to specify the expected number of elements in local array shaped fields. Using “.” will not achieve this. Using “R”, “A” and “G” and taking them as local if LAA/LA is present would preclude the mixing of local and non-local fields. I agree with (https://github.com/samtools/hts-specs/pull/758#issuecomment-1977003033) that using lower case seems too clever. My preference would be to use “LOCAL-A”, “LOCAL-R” and “LOCAL-G” or more succinctly “LA”, “LR” and “LG”.

danking commented 2 months ago

Regarding the desire to distinguish “missing LAA” from “only the reference is present”, I did not personally raise this concern so I’m trying to interpret others’ concerns. That said, I think the use of two distinct sets of numbers (LA/LOCAL-A/a versus A, etc.) mitigates their concerns.

Consider a VCF which has some rows with local allele indexed items and some without. If there were no special numbers and no special field names, then it’s unclear when PL, for example, is local without deducing that from the alleles and the PL length. However, if PL is always G and LPL is always LG/LOCAL-G/g, then no deduction is needed and there’s no need to signal “this row uses local alleles” via a missing LAA.

danking commented 2 months ago

@pd3, I’d like to make a final nudge on LGT, but after this I’ll be quiet about it.

I agree with your points one and two: LGT is not relevant to VCF size or analysis speed. I also agree with point three: a VCF with only LGT is meaningless to tools only aware of GTs.

My counter-point is about users—not engineering constraints. Mixing GT with LPL and LAD introduces cognitive overhead and, in our practical experience, leads to mistakes.

I argue in favor of allowing LGT. If two scientific groups want to exchange data in this format we should let them and let them deal with the ensuing incompatibilities with legacy tools.

Any group publishing ”final” data for an extremely wide audience (e.g. AoU, UKB) will likely choose to maximize compatibility with existing tools. Groups with other constraints can choose what’s best for them.

pd3 commented 2 months ago

@pd3, I’d like to make a final nudge on LGT, but after this I’ll be quiet about it.

I agree with your points one and two: LGT is not relevant to VCF size or analysis speed. I also agree with point three: a VCF with only LGT is meaningless to tools only aware of GTs. My counter-point is about users—not engineering constraints. Mixing GT with LPL and LAD introduces cognitive overhead and, in our practical experience, leads to mistakes.

I argue in favor of allowing LGT. If two scientific groups want to exchange data in this format we should let them and let them deal with the ensuing incompatibilities with legacy tools.

Any group publishing ”final” data for an extremely wide audience (e.g. AoU, UKB) will likely choose to maximize compatibility with existing tools. Groups with other constraints can choose what’s best for them.

I don't find cognitive overhead a good argument. Few users look manually inside VCFs with hundreds of thousands of samples. Backward compatibility is a real concern, what matters is that programs continue to work with such files. With PLs there is no other way, sites with many indel alleles have no choice but to use LPL, and for most analyses it does not matter if programs skip such sites when they don't find PL, they are relatively rare. But there is no reason to confuse programs with GT/LGT.

Having said that, VCF format is open and nothing stops you from using arbitrary tags. But the use of LGT and LA should not be encouraged by including them in the specification, for reasons given above.

pd3 commented 2 months ago

For merging and splitting local representations it will be necessary to specify the expected number of elements in local array shaped fields. Using “.” will not achieve this. Using “R”, “A” and “G” and taking them as local if LAA/LA is present would preclude the mixing of local and non-local fields. I agree with (#758 (comment)) that using lower case seems too clever. My preference would be to use “LOCAL-A”, “LOCAL-R” and “LOCAL-G” or more succinctly “LA”, “LR” and “LG”.

I like and support this proposal.

@a9ill Unrelated to this, but for the lack of a better channel, there are two bugs in DRAGEN's implementation of local alleles (found in UKB VCFs)

LAA must be Type=Integer, not String
the proposal requires LAA to be ordered in strictly ascending order. There are cases where LAA=5,2; now it is not clear what order are the corresponding LAD and LPL values

danking commented 2 months ago

I don't find cognitive overhead a good argument. Few users look manually inside VCFs with hundreds of thousands of samples. Backward compatibility is a real concern, what matters is that programs continue to work with such files.

I’m glad we’ve arrived at what seems a key difference in perspective. The gnomAD team & AoU team don’t look at textual VCFs but they indeed directly interact with the data model. They aren’t the majority.