samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
648 stars 174 forks source link

INFO/END should not be deprecated #784

Open pd3 opened 2 months ago

pd3 commented 2 months ago

It appears that it was decided to deprecate INFO/END, without any public discussion (https://github.com/samtools/hts-specs/commit/97b26a1da5b8f511dba70dfcec953d17813e54f0)

Although I see the advantage of using SVLEN, as it is per-allele, rather than per site information, I argue that there is a value in keeping a INFO/END as a per site information https://github.com/samtools/hts-specs/issues/769#issuecomment-2247510552.

The deprecation seems to be motivated purely by the redundancy brought by SVLEN. However, END is also used in gVCF.

Therefore, while I understand the need for SVLEN, I wonder what is the advantage of deprecating END? All software that handles indexing, VCF record overlaps, all gVCF-aware programs have to be reviewed and possibly adjusted. That's a major effort requiring coordination of many people. Is this all for a few SV-producing programs to avoid writing one extra INFO tag, or am I missing something?

ohofmann commented 2 months ago

I am not sure I agree with the 'without public discussion' part; we had two release candidates of 4.5 out for several months and requested comments; notes went out via the GA4GH newsletter, Twitter/X updates from the spec maintainers, etc. Always happy to find ways to improve the RFC phase and would appreciate any suggestions, but it continues to be difficult to get feedback to changes until after the actual release.

This just as an aside, I do not want to take away from your core argument regarding the change.

d-cameron commented 2 months ago

The deprecation seems to be motivated purely by the redundancy brought by SVLEN.

Quite the opposite. END was deprecated in favour of FORMAT LEN as END is incompatible with per-sample gVCF (<*>) reference blocks. This change was included as part of the scalability improvements in for VCFv4.5. Once you scale to many samples in a VCF, forcing every sample to reset to a new reference block for every variant called in any samples results in significant file size overhead compared to local allele encoding and per-sample reference blocks.

I strived to maintain maximum backwards compatibilty when depreciating END and the only issue with just including END the misinterpretation of END when per-sample gVCF blocks are written. Happy to errata the VCFv4.5 specs to recommend END be included in the header & records iff there are no per-sample gVCF blocks in the VCF. There's some risk of misinterpretation by a pre-4.5 gVCF-aware parser but I would expect them to fail with a sematic parsing error when encountering a <*> record with no END defined.

In addition to the two rounds of public feedback, there was discussion of this change in the Future of VCF working group (where all the scalability changes were discussed and decided on) but it's difficult to keep everyone impacted informed of all relevant changes.

I think it would be a good idea to have ADRs for any spec change to any of the hts-specs documents as, although the actual changes are visible, the reasoning behind making any given change (e.g. why FORMAT LEN was chosen over FORMAT END) can be opaque or lost to time.

pd3 commented 2 months ago

I am not sure I agree with the 'without public discussion' part; we had two release candidates of 4.5 out for several months and requested comments; notes went out via the GA4GH newsletter, Twitter/X updates from the spec maintainers, etc. Always happy to find ways to improve the RFC phase and would appreciate any suggestions, but it continues to be difficult to get feedback to changes until after the actual release.

Thank you @ohofmann and to @jkbonfield for pointing out this was discussed in other channels. I commented on the pull request https://github.com/samtools/hts-specs/pull/758 on a different topic. Now I see that deprecation of INFO/END was mentioned in the thread there, but as it was not highlighted as one of the main points in the pull request message, I was not aware of it.

I think the appropriate place for discussion would be github discussions, I am afraid many people, me included, don't have the capacity to attend GA4GH calls.

pd3 commented 2 months ago

Quite the opposite. END was deprecated in favour of FORMAT LEN as END is incompatible with per-sample gVCF (<*>) reference blocks. This change was included as part of the scalability improvements in for VCFv4.5. Once you scale to many samples in a VCF, forcing every sample to reset to a new reference block for every variant called in any samples results in significant file size overhead compared to local allele encoding and per-sample reference blocks.

I strived to maintain maximum backwards compatibilty when depreciating END and the only issue with just including END the misinterpretation of END when per-sample gVCF blocks are written. Happy to errata the VCFv4.5 specs to recommend END be included in the header & records iff there are no per-sample gVCF blocks in the VCF. There's some risk of misinterpretation by a pre-4.5 gVCF-aware parser but I would expect them to fail with a sematic parsing error when encountering a <*> record with no END defined.

I understand how big sample sizes result in the fragmentation of REF blocks. I don't understand how do you propose the indexers should operate. Parse all FORMAT fields to figure out the maximum block size, as suggested by https://github.com/samtools/hts-specs/pull/758#issuecomment-1991618580? That would be terribly inefficient. Or give up on indexing REF-only blocks?

jkbonfield commented 2 months ago

There were attempts to raise awareness of VCF 4.5 release candidates on github, in a GA4GH email circular IIRC, and also via twitter (eg https://twitter.com/camerongenomics/status/1815619781199892971, including INFO/END, reposted numerous times including myself, John and Oliver). Understood that not everyone has time to attend the conference calls or carefully read every PR, but for a new release it may be prudent to catch up on the change summary as it's very hard to undo a change once it goes live.

Agreed that the GA4GH community could improve too. In light of that, do you have any ideas on how to better reach interested parties such as yourself? This has been a long standing issue. I faced the same problem with the SAM base modification tags. It was like pulling teeth getting any community review, until after a couple of years I gave up and published what I had. Then all the problems started to flood in, as no one wants to do the proof-reading stage. It's most demoralising as a maintainer!

Regardless, I think this is maybe also a language / phrasing issue. My understanding from speaking with Daniel is that while INFO/END is deprecated as a means of computing the END of a variant, in favour of sample-specific methods given INFO/END is quite simply inaccurate in any file with more than one sample present, it isn't deprecated as a means to building an index where it is recommended to be the largest of all sample ENDs.

We have a conference call this evening at UK 9pm. It is likely Daniel will be there, so it would be far easier to discuss your concerns via a call. You (and others) would be most welcome. Details are on the file formats mailing list, or contact Reggan. (https://www.ga4gh.org/document/meeting-minutes-file-formats-2023/)

jkbonfield commented 2 months ago

I think it would be a good idea to have ADRs for any spec change to any of the hts-specs documents as, although the actual changes are visible, the reasoning behind making any given change (e.g. why FORMAT LEN was chosen over FORMAT END) can be opaque or lost to time.

I don't understand what the ADR acronym is, but I think I get the gist. Our minutes are sadly rather poor. Generally there's no transcript at all of what's said, and mostly the recording isn't available either. This is something we ought to improve on, but it takes time and it's not something I can do while also chairing the meeting without substantially slowing things down.

However the minute details probably don't matter. A rationale for a change does, and it can be seen as sifting through all the chatter (which is likely spread over many conference calls) and distilling it to a few key points listing the pros and cons of every change, the initial impetus for change, etc. Mostly they'll be trivial too ("fixed typo", or "clarified wording").

Any suggestions where they should be placed though? In the document itself as an appendix (this could be large). In a separate document in this repository? Somewhere else? (Catch-22: this question is best answered by the people who aren't reading it!)

daviesrob commented 2 months ago

ADR is Architectural Decision Record. See for example this one for the sequence collections specification.

ohofmann commented 2 months ago

I've been reading up on https://adr.github.io/ after this morning's discussion.

jkbonfield commented 2 months ago

Thanks. I dislike hiding it away in a completely different location though. After our conference call last night I knew there was something used by refget yet I still couldn't find it - even knowing it existed!

It needs to be discoverable. That means a link to it in the Change Log section at the end of the primary TeX document, and ideally also discoverable while simply browsing through github. Perhaps just sibling filenames so they naturally appear together, eg VCF4.5.tex and VCF4.5.ADR.md?

pd3 commented 2 months ago

There were attempts to raise awareness of VCF 4.5 release candidates on github, in a GA4GH email circular IIRC, and also via twitter (eg https://twitter.com/camerongenomics/status/1815619781199892971, including INFO/END, reposted numerous times including myself, John and Oliver). Understood that not everyone has time to attend the conference calls or carefully read every PR, but for a new release it may be prudent to catch up on the change summary as it's very hard to undo a change once it goes live.

Agreed that the GA4GH community could improve too. In light of that, do you have any ideas on how to better reach interested parties such as yourself? This has been a long standing issue. I faced the same problem with the SAM base modification tags. It was like pulling teeth getting any community review, until after a couple of years I gave up and published what I had. Then all the problems started to flood in, as no one wants to do the proof-reading stage. It's most demoralising as a maintainer!

I don't have a good answer for that, but I don't think Twitter does it. It's been my long held belief that there should be only one online channel that interested parties should monitor, otherwise information gets lost. Github issues and discussions seem like the most appropriate place to me. (By the way, the twitter post you show above has no mention of INFO/END, so that would not helped. Also note that I myself was commenting in that very thread #758, but completely missed the idea of deprecating the tag, because it was not listed as one of the major changes. Regardless, the topic about communication should be better moved into a separate discussion/issue, otherwise this thread becomes unreadable.)

Returning back to INFO/END, when you say it is "inaccurate", we first must understand its semantics. INFO/END is not as much about the length of the biological variant, but about the VCF record and its span. The tag should be viewed as a purely technical instrument for the indexer to understand which records overlap with any given chr:beg-end region. This has nothing to do with alleles in individual samples. I did not raise this as a problem because yes, in agreement with #758 it makes perfect sense to keep per-allele SVLEN, per-sample FORMAT/LEN, etc. But to me it makes no sense to deprecate INFO/END and compute it from FORMAT fields on the fly while indexing. That would be terribly inefficient. And if the idea is to stop indexing such records, then again, that would be such a major change that it must be made very clear and obvious, in github discussions and in the specification.

jkbonfield commented 2 months ago

As I said above, I think some of this is just miscommunication. The current specification states:

"\item END: Deprecated. Retained for backwards compatibility with earlier versions of VCF and older VCF indexing software which rely on this field being present."

That doesn't imply it's not included, but that it's no longer the primary source of information for VCF4.5 and is present purely for the purposes of backwards compatibility with older software. Granted I'm not sure the wording is ideal, but please suggest alternatives.

Whether that is wise or not is another issue, but fear over this breaking existing code is unfounded.

pd3 commented 2 months ago

The very wording "is present for the purposes of backwards compatibility with older software" is what I am having a problem with and what this issue is about.

My question stands: if you write a new indexer or a new code that checks record overlaps, are you expected to parse the entire FORMAT section to learn what previously was present in INFO/END, but does not have to be present any longer, according to the new specification?

jkbonfield commented 2 months ago

I think the language is poor and needs clarification. My understanding is "does not have to be present" isn't really true, or at least isn't advisable. I think it's basically trying to say the information isn't the right solution now to identify the size of the variant, but should still exist for purposes of indexing.

What's missing is the rationalisation and explanatory text detailing what authors of new tools ought to be doing. Is it advisable for anyone new coming to this to still use INFO/END, or should we be recommended them to deduce it via other means? I don't know the answer to this one. I would say though that there have always been things you need to do when dealing with multi-sample data. Eg if you do a range query and a sample subset, then INFO/END may not be enough as it's the superset of all samples and not necessarily the one you're querying.

I guess this fits in with the generally accepted API/library of the word "deprecated", where it implies this function is no longer the current recommended one and it's been replaced with something else (possibly but not necessarily also indicating it'll be removed at some point, but for us that'd have to be a VCF 5).

pd3 commented 2 months ago

That is correct in that the indexes we have so far are purely site-oriented, not sample-oriented. If the indexing is extended in the future to support individual samples, then parsing FORMAT fields may become possible.

But until that happens, I strongly disagree that INFO/END should or could be deprecated. I'd suggest to word it something like this:

The INFO/END value informs about the length of the reference sequence affected by the VCF record. If SVLEN and FORMAT/LEN is present, it is expected to be derived from INFO/END = POS + max(FORMAT LEN & INFO SVLEN). The tag INFO/END should be present to allow efficient indexing and record overlap checking.