samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

More than 65535 cigar opcodes #40

Closed jkbonfield closed 6 years ago

jkbonfield commented 10 years ago

There was discussion on the samtools-dev mailing list about this last year:

http://sourceforge.net/p/samtools/mailman/message/30672431/

The main crux of the discussion there was to reuse the 16bit bin field to act as the top 16 bits of ncigar, possibly using the top flag bit as an indicator.
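
As a purely illustrative sketch of that bin-reuse idea (the flag bit name BAM_FLONG_CIGAR and the exact packing are assumptions, not anything agreed here), a reader could reconstruct a 32-bit operation count like this:

```c
#include <stdint.h>

/* Hypothetical flag bit saying "bin holds the top 16 bits of n_cigar";
 * not part of the current spec. */
#define BAM_FLONG_CIGAR 0x8000

/* Rebuild a 32-bit cigar-op count from the 16-bit on-disk fields. */
static inline uint32_t effective_n_cigar(uint16_t flag, uint16_t bin,
                                         uint16_t n_cigar16)
{
    if (flag & BAM_FLONG_CIGAR)
        return ((uint32_t)bin << 16) | n_cigar16;  /* bin = high 16 bits */
    return n_cigar16;                              /* ordinary record */
}
```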

There are some other discussions (internal) regarding this, including possibly removing bin completely given that it has no meaning with CSI indices, so this issue is largely just a note to track the issue and collate ideas/fixes.

Note that the problem is definitely a real one. I have hit this with real projects, caused when a user merged consensus sequences from an assembly into a BAM file, but it is also not too far away for actual sequence reads. Newer technologies (PacBio, ONT, maybe more) offer substantially longer read lengths but also higher indel rates, leading to substantially more cigar opcodes. A 320kb read with a 10% indel rate would lead to 2 changes (D/I to M) every 10 bases, giving 64k cigar ops. (Those aren't figures for real technologies, but it's not inconceivable.)

colinhercus commented 7 years ago

I would feel comfortable with a change to the BAM magic string.

It would mean adding an option/flag to the htslib API to open/write BAMs with long CIGAR support, and then a command-line arg in programs like samtools view so they could open the required format.

This means files won't be recognised by programs not expecting the new format.

On 10 July 2017 at 22:37, Heng Li notifications@github.com wrote:

Setting n_cigar==65535 is the best option to crash htslib so far, but it is not good enough. For functionality that only looks at the cigar, htslib won't simply crash. It will read into SEQ and QUAL and return absurd alignment lengths. What happens next depends on the downstream code. You might get weird results without seeing a segfault. We wouldn't want this to happen.

Regardless of what the spec says, we are dealing with the combination of 0x4 and cigar=="*" every day: an unmapped mate is flagged with 0x4, may be on the reverse strand and has a coordinate but no cigar. This is almost the same as the use of 0x4 in my proposal except for the pairing flag. Most tools that work with unmapped mates will have the desired behavior in my proposal.


tfenne commented 7 years ago

One other option (though this runs counter to @yfarjoun's preference for not having old code see partial data) would be, as I suggested above, to put the first 65534 cigar elements in the cigar field in BAM, but then for the last element sum the read length of the remaining elements and insert that as a soft-clip, placing the remaining elements in a tag.

This would allow older code to see the read as one with a lot of alignment plus some additional soft-clipping. Newer code would just look for the new tag and, if present, use its value to replace the soft-clip in the 65535th element of the cigar field.
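
A minimal sketch of that truncation step, using htslib's cigar macros (bit 0x1 of bam_cigar_type() means the operation consumes query bases); the function name clipped_query_len and the 65534 cutoff are just illustrative:

```c
#include <stdint.h>
#include <htslib/sam.h>

/* Number of query bases covered by cigar ops [from, n_ops): this is the
 * length of the soft-clip that would stand in for the trimmed-off ops. */
static int64_t clipped_query_len(const uint32_t *cigar, uint32_t n_ops,
                                 uint32_t from)
{
    int64_t len = 0;
    for (uint32_t i = from; i < n_ops; i++)
        if (bam_cigar_type(bam_cigar_op(cigar[i])) & 1)  /* consumes query */
            len += bam_cigar_oplen(cigar[i]);
    return len;
}
```

The writer would then emit the first 65534 ops plus one soft-clip of clipped_query_len(cigar, n_ops, 65534) bases, and store the complete cigar in the auxiliary tag.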

lh3 commented 7 years ago

With partial data, you only retain the first half of each read, which leads to a bias. Biased data is worse than no data in my view. In addition, truncating long cigars complicates coding. The forward-compatibility issue caused by BAM-only hacks is tiny and only temporary. At present, very few groups are working with long cigars; in future, when long cigars become more common, users will naturally migrate to a more recent htslib that won't have the problems we are discussing here. I don't think it is worth coding complex solutions or changing the BAM magic for a short-term issue that affects almost no one in practice. Those who may be affected (e.g. @nickloman, @mattloose and myself) are experienced enough to take care of themselves during the old-to-new version transition.

sjackman commented 7 years ago

3. Encode in such a way that old tools might fail if they encounter a BAM file containing reads with CIGARs of > 65535 elements

I prefer option 3 > 2 > 1.

What's the smallest invalid binary CIGAR value that causes samtools mpileup to exit with an error? I was hoping length 0 or *, but that doesn't seem to be the case. I'm thinking something like 1N. A possible solution to implement option 3 is to set CIGAR to this invalid value and encode the CIGAR string in the CG optional field.

lh3 commented 7 years ago

The logic behind pileup is complex. I can't immediately recall what samtools mpileup will do with 1N. It probably works.

In addition, it is not only mpileup that you need to crash. You need to think about other tools in samtools and other libraries (e.g. htsjdk, bamtools, sambamba, biobambam, and Go and Rust libraries). It would be frustrating if some of these tools succeed (flagstat, sorting and merging will work fine no matter how much you mess up with cigar), some consistently fail, some randomly fail and some give weird results (a symptom of bad memory access).

sjackman commented 7 years ago

It would be frustrating if some of these tools succeed (flagstat, sorting and merging will work fine no matter how much you mess up with cigar)

If the tool doesn't need to inspect the CIGAR string at all, like flagstat, and it produces correct results, then I'm fine with it succeeding. That's great in fact.

In addition, it is not only mpileup that you need to crash. You need to think about other tools in samtools and other libraries

Let's see if we can get the most common tools to error out, and let's start with htslib.

Some experiments:

Setting CIGAR to * treats it as unmapped (as mentioned earlier), and produces a warning message when reading SAM, but gives no corresponding warning message when reading a 0-length BAM CIGAR:

[W::sam_parse1] mapped query must have a CIGAR; treated as unmapped

Setting the CIGAR string to a different non-zero length than the sequence string produces this error message when parsing SAM:

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 15
[main_samview] truncated file.

It stops processing the SAM file at the invalid record, produces a truncated BAM file when doing samtools view -b, and exits with status 1.

I'd like to test what samtools does when reading such an invalid BAM file, where the CIGAR length doesn't match the sequence length. If it also exits with an error, setting the CIGAR string to an invalid non-zero length may be an option.

lh3 commented 7 years ago

The SAM parser is irrelevant. What matters is the BAM reader and all the downstream code. Indexing, for example, will probably silently produce a wrong index if you give it a fake cigar. Bedcov may have a similar issue. View will sometimes work, sometimes segfault and sometimes give you wrong results.

You also need to check htsjdk, bamtools and others, as well as older tools from a couple of years ago. Hacking the cigar is tricky because you are depending on how tools react to undefined behaviors. My hack is nearly the same as the treatment of an unmapped mate. We can almost predict how other tools will deal with it without reading their code or doing experiments.

sjackman commented 7 years ago

My concern with setting the unmapped flag remains: analyses will silently ignore just some alignments (those with long CIGAR strings) and produce seemingly correct but biased results.

lh3 commented 7 years ago

@sjackman 1) no one has a better solution. 2) your concern is theoretical. In reality, the few people working with long cigars right now are aware of the consequences of long cigars and will use a new htslib. When long cigars become common, most will have migrated to a new htslib. If you look back at this issue and my PR a few years from now, you will see it affects almost nobody in practice.

sjackman commented 7 years ago

If the consensus is that we're okay with tools silently ignoring records with long CIGAR strings, then setting the BAM CIGAR length to 0 in BAM and to * in SAM should be enough to achieve this behaviour, without setting the unmapped flag. I haven't checked the behaviour of samtools when the BAM CIGAR length is set to 0, and the unmapped flag is not set.

An old samtools view will add the unmapped 0x4 flag when it sees * in a SAM file and emit the warning [W::sam_parse1] mapped query must have a CIGAR; treated as unmapped, as mentioned previously.

jkbonfield commented 7 years ago

A good point on the SAM parser being (mostly) irrelevant, but as you've already pointed out, your solution has code to deal with converting a recent BAM to SAM via an old tool and then parsing that SAM with the latest tool chain, so it could still happen, albeit rarely.

@lh3 If you've found that manually tweaking and/or corrupting a BAM file can lead to a segmentation fault in any of our code, please do raise a specific ticket with the details and ideally the test data. This is a bug and all such cases should return a non-zero exit code instead of crashing.

My gut feeling here is to keep within the current SAM spec. That way if it causes a problem for go bam, biobambam, io_lib, bamtools, etc then so be it - it's a bug in their code (and as previously noted I feel current htslib is bugged too) and we can justifiably make the change anyway. If we observe it breaks a tool then it's our duty to let them know, but it's not really our duty to avoid breaking it if we're still adhering to the spec. IMO that really means that changes like 'Z' for cigar shouldn't be considered if there are any good workable alternatives out there.

As for tools silently ignoring long CIGAR strings, realistically the tools right now aren't being used because they can't be used and we're talking about future data coupled with old tools (or modern tools linked against old libraries). This is a potential problem, but maybe it can be mitigated. Eg if we have 1 read out of 1000 that has n_cigar > 65536 then an old tool would just give slightly inaccurate results, likely not noticed and potentially leading to false claims. If however 1000 out of 1000 reads are encoded using * and CG:B,I then the inaccurate results are so obvious that even with a zero exit status it'll raise questions that need investigating. This then makes me wonder if there should be options in the tool chain to forcibly switch to CG:B,I for all cigar data so people generating the data files, if they know their data has a realistic chance of being too long, can produce files that aren't going to lead to subtle bugs with older tools.

Agreed, forcibly doing something illegal is an even surer way of making the inaccurate assessment more obvious, but it raises other challenges, and we can't be sure people would notice those illegalities either!

jkbonfield commented 7 years ago

Also, for what it's worth, io_lib has been using the bin field to code this for years due to the occasional need to store Gap5 consensus sequences in BAM (very similar to the issue in Mira which originally reported this problem back in 2013). https://www.freelists.org/post/mira_talk/Parse-error-at-line-84-unmatched-CIGAR-operation

I believe failure to decode the cigar via this method leads to mismatched sequence and cigar lengths, which causes hard failures in decoding. However, it's unlikely anyone ever tried to use my own BAMs with anything other than io_lib and Gap5, so it's hardly had rigorous cross-tool testing. I also have no idea which tools take bin verbatim rather than computing it on the fly (as htslib and samtools have done for years). If we're looking at making minimal changes to the BAM specification, then requiring bin to be ignored when the top flag bit is set may be back on the cards again?

jkbonfield commented 7 years ago

Some investigations on how to make BAM files blow up the tools. (Not that I'm specifically advocating it, but it's useful research when laying out the options available.)

Regarding SAM vs BAM parser, Heng's correct that a lot of checks are only in the SAM side of htslib. Possibly the same too for Picard.

Producing a malformed BAM file (with an invalid CRC too, as I edited the binary stream without computing a new CRC) gave me a BAM file that both htslib and picard were happy to display, with mismatching cigar and sequence lengths. Not even checking the CRC is shoddy really! Scramble notices the CRC, but with the -! parameter to disable checking it also happily converts the BAM to SAM without noticing the cigar length mismatch. Samtools mpileup also swallows the BAM and produces (an incorrect) pileup output.

Setting flag to 32768 is also accepted without question by both tools, so there is no validation on flags.

So we can conclude there really aren't many ways you can make data invalid such that samtools notices! I think it was designed for speed with absolutely minimal processing of the BAM struct. It does however spot a BAM\1 magic number change, as does picard.

About the only change I can think of that is likely to trigger a catastrophic error without changing magic number is setting n_cigar to 0xffff but not leaving room for the cigar operations (and instead using that as a marker for "too many, see CG tag"). This then gets the BAM records out of sync and subsequent decodes will fail, but it's not necessarily guaranteed to - it's possible it could fluke itself past a few hundred BAM records (interpreted as an "interesting" cigar string) and get back in sync again. Maybe... it all depends whether the software validates the block_length field (I didn't check).

In short, we're probably stuffed if we want to keep the magic number and make software fail on new data.

lh3 commented 7 years ago

it's possible it could fluke itself past a few hundred BAM records

Typically you only see long cigars when reads are longer than ~300kb. The exact length depends on the scoring matrix and how permissive aligners are. 64k cigar takes 65536*4=262144 bytes. 300kb takes 150k bytes in memory. So, if you set n_cigar to 0xffff, htslib will mostly take SEQ as CIGAR. If SEQ is not long enough, htslib will read into the area of QUAL and then optional tags. Because we are appending the real CIGAR as the last tag and that cigar is longer than 64kb, htslib will not trigger an invalid memory access. It will just take everything as CIGAR. Something like bam_cigar2rlen will repeatedly overflow and return weird numbers, including negatives. I don't know what that will cause. However, if bam_cigar2rlen happens to return a sensible number, it will fool downstream code and let them silently produce wrong results. Either way, we can't reliably make htslib crash or abort due to errors.

For security purposes, it would be good to have a function to check the integrity of bam1_t, at least making sure normal htslib APIs don't read beyond bam1_t::data – for now, carefully crafted n_cigar, l_qseq and l_qname could trigger an invalid memory access somewhere, I think. Additional checks could require that the reference and query lengths inferred from CIGAR don't overflow 31 bits, etc. bam_read1 could optionally call this function. EDIT: also traverse the tags. A crafted size in the B type could also trigger invalid memory access.
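
A minimal sketch of such a check, assuming htslib's bam1_t layout (qname, cigar, 4-bit-packed seq, qual, then aux, packed consecutively into b->data); bam_sanity_check is not an existing htslib function, and a real version would also walk the aux block as noted above:

```c
#include <htslib/sam.h>

/* Return 0 if the declared lengths in b->core are consistent with the
 * amount of data actually present in b->data, -1 otherwise. */
static int bam_sanity_check(const bam1_t *b)
{
    const bam1_core_t *c = &b->core;
    size_t need = 0;
    if (c->l_qseq < 0) return -1;           /* declared lengths must be sane */
    need += c->l_qname;                     /* read name (incl. NUL padding) */
    need += (size_t)c->n_cigar * 4;         /* 32-bit cigar operations */
    need += ((size_t)c->l_qseq + 1) / 2;    /* 4-bit packed sequence */
    need += (size_t)c->l_qseq;              /* quality values */
    /* Anything left over is the aux region; a fuller check would walk it,
     * validating each tag's declared size (especially B-type arrays). */
    return need <= (size_t)b->l_data ? 0 : -1;
}
```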

bioinformed commented 7 years ago

You are all aware that htslib is used in some way by most sequencing vendors and is part of the ecosystem that produces results for hundreds of thousands of patients undergoing genetic testing (like you and your relatives, should the need arise)?

Clever technical hacks to sneak in specification changes to the BAM format are fine for research-use software, but seem like an unnecessary risk for such key NGS infrastructure. Why not issue a revised spec to deal with encoding long CIGARs in the most sensible manner, bump the file version number and let these new files be unsupported except by tools that adhere to the new spec? Isn't this how file format specifications are supposed to work? Given the fringe nature of the application in question, this standards-based approach seems vastly simpler and healthier for the ecosystem than the proposed alternative.

Thanks, -Kevin (speaking for myself and not my employer)


sjackman commented 7 years ago

@jkbonfield wrote…

Eg if we have 1 read out of 1000 that has n_cigar > 65536 then an old tool would just give slightly inaccurate results, likely not noticed and potentially leading to false claims. If however 1000 out of 1000 reads are encoded using * and CG:B,I then the inaccurate results are so obvious that even with a zero exit status it'll raise questions that need investigating.

Good suggestion, James. The command line option could specify which encoding to use, e.g. --cigar=short, --cigar=long or --cigar=mixed.

I'd suggest that the default behaviour for samtools could be --cigar=short to start with, eventually changing to --cigar=long when most other tools have caught up, with --cigar=mixed only being used when specifically requested by the user.

lh3 commented 7 years ago

If those in clinical labs care about security, they should upgrade to the latest version. They will be fine. It will take several years for clinical labs to use ~300kb reads. We are basically issuing a warning several years ahead: BAM has a bug with long cigars; please upgrade. In practice, this long-cigar concern is minor compared to a whole lot of other security/correctness issues in clinical pipelines.

One reason that BAM sticks around is its long-term stability: older tools work with recent BAMs. Changing the magic number breaks this, while many tools would handle files containing long cigars just fine. To me, every design decision is about trade-offs. I don't see this long-cigar issue as big enough to warrant a magic number change.

Encoding all reads with CG:B won't work well because long cigars only affect ~0.1% of reads so far. We can't tell beforehand if our SAM has a long cigar. We would need a 2-pass procedure. We would have to add new functions and CLI flags to enable that. At the low level, htslib doesn't allow flags to be passed to bam_write1(), so we would also have to restructure the APIs a bit or add a global flag as a hack. In the end, we would pay in effort and increased code complexity for a theoretical problem that probably affects no one in practice.

jkbonfield commented 7 years ago

@bioinformed I agree, but I think it was discussed somewhere (I forget where, maybe a GA4GH conference call) and the conclusion was there is no appetite for bumping the BAM version. Personally I'd be inclined to say that if there is something which cannot be encoded in BAM then perhaps that is a good time to ditch it and move to CRAM, which gives a huge saving in disk space. If there are defects in doing that, then that is where we should be expending our effort, not on coming up with a new major BAM version.

You could argue that doing this requires people to upgrade software etc, but so does changing the BAM specification. The only positive thing about keeping BAM is for third party implementations in whacky languages that haven't simply been implemented to link against htslib or htsjdk.

I guess practically speaking, if we had a new BAM version we'd have to make the tool chains write out the old one by default with an explicit flag to switch to the new one (and probably use it for internal things like temporary sort buffers). Or even go down the route of never using the new version until we actually hit the "cigar is too long" error case, although that's messy in pipelines. This would limit the impact to only the data which is likely not to be representable in the current BAM\1 anyway. (I'd also take it as the right time to switch the cigar and read-name data around in the format so we no longer have to work around unaligned memory accesses!)

All that said, I'm not actually against changing the magic number myself if that's what people wish - I can see both sides of the coin on this one. I'm just repeating my recollection from previous discussions. Maybe people had time to think on it more too.

yfarjoun commented 7 years ago

Here's another suggestion. (Assuming you all aren't sick of this thread by now.)

To signal an extra long cigar, modify the cigar to be: <ReadLength>S, i.e. an entirely clipped read. This will have a similar effect (as unmapping) on old programs as they will presume that there's no good data in the read, but it will enable spec-compliant programs to take this value as a sentinel indicating that there might be a different cigar string in the CG tag.

Advantages of this approach:

What do people think?
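
For illustration only, a reader honouring this sentinel might look roughly like the sketch below. The CG tag name and its B,I layout follow the proposal being discussed rather than a ratified spec, real_cigar is a made-up helper, and a production version would need to deal with the unaligned pointer into the aux data:

```c
#include <string.h>
#include <htslib/sam.h>

/* If the in-record CIGAR is the proposed sentinel (a single soft-clip
 * spanning the whole query) and a CG:B,I tag is present, set *n_ops and
 * return a pointer to the real cigar ops held in the tag; else NULL. */
static const uint32_t *real_cigar(const bam1_t *b, uint32_t *n_ops)
{
    const uint32_t *cigar = bam_get_cigar(b);
    if (b->core.n_cigar != 1 ||
        bam_cigar_op(cigar[0]) != BAM_CSOFT_CLIP ||
        bam_cigar_oplen(cigar[0]) != (uint32_t)b->core.l_qseq)
        return NULL;                          /* ordinary record */

    const uint8_t *cg = bam_aux_get(b, "CG");
    if (!cg || cg[0] != 'B' || cg[1] != 'I')  /* expect an array of uint32 */
        return NULL;

    memcpy(n_ops, cg + 2, 4);                 /* 32-bit element count */
    return (const uint32_t *)(cg + 6);        /* the real cigar ops follow */
}
```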

lh3 commented 7 years ago

@yfarjoun and I have discussed this option. My concern is that a <read_length>S is rare in existing BAMs. It might trigger hidden bugs in old tools that haven't considered it. Nonetheless, this is a minor concern. I can live with it.

I have implemented the idea. Multiple old versions of samtools view and mpileup silently ignored reads with a full-clipping CIGAR, as is expected. IGV-2.3.82 did the same.

@jkbonfield what do you think? If you can't think of problems this solution may cause, we (Broad, Sanger and myself) will have a consensus.

colinhercus commented 7 years ago

Old (i.e. not updated for this change) mark-duplicate programs that adjust the mapping position using soft-clipped bases would do the wrong thing with these BAMs.


lh3 commented 7 years ago

@colinhercus You are right, but I wouldn't worry about old tools that are not intended for ~300kb reads.

jkbonfield commented 7 years ago

@yfarjoun I can see that working, although it does still feel unclean compared to using CIGAR * to mark the data as unavailable (in BAM only). Going through the 3 points in turn.

All things considered, it's the least bad option so far!

jkbonfield commented 7 years ago

@colinhercus What duplicate marking programs modify CIGAR strings and/or coordinates, and why? It seems an odd thing for dup marking, unless we're talking about tools that attempt to avoid overlapping reads on the same template from being overlapping by clipping one or both?

Edit: discussing it with colleagues I understand now. The dup programs aren't modifying data, but they're internally using this data to adjust the position before looking for duplicate positions. However I don't see how this change breaks it significantly.

Short reads would be fine. Long reads will be wildly differing lengths, which means even if we set a huge soft-clip then we're very unlikely to have both pos + len being the same. If you want to be 100% cast iron certain the computed position doesn't change, then I guess we can use 0MxS (or nS0MxS in the case of fixing up a very long cigar string that already started with nS). This is still something we can detect as having zero bases matching the reference, in conjunction with our CG aux tag.

colinhercus commented 7 years ago

@jkbonfield we don't modify the cigar, but we subtract the length of the soft clip from POS to get a soft-clip-adjusted mapping location, which is then used to detect duplicates (same mapping location for the read 5' end). It's possible that duplicate reads may have different-length cigars and hence some have the fully soft-clipped CIGAR and some the full CIGAR.

What's the chance of someone marking duplicates on one of these BAMs without having an updated program?

Possibly slim, but if it does happen there is no error reported, just incorrect operation.

I also have difficulty insisting users update to the latest and greatest version of everything. I wish they did, but in practice they download a version, test it, develop their pipelines and then like to stay on that version for a long time. We have users on 8-year-old versions of novoalign. They are happy with it and keep using it.

We can adopt these changes, "certify" our updated programs to handle long CIGARs, and notify most of our users, but as long as you don't change the magic string, old programs run on BAMs with long CIGARs may do the wrong thing and no one will be the wiser.


bioinformed commented 7 years ago

@lh3: Clinical labs do care about security, but upgrading to the latest versions of software is complicated and not always possible. When we can update versions, it often requires non-trivial work in terms of verification, validation, and generation of regulatory documentation. Changes to specifications and implementations that introduce new and complex behaviors in an uncontrolled manner have to be taken into account even if such data are not in routine use. And by uncontrolled, I mean that there is no clear designator to delineate requirements/conformance to a particular version of a specification versus an altered version of that specification. So my concern has almost nothing to do with the technical complexity of the proposed change or the rarity of the data. It is about introducing an uncontrolled change into a core standard and piece of infrastructure, which adds unnecessary risk and further fragments the semantic interoperability of a somewhat fragile ecosystem, all for very little benefit.

@jkbonfield: Appetite for it or not, at some point it becomes pointless to have a specification that isn't versioned and for which implementations cannot be verified for conformance. If there is truly no appetite for introducing a new BAM version, then the logical outcome should be to reject any proposal to change BAM encoding semantics.

I can see (and benefit from) how much care is taken by the htslib developers to improve the quality of implementation, particularly around robustness and security. However, much of those efforts become meaningless when the only specification is the current version of the code and the only authoritative documentation for what behaviors to expect is the code itself. All of this is making me wonder whether it is a serious goal for this community to adopt practices that will enable htslib and the related tools to be safely used for delivery of health care.

Thoughtfully, -Kevin (still speaking for myself and not my employer)


jkbonfield commented 7 years ago

Changing the BAM version number requires you to upgrade your software if you wish to work with modern data sets. Changing the encoding of ONLY extra-long reads doesn't require you to change your software if you don't wish to work with long reads, but you can still work with modern data produced by others.

So there are pros and cons to all decisions. I'd love for it to be a simple case of this is right and that is wrong, but whatever we do we cause work for others. The discussions here show that we're actively engaging and trying to come up with the most sensible solution, with the ideal scenario being one that doesn't invalidate old software but is detectable when old software meets new data that it cannot process. This is, sadly, easier said than done. If it's not possible to achieve then I would indeed back a bump to the major version number of BAM.

(And in doing so would also back for the minor version number to be mandatory, which unfortunately it isn't right now!)

bioinformed commented 7 years ago

@jkbonfield: I apologize if I'm coming across as sanctimonious, judgemental or ungrateful to this community. It has been one of the more challenging aspects of my last several jobs to bridge the gap between the great "research-grade" software that is generated by our field in general and adapting it to be used as part of delivering healthcare. Much of that challenge is demonstrating to our lab directors, quality-assurance staff, inspectors and regulators that we can safely use Software of Uncertain Provenance (SOUP), which encompasses virtually all academic and commercial NGS software that was not designed from the start to comply with medical device quality regulations.

So I am asking this community if they have the desire to adopt a more conservative approach to specifications and change control or not. If the answer is no, then another form of pragmatism kicks in and we have to reevaluate whether building clinical applications on top of htslib is the best way to proceed.

Thanks, -Kevin (definitely not speaking for my employer)


lh3 commented 7 years ago

All things considered, it's the least bad option so far!

So, we are moving towards a consensus. I have updated samtools/htslib#560 to store an alignment with a long CIGAR using a fully clipped fake CIGAR. I have also created #227 to make modifications to the SAM spec and the tag spec.

jkbonfield commented 7 years ago

What did you think about the approach of (eg) 0M10000S for a cigar string that originally started with, say, 10M or 10S0M99990S for a cigar string that originally started with 10S? It keeps the pos field computed by algorithms such as mark dup the same, but it's still got the CG tag so the real aux data is elsewhere and overrides this fake cigar? If 0M is problematic then try 0N or similar instead?

lh3 commented 7 years ago

@jkbonfield: @yfarjoun and I have also discussed something similar, including using padding. In my view, a more complex fake cigar has three drawbacks: 1) it makes the fake cigar look even uglier; 2) it increases the chance of triggering hidden bugs in old tools; 3) it further complicates coding (e.g. we need to consider a CG tag with one CIGAR only in theory). At the same time, using a complex fake cigar has little added benefit because most old tools that prefer 0M10000S are not supposed to work with ~300kb reads. We might be complicating code and spec for a problem that does not exist in practice.

daviesrob commented 7 years ago

Sorry, I think the only sensible solution here is to bump the BAM version number.

There are basically three options that have been discussed here:

  1. Try to write a BAM file in such a way that old tools will fail when they come across a long CIGAR.

This is impossible to do reliably, especially if the file is read by old tools.

  2. Write the file in a way that causes old tools to ignore the reads with long CIGARs, but otherwise work.

This means they may silently give the wrong output. This is not at all a good idea, especially if we want GA4GH specifications to be taken in any way seriously.

  3. Bump the version number in the BAM file.

This is, I think, the only realistic solution. Anyone unable to read the file is pretty much guaranteed to find out instantly.

The way I would implement this would be to write out version 1 BAMs for now (unless the input happened to be a version 2 file), with a command-line option allowing the use of version 2. If a tool writing a BAM1 file comes across a long CIGAR, then it should fail, writing out a message describing how to upgrade to BAM2. This gives time for users to upgrade to versions of software that can understand the BAM2 format. Eventually (when long CIGARs become more common than they are now), we could flip over to generating BAM2 by default.

The people most affected by this would be the ones using very long-read platforms. But as they're in the process of writing software to use this data, they're also likely to be the ones in the best position to absorb the changes.

jkbonfield commented 7 years ago

So going back to some of my earlier comments:

Eg if we have 1 read out of 1000 that has n_cigar > 65536 then an old tool would just give slightly inaccurate results, likely not noticed and potentially leading to false claims. If however 1000 out of 1000 reads are encoded using * and CG:B,I then the inaccurate results are so obvious that even with a zero exit status it'll raise questions that need investigating.

This still applies, even to Heng's 100000S idea. If we do anything that doesn't change the BAM major version number then it has to be blatantly obvious rather than subtle and potentially under the radar.

All that said, I'm not actually against changing the magic number myself if that's what people wish - I can see both sides of the coin on this one. I'm just repeating my recollection from previous discussions. Maybe people had time to think on it more too.

Despite discussing various ways that modifying the existing BAMs can be achieved in a less dangerous manner, I'm still in agreement with what I stated there. I certainly wouldn't object to a major version number change either. However, if so, we absolutely should fix some of the other simple issues too, such as making the minor version number mandatory (this has implications for BAM, where even the textual form of the SAM header is completely optional, let alone whether it contains an @HD line) and tightening up the overly liberal list of characters permitted in reference and query names.

From @bioinformed: So I am asking this community if they have the desire to adopt a more conservative approach to specifications and change control or not. If the answer is no, then another form of pragmatism kicks in and we have to reevaluate whether building clinical applications on top of htslib is the best way to proceed.

I do try to take these issues seriously and have hopefully managed that with any CRAM changes (although the spec could still do with a major revamp to tidy it up). However, we also have to be serious about not breaking the huge wealth of existing code on short-read data, so like I say, both sides come with problems. Like it or not, anyone running an existing pipeline in a clinical setting designed around short-read technology and then shoving a bunch of long-read data through it without reevaluating whether that pipeline works already has bigger issues than this corner case! If we can make it blatantly obvious that our code doesn't support this data (until upgraded), is that sufficient? It's not ideal, and a hard failure would work better for sure.

Practically speaking, we need to work out how this is achieved in actual software, as a spec change is meaningless without code. If we want something completely transparent that requires no extra command line arguments, then we only have two choices - always write a new BAM format (and break the entire world), or keep the existing format and silently change the fewest reads possible (risking it being very hard to detect for old software). Neither of these is ideal. If instead we accept that the current state of play is OK (works on all data bar very long reads, on which it hard-fails to interpret), with a command line argument to control how to proceed further, then it actually opens up all the possible scenarios with equal "pain" in implementation and use within pipelines, but differing outcomes on compatibility / visibility of change.

lh3 commented 7 years ago

However if so [changing BAM magic], we absolutely should fix some of the other simple issues too

In principle, I don't object to bumping the magic number if we have accumulated enough breaking changes that together will simplify our lives. However, we should be aware that changing versions may involve complex behaviors. BAM parsing libraries need to consider:

BAM1 is 8.5 years old. This is the first time I have thought through what a BAM version bump means for us. Some of the concerns in the list may turn out to be minor in practice, but other new ones may arise. If we want to go down this route, we should carefully consider all the side-effects: it is not just htslib that we need to change; it is the established ecosystem that carries the most weight.

[from @bioinformed] However, much of those efforts become meaningless when the only specification is the current version of the code and the only authoritative documentation for what behaviors to expect is the code itself.

On the day I submitted samtools/htslib#560, I said "Yes, hts-specs needs to be changed first". I have submitted a pull request, too. This will be documented in the spec such that other libraries can implement it. I may submit a pull request to bamtools for long cigar once the issue is settled here. This will get bedtools ready for ultra-long reads.

jkbonfield commented 7 years ago

What do version numbers mean for compatibility? Say we add a minor version number to BAM, like BAM\2\1. Should we let programs abort if we see BAM\2\2 in future? If so, there is not so much need for a minor number. Perhaps the desired behavior is to let programs give a warning? We need to explain this in the BAM spec.

The short answer to this is we should fail on minor version number increments. (Perhaps with a command line option to fly by the seat of the pants and continue anyway.)

If you look at the spec changes https://github.com/samtools/hts-specs/pull/218/files you'll see the 1.x upgrades have traditionally been due to major new features which, while forwards compatible, may cause backwards incompatible changes. We've also had dozens of minor updates, which are both forward and backward compatible and which don't warrant a minor version update. These would be equivalent to the third version number in other projects - e.g. 1.5.0 to 1.5.1.

1.3: Added cigar N operator.
1.4: Permit IUPAC in sequences and QNAME *; added B aux type.
1.5: Supplementary flag.

This means code built against an htslib that predates SAM 1.5 would incorrectly handle 1.5 data and may be misinterpreting flags. Whenever the SAM version changes, you can't assume it'll be understandable by old code! This is nothing new.

I know here we're talking about BAM format and not SAM, but a similar logic applies. This is why I posted https://github.com/samtools/hts-specs/issues/228. We need to be more serious about this for future updates.

daviesrob commented 7 years ago

OK, so I've been doing a bit more work on this one. So far, I have two solutions:

Most sensible: change the BAM magic number and update the format to allow a 32-bit n_cigar.

Less sensible and hacky, but it does seem to work:

In the BAM alignment record, write the block_size as -8. This magic value has the useful property that bam_read1() in old versions of HTSlib/SAMtools should either crash or return an error code on reading this (as the return value is actually block_size + 4.) After the fake block_size you then write the real block_size, then the normal BAM record up to tlen (taking care to set n_cigar_op in flag_nc to zero), then the real n_cigar_op as a 32-bit value and then the rest of the record.

There's a test implementation in https://github.com/daviesrob/htslib/tree/bam_cigar_hack if you want to try it. It can round-trip its own output in BAM format successfully.
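
A sketch of the resulting on-disk layout, as described above (the linked test branch is authoritative; this is an illustration of the description, not the branch's actual code):

```c
/*
 * Hacked long-cigar record (integers little-endian):
 *
 *   int32   block_size      = -8   // old bam_read1() returns -8+4 = -4: error or crash
 *   int32   real_block_size        // size of everything that follows
 *   ...the usual BAM fields up to and including tlen, with the
 *      n_cigar_op half of flag_nc set to 0...
 *   uint32  n_cigar_op             // the real 32-bit operation count
 *   ...read_name, cigar, seq, qual and aux data as normal...
 */
```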

Testing on older tools, I got:

samtools 0.1.19: seg fault
samtools 1.1+: "truncated file", exit 1
picard (all versions): exception: "Invalid record length: -8"
scramble 1.13.*: seg fault
scramble 1.14.*: "Failed to decode sequence", exit 1
biobambam: attempts to read a 4G record, then fails either with EOF or "Length of null terminated query name is inconsistent with alignment header".

Apart from biobambam, I'm reasonably pleased with those results.

daviesrob commented 7 years ago

My alignment of the twelve biggest reads in @nickloman's dataset finished (after 7 hours!) The results are in ecoli_bigreads.sam.gz (7.3Mbytes) if anyone wants a test case. There are two reads that go over the 64K CIGAR operations limit:

| Read | CIGAR ops |
| --- | --- |
| 9fab1cc8-ba25-42a1-be65-e0a27875883c_Basecall_1D_template | 70232 |
| 0164a283-91f7-4d4f-9090-45abcb6fe824_Basecall_1D_template | 75969 |

lh3 commented 7 years ago

@daviesrob Your solution changes the BAM file structure. I think it is a little too hacky. It will also be tricky to explain formally in the spec. I also don't like that it crashes tools even when the CIGAR is not used. That said, I am OK with it if others think it is better. So far, the following three solutions are acceptable to me:

  1. Move long CIGAR to an optional tag. This is the solution @yfarjoun, @jkbonfield and I have agreed on.

  2. Reuse bin for the higher 16 bits of a large n_cigar (the proposal from 2013). In an internal conversation, @jkbonfield suggested that this solution has some value: it often crashes older tools and is easy to implement. @jkbonfield and I are OK with this.

  3. Write negative block size (@daviesrob's solution). This is not my top choice, but I can live with it.

By the way, for ultra-long reads, minimap2 will be better. It took 3 seconds to align these 12 reads.

yfarjoun commented 7 years ago

This has been open for a while, but a vote is coming up, so I'll voice my opinions vis-à-vis a version bump.

A version bump is a large undertaking. It requires code to support both new and old versions, keeping track of which is which and validating and parsing correctly. Given this, I think it would be wise to collect the many issues that we would like to include in a version bump so that we do not need to change the version again for a while after we do so. This will take a long time: debating, reading through old issues, talking to folks, etc., and agreeing on what's in the new version and what is not.

I agree that the New Version should have more room for cigar operators. I suspect that there are a few other things we'd like to put in "while we are at it...".

I think that in the meantime, a hacky solution that is quick to implement and enables the folks that need access to long cigars to get it is good to have, and having it would free us to have a good conversation about BAM 2.0 without feeling rushed.

daviesrob commented 7 years ago

Hacking around this problem is a very bad idea. Whatever you do will fail when the data is passed into old tools that are not aware of the change. They will either blow up in random ways (leading to lots of unnecessary support requests), or even worse apparently work but give misleading results. The fact that this is undesirable has now been discussed at length.

The only way to reliably indicate that long CIGAR strings are present is to change the BAM version number.

Note that this change would only affect the BAM part of the specification. SAM can already handle long CIGARs, so it would not be affected. In the BAM representation, the only things I would do are:

  • Get rid of the pointless bin field. flag could be moved into there, allowing...
  • n_cigar_op to be increased to 32 bits.

And a couple of optional extras:

  • Swap round cigar and read_name.
  • (Maybe) add support for doubles in aux tags. HTSlib has unofficially had these forever, but they were never put into the specification.

We've been discussing this on and off since 2013, so I don't think anyone could accuse us of rushing the decision. Since then, the community using these formats has become much wider. Trying to hack around problems does not go down well with some members (see some of the comments above). If we want to be taken in any way seriously we have to do this sort of change properly these days, not come up with half-baked patches and hope that they will work.

bioinformed commented 7 years ago

I know that I sound like a broken record, but please remember that htslib and its ecosystem are used by clinical labs and in medical devices. While great strides are being made in terms of bringing the code up to the quality needed to safely deliver healthcare, even more diligence needs to be applied to maintaining our standards. Having standards that cannot be validated against defeats the purpose of having standards. Just as it is bad practice to revise a released version of code without updating its version, not releasing standards, or altering them post hoc without versioning, is even worse for our community.

Thanks, -Kevin


lh3 commented 7 years ago

@daviesrob if we want to have BAM 2.0, another important feature is to build a version check into the format. However, I actually don't think we should ever have BAM 2.0. A version bump has major effects on end users. The current proposal (BAM1 by default; abort on long cigar) looks minor right now, but will get worse and worse into the future. End users will be dealing with fragmentation and will encounter short-read BAM2 files which crash older tools. This will be a mess. All the features in your list are too minor for us to take this huge risk. In contrast, @yfarjoun's and my proposal only has a short-term effect (EDIT: well, it has a long-term effect, but the effect fades with time), and even in this short term very few end users will be hit by accident.

@bioinformed If clinical labs care about quality, they should use the latest htslib. Then #227 will work fine. Version bumping requires an upgrade, too. The two solutions are the same in this respect. As to "standards", we strive to achieve them only when they don't interfere with end users. But right now, the "standards" are getting in the way. User experience is far more important to me.

a vote is coming up

I am not in the format group, but do count my vote. Also, when voting, please think about the overwhelming majority of users who are not vocal; think what is best for them in the long term.

daviesrob commented 7 years ago

@lh3 The one thing that samtools has always checked is the BAM magic number, which is why changing it is the only way to solve this problem.

If we're careful when implementing the change then the disruption should be minimised. For example, to begin with we might only write BAM2 if either it was specifically requested, or if a BAM2 file was one of the inputs. That way, people working exclusively with short reads shouldn't notice much change to begin with.

lh3 commented 7 years ago

to begin with...

The disruption is only minor at the beginning. When BAM2 becomes more common, users will produce it for short reads too, assuming BAM2 is the way to go. This will start to break old tools, as well as the SRA etc. In addition, both we and users need to decide when to output BAM2 by default. It is not an easy decision. This is like a time bomb that may make a big splash in the future.

jkbonfield commented 7 years ago

This is just dragging on too long, so a summary of where we stand to try and clarify things may help. I see several possible solutions, with pros and cons. They broadly fall into these categories.

  0. NOP - do nothing! Tough luck, use CRAM :P

  1. Methods that change the on-disk representation in BAM for records with too many CIGAR elements, but keep the in-memory representation the same. Examples - using flags, aux tags, reusing the bin field. Pro: it's mostly transparent and easy to do. Con: sadly we have a lot of old software with embedded copies of htslib which won't do the reverse trick when reading BAM. This offers the potential for silent bugs.

  2. As per 1, but the encoders enforce all records to use the new CIGAR representation for BAM, enabled by a command line option, and we have a hard failure if it's not enabled and we find a long CIGAR. Pro: easy to do, and unlike 1 above it's more obvious if you read a long-read file using an old library, as all the data now appears to be unaligned - no subtle bugs. Con: we run the risk that people simply enable the command line option for all files, meaning a proliferation of data that old software intended to work on Illumina data no longer works on.

  3. As per 1 and 2, but we switch mode the instant we see a long CIGAR string. It's automatic and needs no command line option. Pro: we only ever have new-format files for long reads, and old software will work with old sequencing technologies just fine. Con: realistically we're back to square one on the silent-bug front. Potentially we may have long cigars only at the end of our data stream and thus have hard-to-spot changes, but usually it's still fairly obvious.

  4. Some malformed BAM record, such as changing the size field. Pro: it's automatic with no command line option, thus old software works fine on all reads that currently can be supported by BAM. Old software on long-cigar files will crash and burn rather than silently continuing. Con: it's a total and utter hack, plus the way of reporting failure is extremely unfriendly and will lead to bug reports.

  5. New BAMv2 (perhaps referred to as BAML or BigBAM). Pro: it's the cleanest solution, plus it may actually force all that old legacy code to be updated. We also get the option to fix some other issues. Con: the old legacy code may not update after all (possibly unmaintained). It has the same issue as 2 in that people will likely just hard-code e.g. a -v2 option in their pipeline and we get a proliferation of needlessly v2 BAMs out there. It's also rather unobvious - BAMv2 is BAMv1 but better, so why wouldn't we want to use it? (Edit: but see the BAML/BigBAM name suggestions to combat this)

dkj commented 6 years ago

W.r.t. option 5 "New BAMv2", how about trying to avoid the human "proliferation of needlessly v2 BAMs" problem by not calling it BAMv2: call it BAML (or some variant) instead?

droazen commented 6 years ago

I vote for option 5 (BAM v2). It would force us to finally version our bam parsers, which would make issues like this much easier to deal with in the future. Without versioned parsing code we are always walking on eggshells when considering changes to the spec.

jkbonfield commented 6 years ago

We decided to go with the above categories (and expand on them later if we pick something which has multiple implementations; e.g. aux vs flag vs bin) and poll the user base to see how people prefer the tools & formats to work. I took the liberty of sneaking in the "do nothing" option in case all of the above are too unpalatable.

My primary preference is for 5, but I could accept 2; I'm most strongly against 1.

jkbonfield commented 6 years ago

Question: how wide should we cast the net? Ideally we want tool developers to comment on what works for them too as without software support it dies.

sjackman commented 6 years ago

I prefer 2 over 5 over 0 over 3 over 1 over 4. With ranked ballots we can determine the outcome of the vote using Condorcet voting. See for example https://www.debian.org/vote/2015/vote_001#outcome