samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
638 stars 174 forks source link

More than 65535 cigar opcodes #40

Closed jkbonfield closed 6 years ago

jkbonfield commented 9 years ago

There was discussion on the samtools-dev mailing list about this last year:

http://sourceforge.net/p/samtools/mailman/message/30672431/

The main crux of the discussion there was to reuse the 16bit bin field to act as the top 16 bits of ncigar, possibly using the top flag bit as an indicator.

There are some other discussions (internal) regarding this, including possibly removing bin completely given that it has no meaning with CSI indices, so this issue is largely just a note to track the issue and collate ideas/fixes.

Note that the problem is definitely a real one. I have hit this with real projects, caused when a user merged consensus sequences from an assembly into a BAM file, but it is also not too far away with actual sequence reads too. Newer technologies (PacBio, ONT, maybe more) offer substantially longer read lengths but also with higher indel rates leading to substantially more cigar opcodes. A 320Kb with with a 10% indel rate would lead to 2 changes (D/I to M) every 10 bases, giving 64k cigar ops. (Those aren't figures for real technologies, but it's not inconceivable.)

lh3 commented 9 years ago

We should implement this feature. I would prefer to keep bin for backward compatibility. This solution is a little awkward, but compatibility seems more important to me.

nh13 commented 9 years ago

We do not use CSI indexes within HTSJDK and Picard, so I would have to think about this in our production pipeline, even as we are moving to CRAM. My apologies for not having a definitive yes/no answer at the moment.

jkbonfield commented 9 years ago

Do you trust and use the bin field though, or recompute it on the fly?

nh13 commented 9 years ago

Looking at our code, it currently reads the bin from from the BAM file. If validation stringency is turned on, which is most of the time, then it checks that the bin is correctly set. If the alignment is modified in memory, then the bin is recomputed before being written. We trust the bin fields since we are most likely the one's that computed the bin field in the first place.

nh13 commented 9 years ago

So looking through our code, we could also not use the bin field, and re-calculate it on-the-fly. We could then implement using a bit in the flag field to say whether or not to use the 16 bits in the bin field. I would suggest we then get rid of it in the specification. Please let us know how you would like us to proceed.

nickloman commented 7 years ago

This is now an active issue with ONT data, I've provided a dataset in FASTA that will routinely run into this limit here when aligned with BWA-MEM:

http://s3.climb.ac.uk/nanopore/ecoli_allreads.fasta

jkbonfield commented 7 years ago

Thanks for the real world examples Nick.

The htslib fix John made above solves this for SAM and CRAM, but not BAM due to file format limitations. I verified this on a CRAM file with 178801 cigar operations and the round trip from SAM -> CRAM -> SAM is identical in the latest develop branch.

Following discussions on samtools-devel several years back https://sourceforge.net/p/samtools/mailman/message/30668698/ I implemented this out of necessity in Staden "io_lib", so scramble can read and write BAM records too with more than 64k entries too. I'm not advocating using that, but it's a practical demonstration that the BAM spec could (and should?) be updated if there is pressure.

In the interim, have you thought of just using CRAM or compressed SAM? Note that you can use CRAM without needing to specify a reference, which may be more efficient with lots of reference differences anyway:

samtools view -O cram,no_ref -o test.cram test.sam
jkbonfield commented 7 years ago

A query on the proposed extra FLAG field. Is this only the BAM format we are changing, or is it SAM also? If SAM, do we therefore expose it? Eg a record with 100,000 cigar ops could have the SAM FLAG as >= 32768.

It feels like this is an internal BAM limitation and it should be in the BAM part of the spec and absent in the SAM part, as this is a binary encoding issue only.

peterjc commented 7 years ago

I would define the FLAG bit as reserved for BAM use as described, and say the bit should be clear in SAM.

tfenne commented 7 years ago

Given that we're likely to see more and more of this going forward, would it be crazy to consider a non-backwards compatible change to the BAM spec along with rev'ing the major version number? I think that would be much preferable in the long run to having a flag field that is reserved for BAM use and whose meaning is questionable for SAM/CRAM use. The latter means that all programs that convert between formats need to be smart about modifying the records prior to writing.

I would much prefer to see this done with a version bump in the spec tied to widening the ncigar field to 32 bits. Then implementations just need to know whether they are parsing a version < n or >= n BAM and expect the field of the appropriate width.

lh3 commented 7 years ago

We have maintained backward and nearly forward compatibility of BAM for over 8 years. In my opinion, we should keep it this way as long as there is a solution -- I am always on the compatibility side. Quite a few tools specifically depend on an old version of samtools or use bamtools, native golang, javascript etc to parse BAMs. @jkbonfield's solution will keep them usable, which I think is very important.

tfenne commented 7 years ago

I guess I'm not sure I follow. Won't all those implementations break given @jkbonfield's solution and a BAM file that actually contains > 65535 cigar operations for at least one record? Unless those implementations (perhaps conditionally on a flag bit) know to use the 16-bits of ncigar plus the 16-bit bin field to make one 32-bit number, they'll just read the exiting 16-bit ncigar field, interpret it as a number < 65535, and then fail when trying to parse the remainder of the cigar as something else.

The only kind of backwards compatibility it yields is that files that don't contain any cigars with ncigar > 65535 would still parse normally. I realize that's not a small fraction of files out there, but the point is that existing implementations would fail in bizarre ways given long cigars with reusing the bin field.

At the very least I think it would make sense to version the specification, and re-write the section that contains the current definition of bin_mq_nl and flag_nc to be a single uint64_t with 32-bits allocated to ncigar (albeit in 2 x 16-bit words). If we did this we: a) Would retain the same level of backwards compatibility b) Have something a little clearer to point to when old implementations fail (are you looking at a V2 BAM?) c) Make it explicit that the bin field is gone d) Have a somewhat less ambiguous spec moving forward

lh3 commented 7 years ago

Yes, forward compatibility is only achievable when there are less than 65536 CIGAR operations, but this is the whole point: we want BAMs without long CIGARs usable by all the old tools, including tools relying on 3rd-party BAM parsing libraries written in other languages.

re-write the section that contains the current definition of bin_mq_nl and flag_nc to be a single uint64_t with 32-bits allocated to ncigar (albeit in 2 x 16-bit words).

I don't think this is possible without breaking compatibility. If I am right, bin and n_cigar are adjacent in memory, but their order is not right. We can't use one 32-bit integer to replace the two 16-bit integers. For compatibility, we can only say higher 16 bits go to this integer and lower 16 bits go to the other.

Make it explicit that the bin field is gone

bin is only gone when there are long CIGARs; otherwise it has the same meaning as before, again for compatibility.

jkbonfield commented 7 years ago

Yes the order was incorrect to be a 32-bit integer (neither big nor little nor even pdp endian). It's all in that email discussion from the original post. I think that also shows it was Heng's idea originally, not mine. I just implemented it. :)

Frankly if we were going to break binary compatibility for ALL data, then we'd change lots of BAM things (eg swapping cigar and name around so we don't have to play games to avoid unaligned memory accesses), but it's just a fools errand. May as well start again with a new format or work on improving the next CRAM iteration than changing the BAM format in a non-compatible manner.

The only problem I see with using a flag is that you never really know whether you'll hit a record that can't cope until you find it, unlike chanigng the format. Hence I'd say we should bump the minor version number and leave the major version as is, keeping the file backwards compatible and stating that any tools that explicitly want to conform to the new minor version have to cope with the additional flag.

I did produce a test file using Scramble to see how samtools/htslib copes; it was badly. It started decoding the remaining CIGAR ops as sequence, emitted a bunch of binary qualities and then returned 0 exit status for good measure. So we know this will break things horribly IF we encounter more than 64k cigar ops, but it's already hopelessly broken in that scenario anyway (re: impossible).

colinhercus commented 7 years ago

I feel a bit stupid asking this but how can I find out the version of SAM spec's I'm reading? I know @HD has a VN tag but looking at several copies of SAMv1.pdf from my archives I haven't seen a minor revision number specified in the document even though it goes through regular updates. 2014 has examples (not specification) of HD with VN:1.5 && VN:1.3, 2016 has just VN:1.5 (in examples) but the document has a lot of changes since 2014.

Could we clearly state in the first 3 header lines what the SAM/BAM version number is for the document.

Thanks, Colin

On 9 March 2017 at 06:36, James Bonfield notifications@github.com wrote:

Yes the order was incorrect to be a 32-bit integer (neither big nor little nor even pdp endian). It's all in that email discussion from the original post. I think that also shows it was Heng's idea originally, not mine. I just implemented it. :)

Frankly if we were going to break binary compatibility for ALL data, then we'd change lots of BAM things (eg swapping cigar and name around so we don't have to play games to avoid unaligned memory accesses), but it's just a fools errand. May as well start again with a new format or work on improving the next CRAM iteration than changing the BAM format in a non-compatible manner.

The only problem I see with using a flag is that you never really know whether you'll hit a record that can't cope until you find it, unlike chanigng the format. Hence I'd say we should bump the minor version number and leave the major version as is, keeping the file backwards compatible and stating that any tools that explicitly want to conform to the new minor version have to cope with the additional flag.

I did produce a test file using Scramble to see how samtools/htslib copes; it was badly. It started decoding the remaining CIGAR ops as sequence, emitted a bunch of binary qualities and then returned 0 exit status for good measure. So we know this will break things horribly IF we encounter more than 64k cigar ops, but it's already hopelessly broken in that scenario anyway (re: impossible).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/40#issuecomment-285192561, or mute the thread https://github.com/notifications/unsubscribe-auth/ABcc-l6gN6u6I84BfcfMVjgiNoCDmt0Iks5rjy1kgaJpZM4ChFgT .

jkbonfield commented 7 years ago

Agreed it's unclear. It would be friendlier to be in the main title of the document. I suspect the version with examples containing different HD lines was an accident, but clearly this is best avoided by putting the actual version at the top rather than in an example.

Largely the changes boil down to two things - substantive changes in file contents and/or meanings, and minor language adjustments or clarifications. The latter do not change the minor version number as the file format is unchanged, however it would be good if there was some sub-revision counter somewhere so you know which era of document you are reading. Note also that SAM auxiliary tags have now been moved to their own document. Previously adding new tags would likely have meant increasing the SAM minor version number.

It gets more confusing though with BAM. BAM has a magic number "BAM\1" and no minor version number at all. I assume it is expected to be gleaned from the "HD" header, but this isn't a required field.

For the purposes of this proposed change, it will be challenging unless we go down the route of losing forwards compatibility and switch to "BAM\2". The SAM specification should not, IMO, be adjusted to indicate the size change of ncigar or add the flag as this is purely an internal BAM representation and not a change needed to the SAM specification itself (which is entirely textual). Mind you, the same is true for the bin field too - it's entirely a BAM optimisation and not in SAM.

colinhercus commented 7 years ago

A consumer of BAM files that is not programmed to handle the long CIGARS will break when it gets to the first record with a long CIGAR. I feel that's a bit late and very nasty.

Changing the magic number might be an easy way to tag the BAM as having the extended length CIGAR option. Producers of extended length CIGARs just need to set the new magic number. Those doing SAM to BAM that don't know at outset if they have long CIGARs could take an extra option to allow long CIGARS and generate the new magic string. These programs need to be updated anyway.

Consumers won't recognise the new magic number unless they've been updated to handle it.

On 9 March 2017 at 17:48, James Bonfield notifications@github.com wrote:

Agreed it's unclear. It would be friendlier to be in the main title of the document. I suspect the version with examples containing different HD lines was an accident, but clearly this is best avoided by putting the actual version at the top rather than in an example.

Largely the changes boil down to two things - substantive changes in file contents and/or meanings and minor language adjustments or clarifications. The latter do not change the minor version number as the file format is unchanged, however it would be good if there was some sub-revision counter somewhere so you know which era of document you are reading. Note also that SAM auxiliary tags have now been moved to their own document. Previously adding new tags would likely have meant increasing the SAM minor version number.

It gets more confusing though with BAM. BAM has a magic number "BAM\1" and no minor version number at all. I assume it is expected to be gleaned from the "HD" header, but this isn't a required field.

For the purposes of this proposed change, it will be challenging unless we go down the route of losing forwards compatibility and switch to "BAM\2". The SAM specification should not, IMO, be adjusted to indicate the size change of ncigar or add the flag as this is purely an internal BAM representation and not a change needed to the SAM specification itself (which is entirely textual). Mind you, the same is true for the bin field too - it's entirely a BAM optimisation and not in SAM.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/40#issuecomment-285304839, or mute the thread https://github.com/notifications/unsubscribe-auth/ABcc-l4cVeRPfJB323lupgW--0328RURks5rj8r4gaJpZM4ChFgT .

tfenne commented 7 years ago

@colinhercus The problem with setting the magic number that way is that producers have to know in advance whether they will produce such a record, and for a lot of tools that may not be the case either. E.g. a generic tool that converts SAM to BAM won't know until it hits the read.

There are other options we could consider, rather than just going with the one solution proposed. E.g. we might consider that if you have > 65k cigar operations that the right way to handle it is to push the remaining cigar operations into an optional tag, and either terminate the cigar field with a soft-clip that covers the remaining bases, or a new symbolic cigar operator that would a) inform up to date tools that there is more cigar elsewhere, and b) cause existing tools to terminate with an error because they don't recognize the cigar operator.

Re: ordering of bits and the bin field, I think I wasn't clear enough. I was suggesting two things: 1) If there are situations in which the bin field cannot be relied upon in all cases, the sanger folks already don't seem to trust it, and it can be re-computed from the record, I would prefer going forward to not have bin in the format description. 2) I was suggesting re-writing the document to collapse those two descriptions of 32-bit chunks of memory to a single description of a 64-bit chunk of memory, and then to document more clearly how each byte or word is used, following the solution you guys outlined to use two 16-bit words for ncigar.

jkbonfield commented 7 years ago

If we were to bump the BAM version number then I'd say the default should be the current BAM\1 value (unless the input data was BAM\2, in which case retain that) with command line switches to specifically request the newer format. That keeps the vast bulk of data in a well supported format. It's still problematic though.

There is certainly merit in using auxiliary fields to work around this issue in a completely compatible manner - a nice idea. However it migrates the problem from being BAM specific to also being SAM and CRAM oriented unless we simply patch the issue in BAM only: when BAM encounters a SAM cigar it cannot handle it stores it in a new tag, and on decode it reverses this so that it is transparent to the users. This can also make it transparent to the APIs.

Regarding bin, it cannot be relied on in some cases at least; eg when using CSI as you have an extra large chromosome. That means it's unreliable and best off computed on the fly using whatever index algorithm you have. We've had subtle bugs as well (cough, my bugs in cram!) which filled it out incorrectly in a few boundary cases.

colinhercus commented 7 years ago

@Timfennell

"@colinhercus https://github.com/colinhercus The problem with setting the magic number that way is that producers have to know in advance whether they will produce such a record, and for a lot of tools that may not be the case either. E.g. a generic tool that converts SAM to BAM won't know until it hits the read."

This is why I suggested adding an option to these programs so that they knew in advance. Obviously if the option wasn't set then an error "CIGAR Exceeds 65535" would be reported. Some producers could default to the new format and always produce bam\2 and their BAMs wouldn't be accepted by consumers that don't recognise the new format.

The advantage is that consumers that haven't been updated will not recognise the magic number and will not accept the file rather than crashing when they hit the first read with the new format.

They same undefined behaviour will happen with any consumer that doesn't understand the new format, what ever format you choose and it's unrealistic to think all consumers will be updated promptly.

On 9 March 2017 at 23:49, Tim Fennell notifications@github.com wrote:

@colinhercus https://github.com/colinhercus The problem with setting the magic number that way is that producers have to know in advance whether they will produce such a record, and for a lot of tools that may not be the case either. E.g. a generic tool that converts SAM to BAM won't know until it hits the read.

There are other options we could consider, rather than just going with the one solution proposed. E.g. we might consider that if you have > 65k cigar operations that the right way to handle it is to push the remaining cigar operations into an optional tag, and either terminate the cigar field with a soft-clip that covers the remaining bases, or a new symbolic cigar operator that would a) inform up to date tools that there is more cigar elsewhere, and b) cause existing tools to terminate with an error because they don't recognize the cigar operator.

Re: ordering of bits and the bin field, I think I wasn't clear enough. I was suggesting two things:

  1. If there are situations on which the bin field cannot be relied upon in all cases, the sanger folks already don't seem to trust it, and it can be re-computed from the record, I would prefer going forward to not have bin in the format description.
  2. I was suggesting re-writing the document to collapse those two descriptions of 32-bit chunks of memory to a single description of a 64-bit chunk of memory, and then to document more clearly how each byte or word is used, following the solution you guys outlined to use two 16-bit words for ncigar.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/40#issuecomment-285389630, or mute the thread https://github.com/notifications/unsubscribe-auth/ABcc-rFXquCdCViVPjyFR9gIc20W4QmQks5rkB90gaJpZM4ChFgT .

lh3 commented 7 years ago

I am reviving this thread again. With the release of NA12878 ultra-long nanopore reads, long CIGARs are becoming more frequent. After some thoughts, I now prefer the following solution as is suggested by @tfenne. When converting an alignment with >65535 CIGAR operations to a record in BAM file, we mark the alignment as unmapped (flag 0x4), set bam1_core_t::n_cigar to zero and move the entire CIGAR string to a tag CG:B,I. When newer htslib sees an unaligned record with a CG:B,I tag, it seamlessly generates the right CIGAR from the CG tag and removes the tag while reading. Normal htslib users would not notice any differences between short and long CIGARs. Older htslib will take records with >65535 CIGAR operations as unmapped and won't crash. Older htslib loses such long alignments, but it can't work with them anyway.

An alternative to CG:B,I is to simply use CG:Z, storing long CIGARs in their text form. I mildly prefer CG:B,I because it makes code faster (moving a long CIGAR can be achieved with memmove calls) and we usually don't see this tag written to a SAM file.

[EDIT for clarity]

peterjc commented 7 years ago

@lh3 you said in SAM, but I presume you mean in BAM here?:

"For an alignment with a long CIGAR in SAM, we mark the alignment as unmapped (flag 0x4), set ..."

i.e. There is no change when such long mapped reads are stored in SAM format which has no limits on the number of CIGAR operators.

[update - fixed]

lh3 commented 7 years ago

@peterjc see edit for clarity.

peterjc commented 7 years ago

@lh3 that's better now, thanks!

jkbonfield commented 7 years ago

Thumbs up to this idea.

I think it's a workable way of addressing the problem in BAM. For SAM and CRAM no change needs to be made. I agree CG:B,I is appropriate as this field should never be present in SAM or CRAM, only BAM (we could even state that in the specification) in which case the normal BAM specific encoding is the natural way to store it.

As per the SAM spec:

If [flag] 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ and bits 0x2, 0x100 and 0x800

By setting flag 0x4 we aren't technically breaking the specification, but we are adding some meaning to those fields as the flag bit field may be removed, resurrecting those other fields once more. We'd need some footnote to explain this caveat.

yfarjoun commented 7 years ago

unmapped will not work since various formats (CRAM?) will throw away mapping information if they see an unmapped read...so you'll lose reference, position, strands...not good.

So I'm seeing a problem where we first write to bam and then convert to cram. There will need to be some logic introduced to capture the difference in behavior. Can't we leave the read as mapped (which is more correct) but use a null CIGAR? Is that legal?

On Wed, Jul 5, 2017 at 11:36 AM, James Bonfield notifications@github.com wrote:

Thumbs up to this idea.

I think it's a workable way of addressing the problem in BAM. For SAM and CRAM no change needs to be made. I agree CG:B,I is appropriate as this field should never be present in SAM or CRAM, only BAM (we could even state that in the specification) in which case the normal BAM specific encoding is the natural way to store it.

As per the SAM spec:

If [flag] 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ and bits 0x2, 0x100 and 0x800

By setting flag 0x4 we aren't technically breaking the specification, but we are adding some meaning to those fields as the bit field that wasn't previously there. We'd need some footnote to explain this caveat.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/40#issuecomment-313140465, or mute the thread https://github.com/notifications/unsubscribe-auth/ACnk0vsLodH9GyQAiRZztG_Y3CRPMARJks5sK61wgaJpZM4ChFgT .

lh3 commented 7 years ago

@jkbonfield Note that when converting a BAM with the CG tag to SAM, older htslib will write this tag to SAM because it doesn't know this is a special tag intended for BAM only. For such a SAM, it would be good if the SAM-to-CRAM converter can move the real CIGAR at the CG tag back to the right place.

@yfarjoun CIGAR can't be null for mapped reads. New htslib/htsjdk should remove this 0x4 flag once the alignment is read into memory. It is ok if old htslib/htsjdk loses alignments with long CIGARs.

jkbonfield commented 7 years ago

@yfarjoun the problem doesn't exist in SAM or CRAM as they can cope with >65535 cigar ops. The point here is on reading BAM the API will automatically reverse the bam-only work around so converting BAM to CRAM will regenerate the correct data anyway.

The only issue that arises is using old software to convert worked-around-BAM to CRAM as that may lose some data. It'll retain the auxiliary fields and so be able to regenerate CIGAR again, plus reference and position is obviously retained too as CRAM can handle "placed but unmapped" data, but I don't recall what happens with other flag bit fields like strand. Edit "may lose" - I'm not actually sure it "will lose". We'd have to experiment on that one.

lh3 commented 7 years ago

Here is an example SAM file. It contains only two alignments. The first has 67,601 CIGAR operations. The second has a short CIGAR. You can use it for testing.

lh3 commented 7 years ago

I have submitted a PR (samtools/htslib#560) on the latest proposal.

sjackman commented 7 years ago

In a BAM file, I suggest marking the read with a long CIGAR string as mapped, setting the CIGAR field to *, and storing the CIGAR operations in the CG optional tag as suggested above. The SAM file needn't be adapted at all, as it can already support long CIGAR strings.

Older htslib loses such long alignments, but it can't work with them anyway.

Silently ignoring just some alignments but not all could result in surprising, unexpected behaviour for those users with an outdated htslib. I'm not terribly keen on co-opting the unmapped flag in this manner.

lh3 commented 7 years ago

With my proposal, older samtools and htslib should work in the desired way: they (at least the sam/bam part) will simply carry the CG tag and all chr/pos/flag/mapq/tags along with all operations. Older versions can't access records with >64k cigars, but they won't throw data away. Whenever you process the BAM with a proper htslib, you will get all information back.

As to other libraries, if they follow the spec, they are expected to have a similar behavior to htslib (though as @jkbonfield pointed out, the spec doesn't say about the strand bit).

I suggest marking the read with a long CIGAR string as mapped, setting the CIGAR field to *

This will crash older samtools and htslib as I remember, and will lead to undefined behavior according to the sam spec.

yfarjoun commented 7 years ago

I really think that marking a read as unmapped in order to encode a large cigar is a bad idea.

👎

It leads us down the wrong direction by making the program need to make complicated inferences. For example: To check that a read is valid, normally one would (among many other things) need to check that "MateUnmapped" in the read matched "readUnmapped" in the mate. With this suggestion I don't even know what the MateUnmapped should be...

Here are other options that do not encode the read incorrectly:

  1. Use a sam-flag (I know that we only have a few left...but it seems better than misusing an existing one...)
  2. Use a 0 or 65535 in the length of the cigar (in BAM) to imply that there's a CG tag that needs to be viewed.
  3. Use a new cigar operator ("Z"? "-"? "~" whatever! ) to mean "there are extra cigar operations in the in the CG tag, look there too!"

I prefer a method that will break old code (in the presence of a large CIGAR) because I do not want an old program to silently mis-process files with large cigars (as the suggestion here will cause)

colinhercus commented 7 years ago

Even though Hengs suggestion may not crash older programs it could produce incorrect results. Novosort MarkDuplicates (Perhaps there's no need to mark duplicates on nanopore reads but I think that's irrelevant) will see unmapped flag and grab first 15bp of the read sequence as the signature. Even just sorting without mark duplicates we expect (insist) an unmapped read with a POS & RNAME matches it's mate including the strand. I need to check what would happen with a single end read that is unmapped with RANME&POS set but we may just sort it to the end with other unmapped reads. Mark duplicates also checks CIGAR for soft clipped bases so marking as mapped with zero length CIGAR will not function the way it's intended. I would prefer a change that breaks old code and that the break happens very early in processing, not when we hit the first read with a long CIGAR.

On 10 July 2017 at 09:53, Yossi Farjoun notifications@github.com wrote:

I really think that marking a read as unmapped in order to encode a large cigar is a bad idea.

👎

It leads us down the wrong direction by making the program need to make complicated inferences. For example: To check that a read is valid, normally one would (among many other things) need to check that "MateUnmapped" in the read matched "readUnmapped" in the mate. With this suggestion I don't even know what the MateUnmapped should be...

Here are other options that do not encode the read incorrectly:

  1. Use a sam-flag (I know that we only have a few left...but it seems better than misusing an existing one...)
  2. Use a 0 or 65535 in the length of the cigar (in BAM) to imply that there's a CG tag that needs to be viewed.
  3. Use a new cigar operator ("Z"? "-"? "~" whatever! ) to mean "there are extra cigar operations in the in the CG tag, look there too!"

I prefer a method that will break old code (in the presence of a large CIGAR) because I do not want an old program to silently mis-process files with large cigars (as the suggestion here will cause)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/40#issuecomment-313982676, or mute the thread https://github.com/notifications/unsubscribe-auth/ABcc-j3ECwpqmi87_Cj-O_h5JDenyOh8ks5sMYQrgaJpZM4ChFgT .

lh3 commented 7 years ago

First of all, as I said, htslib and samtools will have the desired behavior. If you follow the sam spec, your implementation is likely to have the right behavior, too.

For example: To check that a read is valid, normally one would (among many other things) need to check that "MateUnmapped" in the read matched "readUnmapped" in the mate. With this suggestion I don't even know what the MateUnmapped should be...

If this is your only concern, you don't need to worry too much because it is very unlikely that long reads will have mates. When long reads get mates some day, we will probably need to revamp SAM. @colinhercus: the same is applied to MarkDuplicates.

Use a sam-flag (I know that we only have a few left...but it seems better than misusing an existing one...)

Do you mean a bam-only flag? This leads to an undefined behavior. Different tools may have different reactions. Older samtools/htslib silently reads bits not defined in the spec.

As to the other two options, remember that this is a BAM-only change. The BAM reader reads a blob of data from disk and usually does not decode it until the blob gets used. When you create a BAM record not interpretable by old tools, you may crash them in bizarre ways depending when they check the integrity and how they access the blob, which is worse than my proposal.

I prefer a method that will break old code (in the presence of a large CIGAR) because I do not want an old program to silently mis-process files with large cigars (as the suggestion here will cause)

The only relatively safe way (still not guaranteed way) to crash old tools is to use a new BAM magic. However, as is discussed above, you don't know if the input stream contains a long CIGAR or not, so you can't write "BAM\2" before hand. In addition, my proposal is almost working (is working with old htslib/samtools). I don't see the need to change the BAM magic number.

lh3 commented 7 years ago

It leads us down the wrong direction by making the program need to make complicated inferences. For example: To check that a read is valid, normally one would (among many other things) need to check that "MateUnmapped" in the read matched "readUnmapped" in the mate. With this suggestion I don't even know what the MateUnmapped should be...

@yfarjoun actually, you don't need to worry about the logic of mate flags at all. This artificial 0x4 flag is on disk only. For new tools, a BAM record will NOT have the 0x4 flag once it is read into memory. All you need to change is the low-level BAM reader (done in ~50 lines for htslib). No high-level code needs to be touched.

PS: more explanations:

In memory, a record has a long cigar and does not have this artificial 0x4 flag. In my PR, htslib writes 0x4 to disk, but does not modify the record in memory. When you read a BAM record with artificial 0x4, htslib seamlessly removes it and gives you an in-memory record as if BAM supports long cigar. In all, end users and high-level code never see artificial 0x4 flags. Your mate flag example never occurs if you modify htsjdk the same way as htslib.

colinhercus commented 7 years ago

The problem isn't new tools that have been updated to handle this hack but old tools that may do the wrong thing with these BAMs.

On 10 July 2017 at 11:53, Heng Li notifications@github.com wrote:

It leads us down the wrong direction by making the program need to make complicated inferences. For example: To check that a read is valid, normally one would (among many other things) need to check that "MateUnmapped" in the read matched "readUnmapped" in the mate. With this suggestion I don't even know what the MateUnmapped should be...

@yfarjoun https://github.com/yfarjoun actually, you don't need to worry about the logic of mate flags at all. For new tools, a BAM record will NOT have the 0x4 flag once it is read into memory. All you need to change is the low-level BAM reader (done in ~50 lines for htslib). No high-level code needs to be touched.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/40#issuecomment-313995720, or mute the thread https://github.com/notifications/unsubscribe-auth/ABcc-s9FwcJbYEUbR9SwnQg0wuxUEgYxks5sMaBSgaJpZM4ChFgT .

lh3 commented 7 years ago

@colinhercus The mate flag example raised by @yfarjoun is really about new tools. For old tools, I am certain we won't see long reads with mates in the next 5 years (probably forever). When we have such data, most users will have migrated to new versions and won't have issues.

We have a practical problem with BAM and we need a solution right now. I admit my solution is not perfect, but I don't see better ones. We shouldn't block the issue only because of hypothetical data types that have never been produced practically.

colinhercus commented 7 years ago

We can update our programs to handle whatever you decide but we will have to QA our existing programs on this new data to make sure there's no unexpected surprises.

On 10 July 2017 at 12:33, Heng Li notifications@github.com wrote:

@colinhercus https://github.com/colinhercus The concern raised by @yfarjoun https://github.com/yfarjoun is really about new tools. For old tools, I am certain we won't see long reads with mates in the next 5 years (probably forever). When we have such data, most users will have migrated to new versions and won't have issues.

We have a practical problem with BAM and we need a solution right now. I admit my solution is not perfect, but I don't see better ones. We shouldn't block the issue only because of hypothetical data that have never been produced practically.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/40#issuecomment-313999512, or mute the thread https://github.com/notifications/unsubscribe-auth/ABcc-qIAySJ2i4mYV0zadFx0VTwj6rDCks5sMamOgaJpZM4ChFgT .

sjackman commented 7 years ago

@yfarjoun wrote…

I prefer a method that will break old code (in the presence of a large CIGAR) because I do not want an old program to silently mis-process files with large cigars (as the suggestion here will cause)

I agree.

@lh3 wrote…

First of all, as I said, htslib and samtools will have the desired behavior. If you follow the sam spec, your implementation is likely to have the right behavior, too.

Although this may be true of samtools view and other simple tools, it won't be true of most tools. An old implementation of samtools mpileup will ignore those reads that have long CIGAR strings if they're marked as unmapped (I believe, correct me if I'm wrong here), and produce seemingly correct results. The analysis will have ignored some fraction of the alignments that happened to have long CIGAR strings. It will silently produce results that look okay at first glance, but are in fact wrong.

I'd prefer that old tools run on a new BAM file produce an error. I suggest setting the BAM CIGAR length to 0. This 0 CIGAR length will break old implementations that try to analyze a file that contain a long CIGAR, which I believe is desirable. The old implementation will hopefully produce an error message about a 0-length CIGAR being invalid. If the program crashes, I think that's better than silently producing incorrect results.

Bumping the SAM/BAM minor version number would indicate that this BAM file may (but not necessarily) contain long CIGAR strings encoded in the CG option.

jkbonfield commented 7 years ago

@yfarjoun If you prefer a method that breaks old cold, then really the best way to do that is simply changing the BAM version number and fix it properly, which forces all old code to fail at the first hurdle! However I think the general consensus is that would be bad.

Looking more at this I think I now prefer @sjackman's suggestion of using CIGAR * as this already fits within the specification. * marks it as unavailable, which for BAM is true in this field as it cannot be encoded. It also solves the problems I've already raised regarding the meaning of the other bit fields which according to the specification become undefined (or implementation defined) the instant we set the unmapped flag. I don't see the need of adding a new flag - this is entirely a BAM encoding problem and not related to the SAM specification. Equally so adding Z to cigar seems unnecessary too.

@lh3 I am unsure of where this would break older htslibs, but if it does then so be it - that's a good thing as they'd have to be pretty ancient I think. Conceptually, cigar * or length 0 (the same thing btw) feels like the right thing to do and is completely within the current specification. The only extension therefore is the additional tag as an alternative encoding.

@colinhercus This wouldn't affect duplicate marking or sorting if the in-memory copy of the data is amended on the fly. Eg consider the process as read BAM record (implicit fix CG->CIGAR) -> process -> output BAM record (implicit fix CIGAR->CG). Recompiled code wouldn't even be able to tell the cigar field has been saved elsewhere. All other solutions would also require changing code and recompiling, so this is no worse.

jkbonfield commented 7 years ago

Regarding cigar * maybe you (@lh3) are recalling this commit: samtools/htslib@4c8fb6661f375690aa8ec85278fba830c302b3bb

However that was a flaw in my CRAM encoding from 2015. Given the relative youth of CRAM compared to SAM/BAM, if you're using CRAM with a 2 year old implementation then you're probably being foolhardy.

There was a much earlier commit by you in samtools/samtools@140d53dfdfe32bfa3ae4b24f1f33d071f366054f which appears to have (incorrectly) equated CIGAR * to be unmapped. I'm having trouble finding where that vanished, but it looks to have been sometime during the samtools to htslib migration. We're looking at 3-4 years ago I think. (Edit: see below)

Edit: I've now found when it appeared in htslib - the warning message had changed text which is why I couldn't find it. I think the first occurrence of this code appearing in htslib with the mysterious "code backup; NOT compiling" commit! samtools/htslib@b463806ef5bc5dcaae7a8663f396d109d656c7a2

Sadly though, it appears this error still exists in htslib, so it's not even old problem. Current htslib develop forcibly adds flag 0x4 to any read with CIGAR *. This is an error. The spec states this as meaning "unavailable", not "unmapped"!

jkbonfield commented 7 years ago

Sadly Picard also can't handle CIGAR * too:

Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing text SAM file. CIGAR must not be '*' if RNAME is specified; File a.sam; Line 2

It's probably still the right thing to do, even if it does mean people have to upgrade their software to cope, as realistically they would need to do that anyway to get meaningful results from the long-read data.

tfenne commented 7 years ago

I'm largely in agreement with @yfarjoun and @colinhercus who, I think, are arguing against using the unmapped flag to indicate this situation. @jkbonfield I think there are three options though:

  1. Encode in such a way that old tools just ignore reads with cigar > 65535 elements
  2. Encode in such a way that old tools fail outright on any new BAM
  3. Encode in such a way that old tools fail if they encounter a BAM file containing reads with cigars with > 65535 elements

I dislike (1) for the reasons others state above - I would prefer that an old tool that cannot handle the data within a BAM file fail rather than silently ignore data. Regarding (2) & (3) I'm on the fence. We've avoided making a non-backwards compatible change for ages now, and it might be worth opening that conversation up to ask what else we might fix if we changed the magic number and version number.

For (3) I like both options of a) having the cigar be empty in BAM (i.e. 0-length or *) and the whole thing being in a separate tag, or b) having the first 65534 elements in the cigar field followed by a new operator/element (e.g. >) which could either have length 0 or the sum of the length of the remaining cigar operators, while the remainder of the cigar is pushed off into a tag.

lh3 commented 7 years ago

I am ok if old tools crash when it sees long cigar in BAM, but I don't think there are effective ways to achieve that (i.e. option 3). Both htslib and htsjdk do enough validations when parsing SAM. You can easily block their SAM parser. However, when parsing BAM, at least htslib doesn't check integrity. If htslib only needs chr and pos, it will read through the BAM as if it is all good. If htslib needs to decode CIGAR and sees n_cigar==0 for a mapped record, I don't know whether it will fail or where it will fail. This is tool dependent, version dependent and library dependent because we are dealing with an undefined behavior.

We have to solve the long-cigar issue this round because we are getting such data. I suggest we set a deadline (e.g. two weeks?) and choose the best solution by then. The earlier we incorporate this change to the spec/htslib, the less likely that future end users will be hit by this forward compatibility issue. @yfarjoun, @sjackman and @tfenne: I am fine for long cigars to crash htslib, but your suggestions so far won't work as I stated above. You need to come up with a consistent way to crash the htslib BAM parser (not only the SAM parser).

yfarjoun commented 7 years ago

a non-compliant cigar operator will not crash htslib?

lh3 commented 7 years ago

a non-compliant cigar operator will not crash htslib?

Not if only chr/pos/flag is needed. For other functionality, we need to check the source code. In some places, it might only lead to bad memory access which does not always cause a straight segfault. At some other places, a new CIGAR might be recognized as 'M', due to the BAM_CIGAR_TYPE flag (EDIT: no, not 'M'. When BAM_CIGAR_TYPE is in action, a new cigar operator will behave like 'P', 'H' and 'B' operators. I have little idea about what this will cause in downstream code).

yfarjoun commented 7 years ago

in that case, I'd like to modify @tfenne 's list to the following options:

  1. Encode in such a way that old tools just ignore reads with cigar > 65535 elements
  2. Encode in such a way that old tools might fail outright on any new BAM
  3. Encode in such a way that old tools might fail if they encounter a BAM file containing reads with cigars with > 65535 elements

IOW, failure of old tools should not be a requirement of the solution, but we shouldn't avoid such a solution if that is a risk. So I do not think that the mandate

You need to come up with a consistent way to crash the htslib BAM parser (not only the SAM parser).

is valid.

A bigger risk in my mind is that we start amassing data which is incorrect (e.g. a mapped read with an unmapped flag) and breaks a basic assumption in the spec:

Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800.

===

In terms of the different options that were suggested, (but please keep them coming, if people have more ideas) I think that no data is a better than partial data (in a specific read) so I would prefer the parser to have no CIGAR data for a read, than to read the first 65K elements and neglect to read the remaining few because we split the cigar across multiple places. Thus, my preference would be that in case of a long cigar, it be placed in a tag in it's entirety, and that the traditional cigar spot be marked with size 65535 but that would be code for "look elsewhere". This means that if a program misinterprets this and tries to parse 65K cigar operators where there are none, it will get garbage and will soon crash. The actual cigar, will then be in the CG tag. I think that this is option 3.

While I agree that we should be compiling a list of breaking spec changes we would like to introduce to the next version of BAM, I'm not sure that this particular issue should wait on that. (option 2, I think)

jkbonfield commented 7 years ago

a non-compliant cigar operator will not crash htslib?

It won't crash, but it will hard error. A SAM file with 100Z (just an example) shows:

[E::sam_parse1] unrecognized CIGAR operator
[W::sam_read1] parse error at line 2

On a BAM file, recent htslibs will turn unknown characters to ?, but continues without error unless something happens to subsequently use CIGAR and fails. Prior to the fix for samtools/htslib#546 these had the potential to read past the fixed BAM_CIGAR_STR and return a more arbitrary and random looking character instead.

Personally I dislike adding a new character because largely I see it as unnecessary. It's purely to work around a local BAM format issue and shouldn't really be requiring SAM spec changes.

Also, the fact that (wrongly IMO) htslib changes mapped (according to flag 0x4) to unmapped when it sees CIGAR "", disobeying the statement in the SAM spec that the flag is the primary source and everything else follows on from that, it will transparently accidentally turn @sjackman's proposal (cigar ) into @lh3's proposal (unmapped flag). Whether that's a good thing or not I don't know!

lh3 commented 7 years ago

Setting n_cigar==65535 is the best option to crash htslib so far, but it is not good enough. For functionality that only looks at cigar, htslib won't simply crash. It will read into SEQ and QUAL and return absurd alignment lengths. What will happen next depends on the downstream code. You might get weird results without seeing a segfault. We wouldn't want this happen.

Regardless of what the spec say, we are dealing with the combination of 0x4 and cigar=="*" everyday: an unmapped mate is flagged with 0x4, may be on reverse strand and has coordinate but not cigar. This is almost the same as the use of 0x4 in my proposal except the pairing flag. Most tools that work with unmapped mates will have the desired behavior in my proposal.