samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

Developing a StrictSam standard #281

Open yfarjoun opened 6 years ago

yfarjoun commented 6 years ago

As discussed on the last and penultimate call, there's a desire to standardize a "Strict Sam" specification.

There's already a stub of an implementation here: https://github.com/d-cameron/hts-specs/blob/strict/SAMstrict.tex

@d-cameron

I didn't know how to respond to your request that we take a look at your initial branch...so I am opening this issue for a discussion...if you have a better idea, I'm game.

yfarjoun commented 6 years ago

I think that overall the start is good. I think it is overzealous in places (for example, forbidding @SO and @GO together, and also the requirement that queryname be used as a @GO and not a @SO, among others...) but that should be a discussion for when a PR is prepared.

I think it would be good to agree on an identifier for each "rule", something like

Every SAM record with an non-* RNAME must have a corresponding @SQ header with matching SN. (NON_STAR_RNAME_MUST_HAVE_SN_IN_HEADER)

so that each rule can be referred to unambiguously in discussion and in validator output.

Also, on the call, we discussed tests. I'm wondering if we should have SAM files (as individual files!) that validation-program authors can use to check their programs. For each SAM file, it should be clear which validations (if any) are expected to fail.
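Such a per-file test corpus might look something like the following sketch. All names here (the EXPECTED table, the file names, and run_validator) are hypothetical illustrations, not part of any spec:

```python
# Hypothetical sketch of a validator test harness: each test SAM file is
# paired with the exact set of rule IDs it is expected to violate, so any
# validator implementation can be checked against the same corpus.
EXPECTED = {
    "missing_sq_header.sam": {"NON_STAR_RNAME_MUST_HAVE_SN_IN_HEADER"},
    "valid_minimal.sam": set(),
}

def check_validator(run_validator):
    """run_validator(path) -> set of rule IDs that failed for that file."""
    mismatches = []
    for path, expected in sorted(EXPECTED.items()):
        got = run_validator(path)
        if got != expected:
            mismatches.append((path, expected, got))
    return mismatches
```

A validator passes the corpus when check_validator returns an empty list.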

jkbonfield commented 6 years ago

I'd like to see levels of strictness. Something like INVALID, NON_COMPLIANT and INCOMPLETE.

There are clearly some things which we can define as broken / invalid because they are basically nonsensical.

Non-compliance covers things like auxiliary tags in the reserved namespace. Sadly this is very common, which is why it should be a separate category. Similarly, there is some looseness around the definition of TLEN; the specification is quite clear, but a huge portion of commonly used programs do something subtly different. While technically these are failures, if we label them all as invalid then we lose discriminating power, as most data in NCBI / ENA would then be labelled invalid - not very useful in a diagnostic tool.

For incomplete data we have valid SAM, but SAM that may cause problems for fragile tools - for example, a range-query subset where mate or supplementary records are absent.

I view these largely as groupings of the failure-ID Yossi listed above (NON_STAR_RNAME_MUST_HAVE_SN_IN_HEADER etc). We could explicitly give a list of the strict IDs we want to check or exclude, or instead give a category which is essentially an "OR" operation on a large set of IDs.
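A category-as-OR-of-IDs scheme could be as simple as the following sketch. The level names come from the comment above; the member rule IDs (other than NON_STAR_RNAME_MUST_HAVE_SN_IN_HEADER) are invented for illustration:

```python
# Sketch only: strictness categories defined as sets of individual rule
# IDs, so testing a category is an OR (set intersection) over its members.
LEVELS = {
    "INVALID": {"NON_STAR_RNAME_MUST_HAVE_SN_IN_HEADER",
                "CIGAR_SEQ_LENGTH_MISMATCH"},
    "NON_COMPLIANT": {"RESERVED_TAG_NAMESPACE_USED", "TLEN_MISMATCH"},
    "INCOMPLETE": {"MATE_RECORD_ABSENT"},
}

def fails_level(failed_ids, level):
    # True if any rule ID belonging to this category fired.
    return bool(failed_ids & LEVELS[level])
```

This keeps individual rule IDs as the unit of specification while letting tools select whole categories, or explicit ID lists, as their acceptance criteria.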

d-cameron commented 6 years ago

One of the reasons I requested feedback was that I wasn't sure of the best way to label the errors; several options are possible.

I'm currently leaning towards a hierarchy (matching the document format hierarchy) with tags as well.

I think that it is overzealous at places (for example forbidding @SO and @GO together, and also the requirement that queryname be used as a @GO and not a @SO, among others...)

That particular one I pulled straight from Section 2 (Recommended Practice for the SAM Format) of the SAM specs.

I'd like to see levels of strictness. Something like INVALID, NON_COMPLIANT and INCOMPLETE.

The use case I had in mind was for program/archive input validation. A given program would specify the conformance rules that it requires. BreakDancer would require basic sanity checks and mate record consistency checks to pass but not care about any of the standard tags checks since it doesn't use them. GRIDSS would be much more particular and also require all StandardTags.SA.* checks to pass. EMBL might care only about completeness and data consistency of primary alignments.

Similarly there is some looseness around the definition of TLEN; the specification is quite clear, but a huge portion of commonly used programs do something subtly different. While technically these are failures, if we label all these as invalid then we lose discrimination power as most data in NCBI / ENA is now labelled as invalid - not very useful in a diagnostic tool.

Even without including problematic checks such as TLEN (which incidentally you can't properly check since an indel-aware program can write a correct TLEN that doesn't match either of the simple definitions), I expect most data to fail at least one check and that won't actually be problematic.

I envisage something like the following:

$ validatesam --summary mybam.bam
[RED]    5 headers are malformed
[RED]    1,000 of 1,000,000,000 records fail basic sanity checks #SanityCheck
[ORANGE] 5 headers are missing required fields as per specifications
[ORANGE] some other quite bad class of error
[BLUE]   1 record has a bad tag that most people don't use
[PURPLE] file uses undefined standard tags QQ
[GREEN]  1,000,000,000 of 1,000,000,000 have mismatched TLEN

I'll keep fleshing out more rules then make a PR and we can have an in-depth discussion how to name and organise them.

jkbonfield commented 6 years ago

We have two distinct things here.

  1. A new format derived from SAM but with additional rules.
  2. A validator for the existing format.

The first case I'm uneasy about. I'd be happy with something that states that if a read claims its mate is reverse complemented but its mate isn't, then that is an inconsistency (this is arguably still point 2: a semantic validation on top of the syntactic validation). However, something entirely new like requiring that CIGAR must have at least one operator (disallowing "*") is turning semantically valid SAM files into invalid ones purely for ease of program coding. I don't see this gaining much traction. Better would be tightening up the SAM spec itself, where appropriate, with a corresponding version bump.
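The mate-strand inconsistency described above is easy to state as a check over a pair of records. A sketch, using the FLAG bit values from the SAM spec (the helper name is illustrative):

```python
# FLAG bits from the SAM spec: 0x10 = this segment is reverse-complemented,
# 0x20 = the next segment (mate) is reverse-complemented.
FREVERSE, FMREVERSE = 0x10, 0x20

def mate_strands_consistent(flag_a, flag_b):
    # Each record's claim about its mate (0x20) must match what the mate
    # actually says about itself (0x10), in both directions.
    return (bool(flag_a & FMREVERSE) == bool(flag_b & FREVERSE)
            and bool(flag_b & FMREVERSE) == bool(flag_a & FREVERSE))
```

For example, a record with 0x20 set whose mate lacks 0x10 would fail the check.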

Regarding the second point only, some rules need to be tied to others. For example, if we do not know what version of the SAM specification we have then we cannot validate that the header tags correspond to that version. The "AH" SQ tag appeared in 1.5 and is therefore invalid in a file with "HD VN:1.4", but not invalid in one without an HD line.

Even for StrictSAM, we'd need these rules tied to the SAM version numbers. They should be annotated as such in the text, e.g. "[1.4]" meaning 1.4 and newer, or "[=1.4]" meaning precisely 1.4 and nothing else (I doubt we'd need that, though). We also need to be clear which case we're talking about: a SAMStrict-only rule or a SAM validation rule.
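As a sketch of such a version-tied rule, a validator could gate the AH check on the declared VN. The rule ID is illustrative; per the example above, a file with no @HD line has an unknown version, so the rule cannot fire:

```python
# Sketch: the AH tag on @SQ lines appeared in SAM v1.5, so flag it only
# when an @HD line declares an earlier version. No @HD line means an
# unknown version, in which case nothing is reported.
def check_ah_version(header_lines):
    version, errors = None, []
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "@HD":
            for f in fields[1:]:
                if f.startswith("VN:"):
                    version = tuple(int(x) for x in f[3:].split("."))
        elif fields[0] == "@SQ" and any(f.startswith("AH:") for f in fields[1:]):
            if version is not None and version < (1, 5):
                errors.append("AH_TAG_REQUIRES_SAM_1_5")
    return errors
```

Parsing VN into an integer tuple makes version comparison correct even for multi-digit minors (e.g. a hypothetical 1.14 vs 1.5).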

Comments:

  * Some obvious things are simply restating the main SAM spec, such as SQ tags SN and LN being mandatory and matching specific regular expressions. These are fine, and it's good to be able to give a failure a symbolic name.

  * In the StrictSAM you make the HD line mandatory. I'd have loved this to be so when SAM was designed, but sadly it wasn't done and we have thousands of files that this rule would make non-conformant. I'd prefer to have a new SAM version where, say, from 1.6 onwards HD is mandatory. Old files don't get validated so rigorously, but we could run a tool that converts them to 1.6 and does the corresponding validation. That way we get this into the real spec rather than a sibling spec. The same applies for many other strict rules. For example, currently we have "If RNAME is *, no assumptions can be made about POS and CIGAR". I think it's justified for a new SAM spec version to remove the "no assumptions" rule and simply dictate them as 0 and *.

  * Limiting SQ length to 2Gb is arbitrary. Similarly for POS. SAM has no limit as it is entirely textual. BAM and CRAM do, but I could easily see a CRAMv4 where these are 64-bit fields instead. I see no reason to limit the textual version. If we wish to have a BAM validator then that is a subtly different thing. Perhaps this rule needs to come under a compatibility / warning level rather than something strict.

  * Please use symbolic names for the FLAG values instead of 0x1, 0x2, etc. We can define them in a table before referring to them (I suggest the same ones used in various bits of software).

  * (SAM spec) "Multiple segments in sequencing" (flag 0x1) is hard to understand and needs careful interpretation. A single segment in sequencing may yield multiple primary alignment records in the case of a chimeric or split alignment. In SAM each line is an alignment, not a segment, so the word segment doesn't really have any impact on whether we expect to see it once or not in the file. I therefore think it's hard to have a rule that validates this.

  * The SAM spec states that if 0x1 is set then no assumptions can be made about 0x40 and 0x80. For validation we cannot therefore state that they must be unset, so this needs to be clear that it is a SAMstrict additional rule rather than a SAM validation rule. Similarly for the flag 0x4 rules and RNAME *.

  * CIGAR operators can, in theory, be entirely insertion. A combination of P and I ops permits sequences overlapping a large insertion to be represented within SAM. I don't see how this violates the SAM spec.

  * Similarly, it is valid to have reads starting/ending with insertions (deletions are a bit harder to argue, but maybe possible in some chimeric situations?). (Neighbouring I and D ops are also valid, despite being argued against in the past.)

  * Unmapped reads should have their RNAME/POS identical to one of the original template. (It's possible to have more than 2 segments, although I think this isn't in use for modern techniques.) Again this is a SAMStrict restriction rather than a SAM semantic validation issue, so it needs appropriate markup. Also, I could envisage a joint assembler / aligner that has concluded a read belongs somewhere inside a large insertion at point X, but it doesn't have a proper alignment for it so it "places" it there as unmapped.

  * I don't see why 0x800 (supplementary) requires the existence of an SA tag. The tags are useful, but optional. Tools can trivially just check for the tag's existence, and this is less work than running a new tool to strip out all reads with flag 0x800 and no SA tag, so this doesn't gain us anything.

  * "Each chimeric alignment must have a record with FLAG 0x800 not set" - this comes down to my earlier argument about complete vs incomplete data sets. If we do a range query to produce a new SAM, the above constraint can trivially be violated. Subsets of SAM are still valid SAM and we should never treat them otherwise, as this is an extremely common operation, not least for purposes of parallel processing. A validator run on the entire data set can, however, be in a position to complain about such things.

yfarjoun commented 6 years ago

It might be useful to tag SamStrict constraints that might be violated by subsetting. That would enable a smarter validation tool that can be told to ignore that set of restrictions during the parallel part of the pipeline. Then, when the final file is put together, one could run the validator again without that flag.

d-cameron commented 6 years ago
  1. A new format derived from SAM but with additional rules.

I've attempted to avoid making a new spec, and merely forcing resolution of ambiguities and semantic consistency. Almost all of the additional rules that you are concerned about are actually in the SAM specs themselves.

  2. A validator for the existing format.

It became pretty clear that the document was going to turn into a checklist for a validator so I've been including base SAM validator rules in there as well (with a \samv15 tag).

Some obvious things which are simply restating the main SAM spec, such as SQ tags SN and LN being mandatory and matching specific regular expressions. These are fine and it's good to be able to give a failure a symbolic name.

Hence \samv15 on about half of the rules I’ve made thus far.

In the StrictSAM you make the HD line mandatory. I'd have loved this to be so when SAM was designed, but sadly it wasn't done and we have thousands of files that this rule would make non-conformant. I'd prefer to have a new SAM version where, say, from 1.6 onwards HD is mandatory. Old files don't get validated so rigorously, but we could run a tool that converts them to 1.6 and does the corresponding validation. That way we get this into the real spec rather than a sibling spec. The same applies for many other strict rules. For example, currently we have "If RNAME is *, no assumptions can be made about POS and CIGAR". I think it's justified for a new SAM spec version to remove the "no assumptions" rule and simply dictate them as 0 and *.

These (and most of the others that you consider a sub-spec) have been taken directly from section 2 of the SAM specifications themselves. I've pulled in everything in section 2 of the SAM specs wholesale as these are all things the SAM specifications themselves already consider 'recommended practice'. What was the point behind section 2 to start with? Clearly it is considered part of the existing SAM specs as it is in the document, but why have it in there at all if you're uncomfortable in enforcing these recommendations? Would you be happy if I labelled all rules derived from section 2 of the specs with a \samv15bestpractice tag? Should SAM strict just replace Section 2 of the SAM specs?

Limiting SQ length to 2Gb is arbitrary. Similarly for POS. SAM has no limit as it is entirely textual. BAM and CRAM do, but I could easily see a CRAMv4 where these are 64-bit fields instead. I see no reason to limit the textual version. If we wish to have a BAM validator then that is a subtly different thing. Perhaps this rule needs to come under a compatibility / warning level rather than something strict.

Section 1.4 of the SAM specs explicitly restricts POS to [0, 2^31 - 1].

I'd like to add additional validation warnings, so that e.g. a reference longer than 512Mb would generate a warning that you can't make a .bai index from it.

Please use symbolic names for the FLAG values instead of 0x1, 0x2 etc. We can define them in a table before referring to them (suggest the same ones used in various bits of software).

Good idea.

(SAM spec) "Multiple segments in sequencing" (flag 0x1) is hard to understand and needs careful interpretation.

A single segment in sequencing may yield multiple primary alignment records in the case of a chimeric or split alignment.

Multiple non-secondary alignment records, yes. Section 1.4.2 does define and constrain a "primary line" and a "supplementary line" with respect to chimeric alignments.

In SAM each line is an alignment, not a segment, so the word segment doesn't really have any impact on whether we expect to see it once or not in the file. I think it's hard therefore to have a rule that validates this.

We can constrain it to make sense though. 0x1, 0x2, 0x4, 0x8, 0x40, 0x80 all refer to segments, and SAM section 1.4.2 point 1 requires that each "read" has one "primary line". The SAM specs themselves are indeed a bit inconsistent about what is a "segment" and what is a "read", so I interpreted "read" as "segment", which I think is a reasonable position to take. If a record says it's part of a paired-end read from a template, then we can validate that the other "primary line" actually exists, regardless of whether there are any additional secondary or supplementary alignments.

The SAM spec states if 0x1 is set then no assumptions can be made about 0x40 and 0x80. For validation we cannot therefore state that they must be unset, so this needs to be clear that it is a SAMstrict additional rule rather than a SAM validation rule. Similarly for the flag 0x4 rules and RNAME *.

I've tagged all the SAM rules as \samv15.

CIGAR operators can, in theory, be entirely insertion. A combination of P and I ops permits sequences overlapping a large insertion to be represented within SAM. I don't see how this violates the SAM spec.

Section 1.4.4 defines POS in terms of the first CIGAR operation that "consumes" a reference base. An insertion CIGAR does not consume any reference bases and thus does not have a POS. I agree that 100I is a really useful CIGAR to have, but as the specs currently stand, POS is meaningless for such records.
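The "consumes reference" distinction can be made concrete as a sketch. Per the operator table in the SAM spec, M, D, N, = and X consume reference bases, while I, S, H and P do not (the helper name is illustrative):

```python
import re

# Only M, D, N, = and X advance the reference coordinate. A CIGAR such as
# 100I or 10P90I therefore consumes no reference bases, which is why POS
# is ill-defined for such a record under the current wording.
REF_CONSUMING = set("MDN=X")

def consumes_reference(cigar):
    if cigar == "*":
        return False
    return any(op in REF_CONSUMING
               for op in re.findall(r"\d+([MIDNSHP=X])", cigar))
```

An entirely-insertion alignment like 100I fails this check even though every individual operator is legal.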

Similarly it is valid to have reads starting/ending with insertions (deletions are a bit hard to argue, but maybe possible in some chimeric situations?). (Neighbouring I and D ops are also valid, despite being argued against in the past.)

Unmapped reads should have their RNAME/POS identical to one of the original template. (It's possible to have more than 2 segments, although not in use for modern techniques I think.) Again this is a SAMStrict restriction rather than a SAM semantic validation issue, so needs appropriate markup. Also I could envisage a joint assembler / aligner that has concluded a read belongs somewhere inside a large insertion at point X, but it doesn't have a proper alignment for it so it "places" it there as unmapped.

In that case, we can argue that it should be mapped with an insertion CIGAR, not unmapped to a random location in the genome.

Technically, one could argue that sequencing techniques such as Chromium are multi-segment, but their random locations don't fit the SAM multi-segment model anyway, so it's a moot point.

I don't see why 0x800 (supplementary) requires the existence of an SA tag. The tags are useful, but optional. Tools can trivially just check for the tag's existence, and this is less work than running a new tool to strip out all reads with flag 0x800 and no SA tag, so this doesn't gain us anything.

I enforced this because otherwise multiple secondary chimeric alignments could not be resolved unambiguously. This is somewhat a symptom of the debate a few years back about whether alignments come in sets of chimeric alignment records (the interpretation I used), or as nodes in an alignment graph. Happy to drop the requirement down to requiring SA tags on all or none of the records for a given read.
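The relaxed "all or none" requirement floated above could be checked per read name along these lines (a sketch; the (qname, has_sa_tag) input shape is an illustrative simplification of real SAM records):

```python
from collections import defaultdict

# Sketch: for each read name, either every alignment record carries an SA
# tag or none does. Returns the read names that violate this.
def sa_all_or_none_violations(records):
    by_read = defaultdict(list)
    for qname, has_sa in records:
        by_read[qname].append(has_sa)
    return sorted(q for q, flags in by_read.items()
                  if any(flags) and not all(flags))
```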

"Each chimeric alignment must have a record with FLAG 0x800 not set" - this comes down to my earlier argument about complete vs incomplete data sets. If we do a range query to produce a new SAM, the above constraint can trivially be violated. Subsets of SAM are still valid SAM and we should never treat them otherwise, as this is an extremely common operation, not least for purposes of parallel processing. A validator run on the entire data set can, however, be in a position to complain about such things.

Not true. SAM section 1.4.2 point 1 states: "For each read/contig in a SAM file, it is required that one and only one line associated with the read satisfies 'FLAG & 0x900 == 0'." If you have any secondary or supplementary alignments, subsetting your valid SAM will most likely result in a file which is not valid SAM. This particular constraint is part of the SAM specs (and is marked as such), not an additional SAM strict check.
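Run over a complete file, the quoted constraint amounts to the following sketch. This is simplified to single-end reads; for paired data the key would also need the 0x40/0x80 segment flags:

```python
from collections import Counter

# Sketch of SAM section 1.4.2 point 1: per read name, exactly one record
# must satisfy FLAG & 0x900 == 0 (neither secondary 0x100 nor
# supplementary 0x800). Returns read names with zero or multiple primaries.
def one_primary_violations(records):
    """records: iterable of (qname, flag) pairs."""
    primaries = Counter(q for q, f in records if f & 0x900 == 0)
    names = {q for q, _ in records}
    return sorted(q for q in names if primaries[q] != 1)
```

On a range-query subset, a read whose primary line falls outside the queried region would show up as a violation, which is exactly the complete-vs-incomplete tension discussed here.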

From a practical perspective, subsetting is done all the time and we need to be able to deal with it. How does an overall SAM strict recommendation sound that all validators should be capable of performing validation on only a specified subset of genomic regions? Coordinates are required because I don't want to allow arbitrary subsets: random downsampling that loses half your read pairing should be treated very differently from just validating the chr1 records.

d-cameron commented 6 years ago

I've made a PR to continue this as discussing particular rules is much easier with a diff in front of you.