samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
655 stars 172 forks source link

sam: Add PL value "MGIG400" as seen in the wild #718

Closed brainstorm closed 1 year ago

brainstorm commented 1 year ago

The BAM header below was refusing to be parsed by noodles due to the PL field (via htsget-rs), as reported by NESI. This BAM field came from this machine. Non-related fields have been manually masked/anonymized with ****:

% samtools head t001.bam | grep PL
@RG ID:**** SM:**** PL:MGIG400
@PG ID:bwa  PN:bwa  VN:0.7.17-r1188 CL:bwa mem -M -Y -R @RG\tID:***
\tSM:****\tPL:MGIG400 -t 24 /some/nobackup/path/DeepVariantAnalysis/resources/genome.fna data/****.fq.gz data/****.fq.gz

Interestingly, this PL field should have been DNBSEQ but the software machine outputs its model (MGIG400) instead, nowadays? Perhaps the vendor should be notified and prompted to update their tooling instead of introducing this change to the standard.

Tangentially related issue worth exploring along this one: https://github.com/samtools/hts-specs/issues/679

/cc @ohofmann @mmalenic @zaeleus

brainstorm commented 1 year ago

Nevermind, MGI doesn't seem to ship aligners. PL field most probably introduced on third party pipeline downstream erroneously.

jkbonfield commented 1 year ago

It's also worth noting even if this was being generated by MGI and we hadn't previously decided on DNBSEQ, it would be rejected. The problem with MGIG400 is it conflates platform with model, which are two different tags.

jkbonfield commented 1 year ago

We did also discuss this sort of issue in the most recent conference call.

The issue was one of validation. Having an invalid field here does not invalidate the rest of the file. Syntacically it would all make sense. The point was raised what if we check these fields and reject files that don't match, but the specification then gets updated? We haven't updated the SAM version number when we've added extra fields here as the syntax is identical, so programs cannot check that either. That's a valid point. We felt the correct process would be, if validation is performed, to make it a warning only. This fits with the point above that unknown data here does not invalidate any remainder of the file.

You could argue then what's the point of having a controlled vocabulary, as it doesn't stop people from just adding anything there anyway (as demonstrated). We feel there is still merit in having PL as a controlled vocabulary, as it gives vendors and users alike a clue as to what is expected. Without it we're highly likely to get ONT, OxfordNanopore, OxfordNanoporeTechnology, Oxford_Nanopore_Technology, etc. That's even ignoring the issue of case sensitivity. With it, well we may still get invalid fields, but hopefully it is significantly reduced. Note that this also ties in with sequence submissions as the archives have a controlled vocabularly in their schemas.

zaeleus commented 1 year ago

Having an invalid field here does not invalidate the rest of the file.

A file can either be well-formed or not. I'm not sure why you would want or trust a somewhat valid file w.r.t. a specification, especially in the scientific domain.

We haven't updated the SAM version number when we've added extra fields here as the syntax is identical

Appending to a list of known values changes the syntax. PL:(CAPILLARY|DNBSEQ|etc) is not identical to PL:[ -~]+.

We feel there is still merit in having PL as a controlled vocabulary, as it gives vendors and users alike a clue as to what is expected. Without it we're highly likely to get ONT, OxfordNanopore, OxfordNanoporeTechnology, Oxford_Nanopore_Technology, etc.

In this case, the spec shouldn't define them as valid values but as suggested values.

jmarshall commented 1 year ago

The syntax of header lines is described in the first paragraph of §1.3: the fields are TAB-separated, and the line matches the regexp shown (notwithstanding the minor UTF-8-related issue you noted elsewhere). So regardless of whatever characters are in the PL field value, there is no difficulty parsing it: there is a TAB before the PL: and the value extends to (but does not include) either end-of-line or the next TAB, whichever comes first.

The list of keywords in the PL description is a list of semantically valid values. The syntax of the header line is unchanged when this list is appended to, as doing that does not affect parsing of that line or of the rest of that file or even (generally speaking) the interpretation of the rest of the file.

Note for example that ULTIMA was added to the spec when the SAM VN version number was already 1.6. Nonetheless a SAM file that says @HD VN:1.3 // @RG … PL:ULTIMA … is a perfectly valid SAM file. This is very intentional: parsing and understanding (the remainder of) the file is unaffected, so it would be silly for it to be invalid.

We've discussed this any number of times, e.g. on #454. I don't see any particular need to relitigate it, but if we do let's do it on a new open issue rather than a closed WONTFIX based on someone mistakenly specifying MGIG400 on their bwa command line.