samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
633 stars 174 forks source link

sam: Clarify case-sensitivity of read group platform values #679

Open zaeleus opened 1 year ago

zaeleus commented 1 year ago

This is regarding Sequence Alignment/Map Format Specification (2022-08-22).

§ 1.3 "The header section" lists valid values for read group platforms:

Valid values: CAPILLARY, DNBSEQ (MGI/BGI), ELEMENT, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT (Oxford Nanopore), PACBIO (Pacific Biosciences), SOLID, and ULTIMA.

Given the typewriter (tt)/monospace font used for these values, I interpret these are to be parsed as is, i.e., they are case-sensitive. However, there seems to be many files and datasets that failed to follow the spec in this way. For example, the 1000 Genomes Project uses the lowercase form of ILLUMINA:

$ samtools head https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00096/high_coverage_alignment/HG00096.wgs.ILLUMINA.bwa.GBR.high_cov_pcr_free.20140203.bam | grep "^@RG"
@RG ID:H7AGF.1  PL:illumina PU:H7AGFADXX131213.1    LB:Solexa-206008    SM:HG00096  CN:BI
@RG ID:H7AGF.2  PL:illumina PU:H7AGFADXX131213.2    LB:Solexa-206008    SM:HG00096  CN:BI

Are read group platform values supposed to be case-insensitive, or are headers using the lowercase form not spec-compliant?

claymcleod commented 1 year ago

This is an important question that, though simple, could render entire swaths of genomics data out of spec. I would encourage us to consider it carefully.

I, too, have come across many datasets that use the lowercase version of the platform unit (illumina in particular). There is a related discussion on this noodles issue. As I explain in my comment there, unless there is a compelling reason not to, I would recommend not enforcing the case sensitivity on these values.

There are three arguments I'd use to advocate for this position.

  1. Lowercasing the PL is already pervasive in the community. Whereas the spec could be interpreted either way (hence this question), the default behavior of the core libraries in the ecosystem appears to be to allow this behavior. Indeed, I have never run into any issue related to this value until working with @zaeleus's noodles library (of which I am very thankful is spec-compliant and have advocated it remain that way—even at the expensive of much of our data being considered non-compliant within noodles).
  2. Lack of consistency. The other purely descriptive data fields (of which I consider PL) within a @RG header line do not enforce case-sensitivity. I loosely define a data field as purely descriptive if it serves no other purpose than file providence (i.e., the field has no bearing on how to interpret the contents of the file based on the value assigned). Looking through the list, I considered ID, CN, DS, DT, LB, PG, PM, PU, and SM to fall into this category. Note that FO does require case-sensitivity, but I think it makes sense in that case, and I would not consider that purely descriptive.
  3. Lack of value. That I can see, there is no compelling reason to enforce this behavior, as this is merely a descriptive field. Perhaps there is a good reason to enforce this, but I have not been able to think of one. So, from an idealistic standpoint (without the constraints of the ecosystem already allowing this behavior as I describe earlier), I still would not advocate for this behavior.

Note that, if this lowercase values are determined not to be spec compliant, then literally tens of thousands (if not more) files will need to be reheadered to become spec-compliant.

jkbonfield commented 1 year ago

I think it is intended to be case sensitive, and would argue that people treating it as case insensitive are making huge assumptions given there is nothing in the specification to state that. I wouldn't argue that the spec is ambiguous. It states it must be one of these list of words and while not explicitly stating "as written in this case", assuming the spec permits lowercase is reading something which simply doesn't exist.

That said, if it's that dominant in the wild already, then the ship has sailed and the horse has bolted, so I don't see any value in strictly enforcing it. It would be very useful if one of the main data archives could trawl their headers to give us some guidance here. Is it purely 1 single project that chose lowercase, or is this a widespread usage?

Finally, I think we shouldn't advocate changing the spec to make this case insensitive, but rather state it should be written in uppercase but due to historic lack of compliance tools may wish to perform any checking in a case-insensitive manner.

ohofmann commented 1 year ago

It would be very useful if one of the main data archives could trawl their headers to give us some guidance here. Is it purely 1 single project that chose lowercase, or is this a widespread usage?

I'll start looking for Australian Genomics. From previous runs I know bcbio uses lowercase for illumina; unless someone overrides this in the configuration the majority of SAM/BAM files generated by it will not be spec-compliant. Looking at a re-analysis of Australian Genomics data Michael Franklin (@illusional) kindly pointed out that Dragmap itself sets PL:PL0 as a default:

https://github.com/Illumina/DRAGMAP/blob/2e0ec6856bf88f869e89283b650ada53bac75d80/src/include/sam/SamGenerator.hpp#L173

.. which we are of course also seeing downstream. That does not bode well wrt enforcing PL at all, regardless of the upper/lowercase discussion.

jkbonfield commented 1 year ago

Ok that's plain wrong from Illumina there, and completely unhelpful to users. I'll raise an issue with them to fix this. Thanks for reporting it.

jmarshall commented 1 year ago

Certainly the intention is that they be written exactly as shown.

Historically samtools/htslib has not enforced anything here, as it leaves that sort of thing to end-user applications that actually use the PL value. OTOH HTSJDK and hence Picard requires these to be one of the listed keywords, case insensitively, and has always done so. [Edited to add: it turns out that this is only enforced in ValidateSamFile; other Picard tools do not validate PL when merely streaming through the headers.]

So the practical clarification to be made (if any is warranted) would be to say that these values should be written in uppercase, and on reading may be accepted in lowercase.

Files containing that PL0 value will already be rejected by HTSJDK-based tools.

tfenne commented 1 year ago

@jmarshall I don't think that's right about HTSJDK. It didn't sound right to me, so I just took a small SAM file and edited the PL tag to whatever and ran it through the lastest picard's SortSam, which ran to completion without a problem, and carried the PL value through just fine. It's possible that some GATK tools may fail that need to know the platform, but unless there's a super recent change I think HTSJDK is quite permissive.

jmarshall commented 1 year ago

I mostly can't check fully myself at the moment (travelling), so that was just based on a quick check of the code and my recollections. Perhaps the code I saw is only exercised in ValidateSamFile etc. (Search for INVALID_PLATFORM_VALUE.)

claymcleod commented 1 year ago

picard ValidateSamFile run against one of these non-compliant illumina values exits cleanly for me, even with --VALIDATION_STRINGENCY STRICT (which is the default). Note that I did have to run the command with --IGNORE INVALID_MAPPING_QUALITY for the files I was looking at from 1000 genomes, but that shouldn't affect the issue at hand.

Following the chain of execution, it appears that the validation is occurring in the SamFileValidator class, which is retreived from htsjdk: https://github.com/broadinstitute/picard/blob/master/src/main/java/picard/sam/ValidateSamFile.java#L267.

Tracing is all the way back, the relevant lines are here: https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/SamFileValidator.java#L616-L623. You can see that htsjdk actually converts the value to uppercase before doing the check, which is effectively allowing case insensitivity.

EDIT: When using PL:whatever, I indeed get

ERROR::INVALID_PLATFORM_VALUE: The platform (PL) attribute (whatever) + was not one of the valid values for read group
jmarshall commented 1 year ago

Yes, like I said in my first comment, HTSJDK's check on these values is (unfortunately) case insensitive.

The interesting question is what you see with ValidateSamFile for PL:whatever.

claymcleod commented 1 year ago

With respect to the line of logic and after reviewing the spec again, I agree that it's clear that these should be uppercase. I also agree that something needs to be done to practically address the situation, as rigidity here would mean a significant chunk of data out there is just plain out of spec for little practical gain.

I'm mixed on what the solution is here. On the one hand, I think I like the idea in principle about specifying that new values written should be uppercase and reading values should allow lowercase. Practically, I think leaving the guidance at that may be hard to operationalize or may cause further problems down the road.

For example, how should a library report back the value to the user if it reads a lowercase illumina—should it report it verbatim to the user? You can probably see where this might cause problems: if I wrote a simple program to read a SAM file and output the same SAM file, the library would allow me to read the file but not write it. I personally would find that confusing.

There is another potential solution which is that, while PL values can be read as lowercase, they should be converted to uppercase internally by the library before giving the value to a user. This matches, for instance, what htsjdk does, so no changes would be required to make that library spec compliant. Further, I think this is probably cleaner to implement from a library maintainer's perspective rather than trying to two separate valid values: one when read and one when writing. Last, this would effectively "self-heal" these files as they get passed through libraries that handle them appropriately.

The potential downside to this approach is that data changes from underneath the user without their knowing: I don't know if I think that's a good idea or not, and I don't know if there is any precedent for this in the community.

tfenne commented 1 year ago

Without setting stringency to Strict - nothing. At Strict it emits:

ERROR::INVALID_PLATFORM_VALUE:Read name A, The platform (PL) attribute (whatever) + was not one of the valid values for read group
jmarshall commented 1 year ago

@claymcleod: Those library API questions are considerations within implementations, but IMHO they are beyond the spec's domain of interest.

@tfenne: Thanks. So HTSJDK is permissive by default but enforces (case insensitive) valid PL values in strict mode. So from an interoperability POV, the question is whether people testing their SAM files with wacky PL values also validate in strict mode to do so…

ohofmann commented 1 year ago

That said, if it's that dominant in the wild already, then the ship has sailed and the horse has bolted, so I don't see any value in strictly enforcing it. It would be very useful if one of the main data archives could trawl their headers to give us some guidance here. Is it purely 1 single project that chose lowercase, or is this a widespread usage?

I grabbed a (mostly random) assortment of 10,000 BAMs from the Australian Genomics' repository. Of those, 2/3 did not set RG:PL at all; of the remaining 3,736 BAMs we have:

These come form ~10 different sequencing centres/cores and 20 different flagships so at least for AG this is widespread.

claymcleod commented 1 year ago

@claymcleod: Those library API questions are considerations within implementations, but IMHO they are beyond the spec's domain of interest.

This could be true, and that would be up for you all to decide. I was thinking it could be within scope if it was a recommendation made within § 2 “Recommended Practice for the SAM Format”, which contains similar implementation-specific recommendations.

Beyond allowing the lowercase items to be read, I'm rooting for a consistent user experience (or, at least as much as possible) from the ecosystem; my worry is simply that leaving the guidance at "you must write uppercase, but you can read lowercase" would lead to a situation where every implementation handles this in a different way (not much different than today).

jkbonfield commented 1 year ago

I grabbed a (mostly random) assortment of 10,000 BAMs from the Australian Genomics' repository. Of those, 2/3 did not set RG:PL at all; of the remaining 3,736 BAMs we have:

* 151 PL:ILLUMINA

* 3487 PL:illumina

* remaining set to invalid entries, usually a combination of sequencer/lane

These come form ~10 different sequencing centres/cores and 20 different flagships so at least for AG this is widespread.

Ouch! So dominated by out-of-spec tags. I'm guessing it's mainly specific pipelines and software, with some being correct and some not, but the incorrect ones are clearly heavily used in some places. Edit: it's also sad to see most files don't bother documenting this. Data provenance is paramount, but it's pretty woeful in our industry. :(

I think that basically nails it; the ship has sailed already.

Note above I didn't say "you must write uppercase" but "it should be written in uppercase". While we don't explicitly state that we use the RFC 2119 language terms, I think the common parlance is in line with requirement vs recommendation here.

Whether or not a specific tool decides it wants to "fix" incoming data when doing read/write code is an implementation specific detail. Here we simply define the specification and give recommendations. It's up to tools to try and stick to the requirements and make judgement calls on the advisory things, with appropriate documentation. We can of course (and indeed should) give appropriate recommendations where necessary, and this is one of those, but I wouldn't get too bogged down in the fine details of who corrects what and where. If it's at all debateable, it'll just cause arguments and some people won't honour it anyway meaning you get more spec breakage, not less (as we've seen with TLEN in the past, even from the the same author of the spec change and tools!).

In summary my proposal would be to add a footnote to valid values:

The PL field is a controlled vocabulary and should be written in uppercase. However any validation should be case insensitive due to the existance of many non-compliant files.

claymcleod commented 1 year ago

In summary my proposal would be to add a footnote to valid values:

The PL field is a controlled vocabulary and should be written in uppercase. However any validation should be case insensitive due to the existance of many non-compliant files.

I really like this, FWIW.

bpow commented 1 year ago

To be somewhat generous to people who have developed pipelines for NGS for many years, the restriction of enumerated values in the PL field wasn't always part of the specification. So there are SAM/BAM files out there that were compliant with the specification as written at the time they were written.

In another instance of provenance being lost (or at least hard to find), the addition of the PL enumerated values predates the svn -> git migration, so it's hard to find out when. It was probably around v1.2 of the spec, some time in late 2010 from what I can tell from some discussions at the time.

This may be giving non-conformant files too much of a benefit of the doubt (since that was a very long time ago in sequencing technology years), but my 2 cents would also be to be less strict about this in the specification.

While I can see the benefit of a controlled vocabulary here, it also leads to the side effect that a group incorporating a new sequencing platform will not only have to validate that platform but also need to incorporate and validate new versions of other software just so they can use a newer library (htsjdk, noodles, or something else) that uses the updated spec version that includes that new PL enumeration including this technology.

So FWIW, as an external observer/user, I'd favor the "should" for uppercase at the very least so that existing files don't need to be re-headered, and would also advocate for consideration that the used of a controlled vocabulary might be a "should" rather than "must".


Also, as pointed out in the noodles issue discussion, the current wording is ambiguous and could bear clarification one way or the other: "This field should be omitted when the technology is not in this list (though the PM field may still be present in this case) or is unknown."

If this field SHOULD be omitted when the technology is not in this list (as written), are files that incorporate other PL values truly non-conformant? If you really want to restrict to enumerated values, then it would be better to have "This field MUST be omitted when the technology is not in this list..."

jkbonfield commented 1 year ago

We need to be careful to separate two things here - what the spec says is correct and what the implementations do. (And personally I don't view the spec as ambiguous, but accept if others deem it so then I guess it must be!)

For tool writers, it is very useful to have a controlled vocabulary. Is it OXFORD_NANOPORE? OXFORD_NANOPORE_TECHNOLOGY? ONT? ILLUMINA vs SOLEXA? PB vs PACBIO? Etc. It's a total minefield unless an ontology is used. This helps a lot for people trying to fit the incoming data to precomputed training sets (accepting that platform alone may not be perfect, it's still better than no knowledge). So as a data consumer, it's valuable.

However that doesn't mean the software tools should strictly enforce these things. Maybe if the tool absolutely needs to know the platform, then sure, but sensibly it'd request the user to specify a command line parameter instead, or fall back to a default. However this repo is focusing on the specification and not specific implementations. It's not really our place to dictate how strict implementations can be, but language such as should can give useful guidance.

Regarding adding new terms to the specification, #651 and #662 show it's a simple process if the vendors engage, which we strongly encourage them to do. Getting these added to tools that care (which obviously most won't) is up to the maintainers of those tools, but as you point out it's probably in their best interest to offer some degree of flexibility so they can cope with new instruments arriving on the scene.

jmarshall commented 1 year ago

This Git repository contains the *.tex and SAM1.pages history from the old samtools subversion repository.

The list of PL values was not present in the SAM1.pages document (from 2009), which just said “PL — Platform/technology used to produce the read”. (The final version of this identified itself as “Version 0.1.2-draft (20090702)” and defined VN:1.0.)

SAMv1.tex, which replaced the Pages document with a LaTeX rewrite, was composed mostly offline by Heng Li in July 2010, as recorded in this samtools-devel thread. Most of this flurry of activity existed only on Heng's laptop of the time. The first version of this that discusses SAM headers appeared in this posting, is committed as 07dc1c67a717a6c5cc9d65eeb8b3c99612744cde, identified itself as “v1.3 draft” (and an update later the same day defined VN:1.3), and says

PL — Platform/technology used to produce the read. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO.

So this list of valid PL values has been in the spec since July 2010, for the entire lifetime of the LaTeX SAM specification.

As James notes and as mentioned in https://github.com/samtools/hts-specs/pull/454#issuecomment-543623293 (and elsewhere in that thread; see also that thread for further discussion of PL upper/lower case), the reason for controlling the vocabulary here is so that tools can infer the underlying technology's error model etc without having to look for 18 different spellings/capitalisations of ILLUMINA/HISEQ/etc.

I personally agree that it is somewhat unfortunate if implementations contain a list of values from the spec at a particular point in time, and validate against that list. OTOH implementations doing that sort of thing is about the only reason there is any consistency in these values and hence why analysis tools can depend on that PL value being “ILLUMINA” (case insensitively, sigh) rather than “SOLEXA” or some other variant.

It is interesting to compare @SQ-PL to @HD-SO, which contains a similar list of valid values — this time listed all in lowercase and similarly without any commentary pointing that out. However I suspect you would find that essentially all SAM files that specify e.g. @HD VN:1.3 SO:coordinate do so in lowercase — because Picard validates this field and accepts only lowercase as written in the spec (in strict mode; in other modes SO:COORDINATE would be turned into unknown).

For @SQ-PL, Picard validates it only in ValidateSamFile and accepts either upper or lowercase (for reasons that are lost in the mists of time; see https://github.com/samtools/hts-specs/pull/454#issuecomment-553941441). Hence where we are today.

Re “This field MUST be omitted when […]”-style wording, note that (1) SAM is not a communications protocol and the SAM spec does not explicitly use the RFC 2119 shouty verbs; (2) the spec is a compromise with reality, in which there is a dominant implementation whose ValidateSamFile command by default still does not accept @RG headers that do not have a PL field.

bpow commented 1 year ago

@jkbonfield -- I completely agree that getting a new platform added to the list of acceptable terms is an easy process. The potential concern with tying the terminology directly to the spec version (and this may be a devil's advocate position is like this: Say a lab has validated an extensive software pipeline that could potentially incorporate several libraries that work with hts data. Now they want to use a different platform in a way that shouldn't affect those early steps of the pipeline, but in order to have their bam/cram files be completely spec-compliant and work with the various libraries they've used, they now need to update htslib from version X to Y, htsjdk from V to W, noodles from T to U... If the only thing that changed in each of those libraries was the acceptable values of the PL field, then re-validation could be relatively straightforward, but if between versions V and W of htsjdk there were more extensive changes in behavior, then that process would be harder. To be fair, the re-validation should be in-depth anyway, but since the libraries themselves typically don't use this info (from my understanding), it's fair to expect them to pass it along.

@jmarshall -- thanks for looking into the additional history. I hadn't realized that SAM1.pages was the next step back prior to the tex document.

... the reason for controlling the vocabulary here is so that tools can infer the underlying technology's error model etc without having to look for 18 different spellings/capitalisations of ILLUMINA/HISEQ/etc.

Again, I completely understand the benefits of having a controlled vocabulary, but would also note that the current specification does not require that the technology, as described by the @SQ-PL alone, uniquely identify an error model. So it is possible that a different PM for the same PL could have a different error model, in which case tools might make invalid assumptions anyway.

Regarding the shouty verbs-- I also understand that the SAM/BAM/CRAM formats are not RFCs and aren't explicitly using those verbs-- I was just commenting from the standpoint that those verbs do have a common usage in specifications, such that some readers may expect that the word "should" implies preferred behavior, but that files or tools not following that would not necessarily be out-of-specification.

So to suggest another possibility: maybe the spec could define a strict subset (along the lines of what Picard already does in ValidateSamFile and specify that libraries should allow users to specify whether the strict or lenient subset is permitted. I'm not sure if that's worth the effort, though, or what other fields might be specified as strict or lenient subsets.

FWIW, for all of the pipelines I'm invovled in, I've checked to make sure that the files we produce going forward will use the shouty ILLUMINA, but am not sure when we might get around to reheadering old files. Or if the decision is to go with @jkbonfield 's suggestion then maybe we won't have to.

jmarshall commented 1 year ago

Both SAM maintainers have proposed essentially the same thing (https://github.com/samtools/hts-specs/issues/679#issuecomment-1288920658, https://github.com/samtools/hts-specs/issues/679#issuecomment-1290215228), so that is the likely direction of an eventual decision.

There are still some details to be thrashed out: for example, I would prefer to say “may/should also accept lowercase when reading” rather than “accept case insensitively” so as to avoid explicitly blessing ILlumInA and the like. (Maybe the people who checked their PL archives could comment on whether they found any such abominations!)

would also note that the current specification does not require that the technology, as described by the @SQ-PL alone, uniquely identify an error model. So it is possible that a different PM for the same PL could have a different error model, in which case tools might make invalid assumptions anyway.

Indeed. In general it is beyond the SAM spec's abilities to adjudicate on such matters. But there is unfinished business here in allowing such things to be specified; see e.g. https://github.com/samtools/hts-specs/pull/454#issuecomment-553945542.

jkbonfield commented 1 year ago

in order to have their bam/cram files be completely spec-compliant and work with the various libraries they've used, they now need to update htslib from version X to Y, htsjdk from V to W, noodles from T to U... If the only thing that changed in each of those libraries was the acceptable values of the PL field, then re-validation could be relatively straightforward, but if between versions V and W of htsjdk there were more extensive changes in behavior, then that process would be harder.

This is why the libraries probably shouldn't be strictly validating by default. It can be a useful feature to enable in order for people to do data validation, but it's not the file format implementations which are processing this data. They present it to a tool and really it's up to that tool to decide how to interpret the data given to it. In this scenario, you don't update your version of e.g. htsjdk in order to get one that accepts PL:MyNewInstrument, but in the process get a bunch of other changes. Instead in a perfect world you carry on using the same version you validated everything against before and if a validation failure triggers then you ignore that specific known issue and continue as before. It needs a tool chain to be aware of future changes, but that's reasonable.

Agreed on error models and PL alone being insufficient.

One very recent example of this is PacBio announce two new machines. Revio is essentially Sequel III, with very comparable error models, and SBB which is a totally different underlying technology. Theoretically the former is addressed using the platform model, while the latter would be best dealt with by adding a new platform (or revised as in John's A:B nomenclature).

I think it may be useful for someone somewhere to maintain a list of known models (not necessarily hts-specs, but we should link to something if not). The ENA, SRA, etc dictionary tables used for their databases are probably the sensible starting point, assuming they even match. Just knowing what strings to look for is a start, but as far as I know this is totally undocumented and only accessable by people internal to the archives, which is far from desireable.

daviesrob commented 1 year ago

ENA's list can be found in https://ftp.ebi.ac.uk/pub/databases/ena/doc/xsd/sra_1_5/SRA.common.xsd (look for PlatformType, and scroll down for the various instrument models).

jkbonfield commented 1 year ago

Or after some hunting, I found this: https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#platform

So not totally undocumented :)

jkbonfield commented 1 year ago

Also, prod to @raskoleinonen - EBI's list of accepted machine types needs to be updated to include the two new platforms SAM supports. Ideally using the same identifiers.

jkbonfield commented 1 year ago

So update from meeting.

We're in agreement basically. We should continue to list them in uppercase and advocate for this usage, but will suggest that implementations should be case insensitive due to the existance of data using lowercase.

I'll produce a PR for this.

raskoleinonen commented 1 year ago

Thank you @jkbonfield. ENA will add support for ELEMENT (Element AVITI) and ULTIMA (UG 100).