samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
642 stars 174 forks source link

Problems with allowed contig names in the SAM spec #124

Closed yfarjoun closed 4 years ago

yfarjoun commented 8 years ago

We've recently come across an hg38 contig name of the form 'HLA-A01:01:01:01' which caused problems when used on the commandline in the GATK, as the parser understood the suffix :01 to mean "at position 1" and then is complained that it cannot find HLA-A01:01:01 (which thankfully isn't in hg38..., otherwise we would have probably not noticed this for a while. Thus the argument *-L HLA-A01:01:01:01** is not one that the GATK can handle.

Its too late for hg38 to ask for any name changes, but I think it might make sense to revisit the allowed characters and remove a few. Currently there are a two characters that are not allowed as the first character in the @SN (* and =) but the remaining printable characters (! through ~) are allowed, and they are all allowed as the following characters. I would propose that we add a few more characters to the "not allowed" list. In particular it would be good to avoid the various quotes (',,") anywhere in the @SN and I would like to suggest that the ending should not look like a genomic position, i.e. it should **not** match the regex ':[0-9](-[0-9])?`'. The reason I say "suggest" is that given that hg38 already includes this ending, it would be quite difficult to reverse that and even if we only put it in a future "version" of the SAM Spec it isn't clear what that means since the old references will still be around...

This isn't a formal proposal...I mostly want to know what people think about this issue and see if we can come up with good solutions.

tfenne commented 8 years ago

@yfarjoun I can totally empathize with your problems. I've run into similar problems with other tools (a different variant caller would, internally, invoke shell commands with naked reference names, and the unquoted * caused problems there).

Unfortunately I think this ship has sailed. Fwiw I suspect your issue isn't with the main hs38 assembly, but with the hs38DH reference created by bwa.kit. Any change that would prohibit any new characters in reference names, or prohibit particular suffixes, would necessarily be backwards incompatible. And that's likely to lead to tools, moving forward, not supporting previously valid reference names, and leaving a lot of data stranded and in need of re-writing. I'd suggest that the best we can do is to add a best practice not to name sequences like this.

Presumably the problem you're having can be fixed by checking to see if HLA-A*01:01:01:01 or HLA-A*01:01:01 or HLA-A*01:01 is a valid reference name before choosing how to deconstruct the string? You're only really in trouble if two or more of them are valid reference names.

yfarjoun commented 8 years ago

I was hoping that quotes are not found in the wild in contig names and that we would be able to ban them. I agree about the ending being a suggestion/best practice.

The workaround we are considering is to append ':1+' to the commandline argument. But that only solves the problem for GATK, not your problem with *

yfarjoun commented 8 years ago

Oh, I would also ban the backslash if possible.

d-cameron commented 8 years ago

':[0-9](-[0-9])?' is an issue not only with GATK, but with any variant caller using VCF. Due to parsing breakend parsing ambiguities, the VCF specifications (v4.1) explicitly prohibits colons anywhere in contig names, although implementations generally either ignore this (since they leave such parsing to the end user), or only die when choosing the 'wrong' interpretation of the ambiguity. Greater than and less than have similar issues where could be reference contig of that name or an assembly contig from an external file whose name does not include the tags.

d-cameron commented 8 years ago

I suspect that even if you get GATK working, you'll have continual trouble with downstream analysis due to your output not being a valid VCF file.

yfarjoun commented 8 years ago

I agree. Do you have a recommendation or are you raising this as further evidence that the spec should change?

On Sun, Jan 17, 2016 at 6:16 PM, Daniel Cameron notifications@github.com wrote:

I suspect that even if you get GATK working, you'll have continual trouble with downstream analysis due to your output not being a valid VCF file.

— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/124#issuecomment-172394605.

d-cameron commented 8 years ago

I think @tfenne is correct that we can't make the requiring changes without breaking backward compatibility. One option that I think has merit is to introduce the concept of compliance levels and elevate the meaning of section 2 (Recommended Practice for the SAM Format) so that a level 1 compliant SAM/BAM complies the basic section 1 SAM specifications, and level 2 level 2: complies with section 2 recommended practice as well as additional new restrictions such as VCF-compliant QNAMES, CIGARs not having dangling I/D/N operators, fields (eg mate coordinate) and predefined tags (eg NM, MC, R2, Q2, TC, ...) actually have the correct value, read pairs are actually paired and have matching information, RNAME is unique, QNAMEs all match an SG header, FLAG fields make sense, predefined tags are of the correct type, ...

Some of the section 2 recommended practices should only be recommended in some cases so it's important that such a distinction be made. For example, fields that can be regenerated (eg, NM, MC, R2, Q2, SA?, SEQ & QUAL for secondary alignments) are not wanted for archival purposes but can be quite important for acceptable runtime performance of variant calling and other downstream analysis.

A compliance level approach allows us to retains backward compatibility but has the significant advantage that a whole host of edge cases and sanity checks can be externalised into compliance checking/fixing tools. Essentially, level 1 = parses as SAM, level 2 = actually make sense.

tfenne commented 8 years ago

@d-cameron I really like the idea of adding some kind of official compliance/strictness levels to the specification. Picard's ValidateSamFile has for a long time performed significantly more validations that one would infer from a strict reading of the spec, but is incredibly useful. It would be nice to codify some of those validations and similar best practices in the spec in a way that encourages more people to test them.

jkbonfield commented 8 years ago

See also https://github.com/samtools/samtools/issues/149 I'm not convinced all those ideas are sane though! What was I thinking? :)

yfarjoun commented 8 years ago

Another character that seems to be banned from "all VCF strings" is the comma...so probably the comma should not be allowed in in level 2 strictness.

What will happen if we make ":" not level 2 compliant, and then try to process hg38?

droazen commented 6 years ago

Discussion on this issue in the context of the VCF spec specifically (rather than the SAM spec, which was the focus here) is happening in this ticket: https://github.com/samtools/hts-specs/issues/258

jmarshall commented 6 years ago

I have just reread all these contig name issues and learned a little about the HLA allele naming scheme after this hilarity and the ensuing conversation.

Looking at file format specifications in isolation, and using every day as we do the interval notation chrom:start-end that's been in the UCSC genome browser and other tools since forever… from this point of view putting colons in contig names seems crazy.

Elsewhere, the HLA folks have been cataloging their alleles for some decades — previously using notation like A*01010101, in which all the 01s are two-digit subfields in some biologically-meaningful hierarchy. And obviously there wouldn't be more than a hundred allele categories at each level :smile:.

In 2010, they needed more digits, and added colons to their standard allele names. So that name became A*01:01:01:01. And other names like A*02:101 appeared (so heuristics about leading zeros won't help us). Presumably they had a big conference where they decided that using colons was the least-worst option.

A few years later, these names start getting used in reference FASTA files and make their way into SAM files. We could say that we meant for colons to be disallowed in RNAME and invite people to use _ or / or something instead in their HLA contigs, but I don't see any reason why they would listen to us: then the names in their SAM files would differ from in all their other literature, and that would suck.

So I now agree with @tfenne that the ship has surely sailed for colons. Even if we had disallowed them from the start, HLA people were accustomed to colon-containing names and would have used them in their SAM files anyway (because why should they want/need to read the spec, and samtools doesn't check) and filed bugs against Picard's validation if it rejected them :smile:

However judging by the counts in #291, we can still forbid commas, most kinds of brackets, and maybe semicolons without alienating anyone — and IMHO we should do so.


This leaves a problem for tools, which would like to allow users to say chrom:start[-end] conveniently and unambiguously, and for format specifications, which need to parse contig names in fields or as parts of fields unambiguously. There's basically four ways to solve this:

  1. Heuristics
  2. Escaping
  3. Quoting
  4. Extra or distinctive trailing delimiters

For tools, intervals are specified on the command line as well as interactively on web pages or in applications. So escaping with \ or quoting with "' is a non-starter, as they will be eaten by the shell (or need to be double-escaped, which is not convenient). "Quoting" with brackets would be a possible help: e.g. the notation could be chrom:start-end or {chrom}:start-end with the optional braces required when the contig name contains a colon. Or a trailing colon in chrom: could indicate all of chrom explicitly (none of the HLA contigs have trailings colons, and we could forbid that). But this still leaves A*02:101 ambiguous — to solve that requires heuristics. And ideally we wouldn't have to introduce any new syntax like those {} or trailing :.

If the tool already knows the list of valid contig names, it could parse an interval as follows:

if the string foo:bar has a plausible interval suffix (i.e. bar is NUM or NUM-NUM)

if both foo and foo:bar are existing contigs then …error it's ambiguous else if foo is an existing contig then return (foo, NUMNUM) else if foo:bar is an existing contig then return (foo:bar, whole contig) else …error unknown contig

else …either there's no colon or bar is not plausibly numeric

if the whole string str is an existing contig then return (str, whole contig) else …error unknown contig

IMHO it's important to make all these checks: I don't want my chr1:1000-2000 command line argument meaning something else because a malicious BAM file contains a contig named chr1:1000-2000 as well as the chr1 I'm looking for!

(At present, htslib and samtools do part of this: the then part of the outer if is just return (foo, NUM, NUM). I think this means they do the right thing already for HLA-A*30:14L but not for HLA-A*30:14, which needs the existing-contig-name heuristics.)

This is convenient for the current set of HLA names (see https://github.com/ANHIG/IMGTHLA/blob/Latest/fasta/hla_gen.fasta), which have the property that none is a colon-delimited prefix of any of the others (i.e., there's not both e.g. A*01:01:01 and A*01:01:01:01). If the set of names didn't have that property, then the ambiguous error would trigger from time to time and you would need extra syntax (like the {} or trailing :) to disambiguate in these cases.

So I think this heuristic is necessary for convenience in tools' user interfaces and suffices to solve the problem without introducing ambiguities or vulnerabilities (if implemented properly!).


For file formats, heuristics are not an option. Formats need to be simply and unambiguously parsable.

For SAM, I don't think there is a problem with colons: we aren't using a colon to delimit any contig names in subfields as part of any SAM fields. We are delimiting them with commas and perhaps semicolons in the SA tag (and the proposed OA tag), so we ought to forbid those characters in contig names before they become a problem. The counts in #291 suggest that comma at least is certainly not currently a problem.

For VCF, the major concern is around breakends, which look like ]chrom:pos] or [chrom:pos[. As @d-cameron noted on #258 there is no actual ambiguity here — as the :pos part is always present. While a chrom containing colons means you can't use the simple left-to-right parsing of this, it's not much harder to read to the final bracket and then back up to the rightmost colon.

What real parsing problems are there in parsing the formats with colons in contig names? Certainly there are (solvable) problems in tools' user interfaces, but actually there doesn't seem to be too much of a problem for the underlying formats. (Phew!) (Commas OTOH we need to do something about.)

jkbonfield commented 6 years ago

Completely agree. Back in 2014 I viewed this as a tool chain issue (see https://github.com/samtools/samtools/issues/149), but admittedly I didn't realise about break-ends in VCF then. As it turns out it's still not an issue there.

I fully agree also that we need to nail down the unused (or rarely used) cases to limit RNAME, and it's why we gathered that data in the first place. It's already been pointed out that comma causes bugs elsewhere (eg SAM SA tag) as we use comma separated data. The only commas that @daviesrob found were related to a malformed fasta file due to some parsing bug that has put crap all over the place. If it's equally rare elsewhere (and it appears to be) then banning it is a good step. Otherwise we'll look back on this in a few years with Foo,BAR ref name and be yelling at our past selves!

Quoting: agreed also. Originally I proposed =name= as starting with an equals character is forbidden (it's used in SAM mate ref-name to mean the same reference), but having seen braces aren't used yet, thank goodness, it makes for a more usual quoting syntax.

Finally, I'll update the htslib PR I put in (https://github.com/samtools/htslib/pull/708) to do more checks. It does most of your parsing suggestions above already bar the final case of spotting both ways of parsing matching known references. I may as well put in quoting support while I'm at it as a separate commit so we can go with it if we get nominal acceptance here.

Edit: for completeness, I should also point out that the most recent samtools issue we had raised on this was resolved by pointing out that RNAME: (instead of RNAME) to refer to the entirety of the reference was accepted and works regardless of whether RNAME contains colons. It's ugly though and I prefer quoting syntax for command line tools.

Falcury commented 6 years ago

Edit: for completeness, I should also point out that the most recent samtools issue we had raised on this was resolved by pointing out that RNAME: (instead of RNAME) to refer to the entirety of the reference was accepted and works regardless of whether RNAME contains colons.

If by chance you are referring to the issue I raised in https://github.com/samtools/htslib/pull/705, then I should clarify that I do not consider the issue 'resolved' by the fact that a workaround exists. In my project, samtools was (almost) silently failing and producing incomplete results. What saved me was the fact that I dug deeper and found the cause of the problem. Others may not be so lucky.

jkbonfield commented 6 years ago

Yes I did mean that. Sorry I didn't mean resolved in that manner (the problem still exists, hence my PR to address this in better ways), rather I meant in github terms the issue was resolved (by you) as a suitable temporary workaround while the better solution is in review.

Thanks for your help in this.