samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
638 stars 174 forks source link

sam: Clarify the usage of UTF-8 characters in header #719

Open zaeleus opened 1 year ago

zaeleus commented 1 year ago

This is in regard to Sequence Alignment/Map Format Specification (2022-08-22).

ยง 1.3 "The header section" defines patterns for header lines:

Thus header lines match /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/.

This invalidates the following test examples:

The text "UTF-8 encoding may be used" for the CL and DS fields does not remove the character set constraint. It also remains arbitrary as to why only some fields have this definition.

jmarshall commented 1 year ago

The problem here is simply that the regular expression has not been updated after the addition of the loosening to allow UTF-8 in certain field values. These test files are not made invalid, and the โ€UTF-8 encoding may be used [in these field values]โ€ text should be genially interpreted as augmenting and superseding the ASCII restriction in that part of the regular expression.

That said, there is indeed a minor inconsistency here and we should improve the [ -~]+ part of the regular expression so that it allows for UTF-8 characters. And perhaps add some text nearby stating that UTF-8 fields are the exception and most fields are limited to [ -~]+.

What it is really trying to say is that field values are strings of one or more non-TAB characters. In expanding this to allow UTF-8 characters, we should consider whether there are any other UTF-8 whitespace characters that we might want to disallow.

It is indeed true that only a few particularly and individually specified fields allow UTF-8 characters beyond ASCII. IMHO the spec doesn't particularly need to explain why the particular choices have been made. OTOH it is useful for the lore to be known amongst the spec's maintainers and contributors.

The general rule of thumb is that anything that would be processed or interpreted by a program is restricted to ASCII. Free text fields that are ignored by programs or are simply displayed to the user (but not interpreted by program code) are candidates for specifying as allowing UTF-8.

The justification for this is that doing things like testing whether two strings represent the same text is non-trivial in UTF-8, as noted by @daviesrob in https://github.com/ga4gh/seqcol-spec/issues/2#issuecomment-760992902 and probably in other similar discussions too. Thus for example allowing UTF-8 in RNAME fields would make it difficult to determine whether two SAM records referred to the same chromosome while not adding any particular expressive power.

jkbonfield commented 1 year ago

We clearly need a footnote somewhere to explain the regexp may need some nuanced interpretation for UTF-8 capable fields. Attempting to include UTF-8 in the regexp will just break things horribly and add more confusion than it solves. I think maybe one way of handling this is to use POSIX character classes as they document the meaning of the terms. [ -~] is basically [[:print:]] as far as I can see. If so, we can then add a footnote stating:

In ASCII-only fields [[:print:]] is equivalent to [ -~], but note a few fields permit UTF-8 and the regular expression should not be taken as resricting that data.

Edit: actually if locale is set correctly, it may be that e.g. POSIX [[:alpha:]] really does mean alphabetical even for UTF-8 symbols, depending on the regexp engine you're using. Basically it's one way of dodging the question. It still does warrant a footnote though to be explicit about ASCII vs UTF-8. Basically the regular expression doesn't cut it for the entire header as it's too broad given different fields have different locales, but it gets the concept over well.

jmarshall commented 1 year ago

There are lots of Unicode whitespace characters, but fortunately none of them have tab-like properties. There are however additional line separator characters, so I have added text to PR #670 (which is no longer draft) to emphasise that SAM file line endings must be ASCII โ€” in particular LF or CR-LF.

we should consider whether there are any other UTF-8 whitespace characters that we might want to disallow.

Upon further reflection, I don't think we should do this. Disallowing them would invite people to think that such characters would act as field or line separators, but we want parsing those delimiters to remain doable by looking only for ASCII. This means that people could put U+2028 (line separator) characters in their @SQ-DS fields and have the @SQ header appear to span multiple lines. We can call this a feature ๐Ÿ˜„

I think maybe one way of handling this is to use POSIX character classes as they document the meaning of the terms. [ -~] is basically [[:print:]] as far as I can see.

That's a pretty good idea. (And [!-~] is [[:graph:]].) I have made PR #726 to try it out.

What the existing [ -~] regex really does is forbid control characters in these field values: no \r or \n, no tab, and no NUL characters. The rest of the control characters we don't really care about, but it is as well to forbid them all. This character class trick is a good way to extend that to all of Unicode.