samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/
632 stars 174 forks source link

Allow for UTF-8 field values in header regular expression #726

Open jmarshall opened 1 year ago

jmarshall commented 1 year ago

Use [:print:] in the header regex and note that for ASCII it is equivalent to [ -~] and that the aim is to forbid control characters. Fixes #719.

To be honest, I'm tempted to add the extra [] to the \cclass definition and waste a bit of space each time this appears rather than add the “For brevity” sentence.

An alternative to this PR might be to just leave the regex as [ -~] and add a footnote explaining that this is an oversimplification for fields that allow Unicode values.

github-actions[bot] commented 1 year ago

Changed PDFs as of 229e998b93fe168dc107095d81b9a31ccc75483d: SAMv1 (diff).

github-actions[bot] commented 1 year ago

Changed PDFs as of 3b4e874ec5e426886baa251aed2e6cf671dee521: SAMv1 (diff).

jkbonfield commented 1 year ago

I'm unsure on the extra brackets also. My inclination though is it's probably not worth the hassle of inventing our own syntax and just going with the official double bracket style.

I'm guessing the extra brackets however were to permit things like [:[:alnum:]] being [:A-Za-z0-9] without ambiguity.

zaeleus commented 1 year ago

Note that UTS #18: Annex C suggests :print: character class compatibility for Unicode as \p{graph}\p{blank}--\p{cntrl}, which is likely not the appropriate definition here since it includes :blank:. It may be better to note a property matcher instead, e.g., \P{Other}.

jmarshall commented 1 year ago

Which specific :blank: characters are you worried about inadvertently including? (The obvious worry is TAB, but presumably that is removed by \p{cntrl}.)

zaeleus commented 1 year ago

Ah, yes, you're right. I had a set operation wrong when I tested with an example. :print: seems sufficient for this change.

jkbonfield commented 1 year ago

I see I was assigned this in the last meeting.

Personally my preference is [[:print:]] over [:print:] as it's a standard and ironically the extra couple of characters we add a few times is less text than the "for brevity" statement. Not a hard "must be so" line, but if in agreement I'd prefer that before merging. Otherwise I'm happy with it.