samtools / hts-specs

Specifications of SAM/BAM and related high-throughput sequencing file formats
http://samtools.github.io/hts-specs/

vcf: Invalid FORMAT key in passed example #689

Open zaeleus opened 1 year ago

zaeleus commented 1 year ago

The last record in test/vcf/4.3/passed/passed_body_format.vcf has a FORMAT key named G%3AS (percent-decoded to G:S), which is an invalid identifier. From The Variant Call Format Specification: VCFv4.3 and BCFv2.2 (2022-08-22) § 1.6.2 "Genotype fields":

colon-separated FORMAT keys matching the regular expression ^[A-Za-z_][0-9A-Za-z_.]*$
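As a rough illustration of the rule quoted above, the check can be sketched in Python. The helper name `invalid_format_keys` and the sample FORMAT strings are hypothetical and are not taken from the spec or the test suite; only the regular expression comes from the quoted text.

```python
import re

# Regular expression for FORMAT keys, copied from the spec text quoted above.
FORMAT_KEY_RE = re.compile(r"^[A-Za-z_][0-9A-Za-z_.]*$")


def invalid_format_keys(format_column: str) -> list[str]:
    """Return the colon-separated FORMAT keys that do not match the spec regex.

    Hypothetical helper for illustration only. fullmatch is used so that a
    trailing newline is not silently accepted by the `$` anchor.
    """
    return [
        key
        for key in format_column.split(":")
        if FORMAT_KEY_RE.fullmatch(key) is None
    ]


# The key reported in this issue fails the check because of the '%'.
print(invalid_format_keys("GT:G%3AS"))
print(invalid_format_keys("GT:DP:GQ"))
```

Under this sketch the key is rejected whether or not it is percent-decoded first: `G%3AS` contains `%`, and `G:S` would split into two keys at the colon.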

d-cameron commented 1 year ago

Agreed.

This touches on two larger problems with percent encoding in VCF:

This results in a parsing ambiguity, as an INFO field can have the form INFO_KEY=%25 and it is ambiguous whether the intended value is a single % or the literal string %25. IIRC, the intent was, wherever percent encoding is supported, for colon, semicolon, equals, percent, comma, CR, LF, and TAB to be encoded as %3A, %3B, %3D, %25, %2C, %0D, %0A, %09 respectively, and for exactly these 8 strings to be decoded to their corresponding characters. This allows

Missing from the specs are 1) an explicit list of where percent encoding can/cannot be used, 2) an explicit list of what does/does not require percent encoding, and 3) what to do with other values that look percent-encoded.

IIRC, the intent was: (1) percent encoding applies only to the INFO and FORMAT field values that need it (the specs only explicitly mention encoding in the 1.6.1.8 INFO section); (2) encode/decode all 8 reserved values; (3) treat everything else as literals (including a % not followed by one of the 8 possible encodings) so as to maximise backward compatibility with 4.2. That is, VCF percent-encoding is the equivalent of running 8 string replaces when parsing/encoding.
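A minimal sketch of that interpretation in Python, under some assumptions: the function names are hypothetical, only the uppercase forms listed above are treated as encodings, and the replacement is done in a single pass rather than as 8 sequential replaces. The single pass matters because sequential replaces are order-sensitive: decoding %25 first would turn %253A into %3A and then into a colon.

```python
import re

# The 8 reserved characters and their encodings, as listed in this thread.
ENCODE = {
    "%": "%25",
    ":": "%3A",
    ";": "%3B",
    "=": "%3D",
    ",": "%2C",
    "\r": "%0D",
    "\n": "%0A",
    "\t": "%09",
}
DECODE = {code: char for char, code in ENCODE.items()}

# One alternation per direction, so each input position is rewritten at most once.
_ENCODE_RE = re.compile("|".join(re.escape(char) for char in ENCODE))
_DECODE_RE = re.compile("|".join(re.escape(code) for code in DECODE))


def vcf_percent_encode(value: str) -> str:
    """Encode the 8 reserved characters; everything else is left untouched."""
    return _ENCODE_RE.sub(lambda m: ENCODE[m.group(0)], value)


def vcf_percent_decode(value: str) -> str:
    """Decode exactly the 8 listed strings; any other '%' stays a literal.

    Assumption: lowercase hex such as '%3a' is NOT treated as an encoding.
    """
    return _DECODE_RE.sub(lambda m: DECODE[m.group(0)], value)


print(vcf_percent_decode("%25"))     # a single percent sign
print(vcf_percent_decode("%253A"))   # the literal string %3A, not a colon
print(vcf_percent_decode("%41"))     # unchanged: not one of the 8 encodings
```

With this reading, INFO_KEY=%25 always decodes to a single %, and a writer that wants the literal string %25 has to emit %2525.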

d-cameron commented 1 year ago

The other interpretation of the specs is that percent-encoding works on any of the 8 reserved characters anywhere in a VCF file. I'm less keen on this interpretation as it's really not needed elsewhere, except if you want to use contig names with characters that are not reserved in SAM but are in VCF.

d-cameron commented 1 year ago

Rereading the specs, s1.2 could do with a bit more clarification. Namely:

jmarshall commented 1 year ago

Previously @d-cameron wrote:

[…] except if you want to use contig names with characters that are not reserved in SAM but are in VCF.

Which characters do you have in mind that are reserved in one but not the other? As far as I am aware, the rules are aligned between SAM and VCF. The only potential difference I am aware of is #711, on which your opinion would be appreciated.