Open zaeleus opened 1 year ago
Agreed.
This touches on two larger problems with percent encoding in VCF:
This results in a parsing ambiguity as an INFO field can have the form INFO_KEY=%25 and it is ambiguous as to whether the intended value is a single % or the literal string %25.
IIRC, the intent was, wherever percent encoding is supported, for colon, semicolon, equals, percent, comma, CR, LF, and TAB to be encoded as %3A, %3B, %3D, %25, %2C, %0D, %0A, %09 respectively, and for exactly these 8 strings to be decoded to their corresponding character. This allows
Missing from the specs are 1) an explicit list of where percent encoding can/cannot be used, 2) an explicit list of what does/does not require percent encoding, and 3) what to do with other values that look percent-encoded.
IIRC, the intent was: (1) only the INFO and FORMAT field values that need encoding are encoded (the specs only explicitly mention encoding in the 1.6.1.8 INFO section); (2) encode/decode all 8 reserved values; (3) treat everything else as literals (including a % not followed by one of the 8 possible encodings) so as to maximise backward compatibility with 4.2. That is, VCF percent-encoding is the equivalent of running 8 string replaces when parsing/encoding.
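To make the "8 string replaces" reading concrete, here is a minimal Python sketch. The names RESERVED, encode_value and decode_value are hypothetical, and the replacement order (percent first when encoding, last when decoding) is my assumption about how to keep the replaces from interfering with each other, not something the specs state.

```python
# The 8 reserved characters and their encodings, as listed in this thread.
# "%" is first so that encoding does not re-escape a "%" it just produced.
RESERVED = [
    ("%", "%25"),
    (":", "%3A"),
    (";", "%3B"),
    ("=", "%3D"),
    (",", "%2C"),
    ("\r", "%0D"),
    ("\n", "%0A"),
    ("\t", "%09"),
]


def encode_value(raw: str) -> str:
    """Encode a field value with 8 replaces, handling "%" first."""
    for char, code in RESERVED:
        raw = raw.replace(char, code)
    return raw


def decode_value(encoded: str) -> str:
    """Decode a field value with 8 replaces, handling "%25" last.

    Anything that is not one of the 8 encodings is left as a literal.
    """
    for char, code in reversed(RESERVED):
        encoded = encoded.replace(code, char)
    return encoded
```

With this ordering, a literal %3A in a value encodes to %253A and decodes back to %3A rather than to a colon. Decoding %25 before the other seven would turn %253A into a colon, which is the kind of ambiguity described above.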
The other interpretation of the specs is that percent-encoding works on any of the 8 reserved characters anywhere in a VCF file. I'm less keen on this interpretation as it's really not needed elsewhere except if you want to use contig names with characters that are not reserved in SAM but are in VCF.
Rereading the specs, s1.2 could do with a bit more clarification. Namely:
" in the headers? Should these be decoded or treated as literal percentages? What about reserved characters in CHROM, ID or FILTER?
% that does not decode to a reserved character should be treated as a literal % (so we don't break VCFs with values like KEY=97% in them, which I've seen in the wild).
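As an illustration of the "treat everything else as a literal" rule, the following Python sketch decodes in a single pass and leaves any % that is not part of one of the 8 encodings untouched. DECODE, PATTERN and decode_lenient are hypothetical names, and matching only uppercase hex digits is an assumption made for this sketch.

```python
import re

# The 8 encodings discussed in this thread; nothing else is decoded.
DECODE = {
    "%3A": ":",
    "%3B": ";",
    "%3D": "=",
    "%25": "%",
    "%2C": ",",
    "%0D": "\r",
    "%0A": "\n",
    "%09": "\t",
}

# One alternation over the 8 encodings, so the input is scanned once
# and decoded output is never scanned again.
PATTERN = re.compile("|".join(re.escape(code) for code in DECODE))


def decode_lenient(value: str) -> str:
    """Decode the 8 reserved encodings; keep every other "%" as a literal."""
    return PATTERN.sub(lambda match: DECODE[match.group(0)], value)
```

Under this sketch, KEY=97% keeps its trailing %, and a sequence such as %41 stays as the literal three characters because it is not one of the 8 encodings.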
Previously @d-cameron wrote:
[…] except if you want to use contig names with characters that are not reserved in SAM but are in VCF.
Which characters do you have in mind that are reserved in one but not the other? As far as I am aware, the rules are aligned between SAM and VCF. The only potential difference I am aware of is #711, on which your opinion would be appreciated.
The last record in test/vcf/4.3/passed/passed_body_format.vcf has a FORMAT key named G%3AS (percent-decoded to G:S), which is an invalid identifier. From The Variant Call Format Specification: VCFv4.3 and BCFv2.2 (2022-08-22) § 1.6.2 "Genotype fields":