VCF ambiguities - Githubissues

d-cameron commented 9 years ago

The following syntax ambiguities/parsing issues remain in the draft VCF 4.3 specs:

Non-printable ASCII characters:

U+0000 causes causes paring issue for C implementations
Allowing alternate line separators (such as RS) within text breaks text file format conversion (round-trip LF->RS->LF not possible if RS is an allowable string character)

Recommended solution:

U+0000-U+0008, U+000B-U+000C, U+000E-U+001F disallowed anywhere
Line separators MUST be CR LF or LF
CR, LF allowed only as line separators at end of line
Meta-information lines
Meta-information line format not fully defined
Example in 1.1 contains undefined headers
Allowable characters in header key not defined
contig line in Example 1.1 contains taxonomy key that is not double-quote escaped

Recommended solution:

Allow arbitrary meta-information lines
keys must be alphabetic [a-zA-Z]+
- Optional: user-defined keys start with lowercase [a-z][a-zA-Z]*
field definition optional keys (eg ID, Description, md5, url, etc) must conform to [a-zA-Z0-9_]+ -Double-quote escaping MAY be used in custom headers -Double-quote escaping MUST be used in any field definition header whose value contains any of '= > " \ :'
Percent encoding
TAB, CR, LF and other troublesome characters are not encoded
Literal % is written to an INFO field by at least one existing SV caller

Recommended solution:

Reference RFC1738 when referring to "URL encoding"
- the following characters MUST be encoded CR LF TAB % : ; (and U+0000-U+001F if disallowed)
- Optional: require = also be encoded (slightly simpler parsing since INFO can then be split on =)
- U+0100 and above MUST NOT be encoded (since they cannot encoded in 2 hex digits).
Implementations SHOULD treat a % not followed by 2 hex digits as a literal %
contig name
<DEL> is a valid contig name
DEL is a valid ID String (which causes issues for SVs such as "A[<DEL>[" )

Recommended solution:

Valid characters should use the SAM regex of "[!-)+-<>-~][!-~]*" but restricted the following additional characters "<>[]:"
contigs names MUST NOT use a reserved symbolic alternate allele name
end of record/file
should there be a line separator at end of file?
should there be a TAB after the last record in each line?

Recommended solution:

A TAB SHOULD NOT be written after after the last record in each line.
Implementations SHOULD write a line separator after each line
Implementation MUST consider a line separator at end of file optional (including for VCFs with 0 records)
"String"
Characters allowed in a "String" field should be defined / disallowed list is not exhaustive

Recommended solution:

Regex for each field and global list of reserved characters
Float parsing
Is scientific notation allowed?
What locale should float/int <-> string parsing be performed in

Recommended solution:

All integer values are of the form "\-?[0-9]+?"
All Numeric/Floating point values are of the form "\-?[0-9]+(\.[0-9]+)?"
Misc
Section 1.4.1.8 (Data lines, Fixed fields, INFO) does not list reserved SV fields -> add reference to
Official 'Description' text for sepc-defined fields not set -> add official section/example containing all headers defined by the specs (inc INFO SV headers)

pd3 commented 9 years ago

This is excellent Daniel, thank you. +1 from me.

jmarshall commented 9 years ago

should there be a TAB after the last record in each line?

No, and this is already mostly-adequately described: a final TAB in a line would introduce a final empty field, and those are disallowed by §1. This could be clarified by describing the lines as tab-separated rather than tab-delimited. So in §1.3 and §1.4.1:

-The header line is tab-delimited.
+The fields of the header line are separated by TAB characters.
@@ ...
-All data lines are tab-delimited.
+The fields of data lines are separated by TAB characters.

ghost commented 9 years ago

Float parsing

Is scientific notation allowed?

All Numeric/Floating point values are of the form "-?[0-9]+(.[0-9]+)?"

To me it doesn't seem necessary to limit a float to the regex above. AFAIK parsing scientific notation isn't a problem so either notation should be valid. Personally I'm thinking of VCF files generated by algorithms that may need to output several very small numbers (admittedly the preceding 0's would compress). What is the reasoning behind limiting floating point numbers to full notation?

pd3 commented 9 years ago

@drjsanger This escaped my attention, sorry. I also agree that it must be possible to express floating point numbers using the scientific notation, for example 1e-10 should be valid. This regular expression should allow that '-?([0-9]+)?.?[0-9]*([eE][+-][0-9]+)?'. Note that by this we also disallow NaN, inf and similar.

jmarshall commented 9 years ago

Re float parsing, the regexp used in the SAM spec is [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?, which better represents real-world desirable +/- usage. It's unclear to me why we would disallow NaNs and infinities though — they are explicitly valid in BCF.

An alternative approach to saying what text floats should look like is to refer to the C or Java definitions of the relevant functions. So we could say something to the effect of: float values contain no white space but otherwise are as parsed by C's atof() in the default "C" locale or equivalently by Java's Float.valueOf() without any FloatTypeSuffix, perhaps except that binary/hex representations are disallowed. Admittedly there's a bit much "without this bit" in this approach, but it has the advantage over the regexp of saying what the digit string is supposed to mean…

bhandsaker commented 9 years ago

I think NaN and +/-Inf should definitely be allowed as float values.

I'm a little unclear about the issue for contig names, but I think contig names should not be allowed to look like symbolic alleles (starting with "<" and ending with ">") to avoid ambiguities. This seems like a reasonable restriction to me on contig names. I don't think it is reasonable to think that we can pre-define all important/reasonable/useful symbolic alleles. Symbolic alleles give us a way (without going outside the specification) to experiment with how to represent complex structural haplotypes.

ghost commented 9 years ago

I have to agree, I think NaN and +/-Inf should be valid as float values.

pd3 commented 9 years ago

@drjsanger @bhandsaker I remember there was an informal discussion long time ago that +/-inf can be replaced by reasonably big floats, but I am happy with +/-inf as well. However, what is the use case for NaN? We already have a missing value ".", parsers would have to check for both NaN and "."

bhandsaker commented 9 years ago

NaN is a valid float value. Depending on the API, a missing value "." or just no attribute present may be represented differently than returning a float. Also in some cases there may be difference between missing/absent (i.e. test was not performed) or NaN (test was performed but did not produce a valid value). Note that BCF distinguishes between NaN and missing (signaling NaN is used to store missing, quiet NaN to store a true NaN value).

d-cameron commented 9 years ago

As long as locale is explicitly specified, referring to another language specification seems to be the best way to fully define support. atof() might not be the best example as we'd want to disallow both leading whitespace, trailing garbage, alternate locales. I presume base 10 encoding is preferable for human readability purposes. Does it make sense to define a minimum round-trip fp precision? I've had errors introduced by tools truncating to 2dp, but that might be outside the scope of the specs.

d-cameron commented 9 years ago

@bhandsaker I was under the impression that complex structural variants were intended to be represented as any number of breakend variants all sharing matching EVENT identifier, with the predefined symbolic alleles used for common local SVs.

bhandsaker commented 9 years ago

I believe breakends are better for describing the kinds of structural rearrangements one finds in cancer genomes. Our lab focuses more on complex common SVs, such as 17q21.31 http://www.ncbi.nlm.nih.gov/pubmed/22751096 and similar sorts of situations where we might want to represent shared haplotypes in the population but may not have the level of specificity implicit in or required by the breakend notation.

I'm not suggesting any specific changes to the spec, just that we don't add language that would make a VCF invalid if it contains a new symbolic allele (although all symbolic alleles SHOULD be defined in the header).

d-cameron commented 9 years ago

@bhandsaker that in itself is a change to the v4.3 spec as the draft currently has a pre-defined fixed set of top-level SV allele as opposed to the open-ended style of v4.2. It seems like what you want to achieve can be resolved by reverting to allowing symbolic allele of any name, forcing interpretation of any reference matching a symbolic allele name as a reference to the symbolic allele and not a named contig with the same name.

It's looks like more than just the ALT header keys that would need changing for properly modelling complex SVs. SVTYPE is currently restricted to DEL, INS, DUP, INV, CNV, BND, and complex SVs don't fit neatly into any of those, although you could probably hack 17q21.31 into a named CNV allele. Does your use case involve representing such common variants as single-line named SVs at a nominal haplotype position, or is it something else?

On a related note, the specs still seem unclear as to whether alleles such as TTTA<DEL>, or AA<namedcontig1>TAAA<namedcontig2>NNN are allowed valid allele.

bhandsaker commented 9 years ago

@d-cameron @bhandsaker that in itself is a change to the v4.3 spec as the draft currently has a pre-defined fixed set of top-level SV allele as opposed to the open-ended style of v4.2. Can you point me to where this is changed so that symbolic alleles have to come from a pre-defined fixed set?

d-cameron commented 9 years ago

Line 146-153 of the initial VCF 4.3 branch change set https://github.com/samtools/hts-specs/commit/af41ed9d52c222b76193ddc9541b4539ea952147

bhandsaker commented 9 years ago

Oh. heh heh. It's possible that I wrote that actually as part of the original SV specification. Well, I don't think it's worth loosening the specification. Sorry for the confusion.

pd3 commented 9 years ago

@d-cameron The RFC1738 is primarily about URLs and most of the unsafe characters relevant to URLs do not apply to VCF. Wouldn't it be better to give an explicit list of characters that can be encoded?

%3a  ..  :  (colon)  
%3b  ..  ;  (semicolon)
%3d  ..  = (equal sign)
%25  ..  % (percent sign)
%2c  ..  ,  (comma)
%0d  .. CR
%0a  .. LF
%09  .. TAB

The percent sign should be always written as %25, I think.

EDIT: added TAB to the list

d-cameron commented 9 years ago

I'd be happy with an explicit list of characters, although your list is missing TAB.

I've used RFC-style verbs MUST and SHOULD to indicate things that implementations need to implement, and things it would be nice to have. A percent sign MUST be encoded, but if a VCF is encountered in which it is not, then it would be nice for the implementation to fall back to literal %. The idea behind this is that it would allow an implementation to use the 4.3 decoding algorithm on earlier file versions which allow a literal % in the INFO field without dying. Happy to have omit that section to prevent confusion.

On Tue, Jun 23, 2015 at 12:12 AM, pd3 notifications@github.com wrote:

@d-cameron https://github.com/d-cameron The RFC1738 is primarily about URLs and most of the unsafe characters relevant to URLs do not apply to VCF. Wouldn't it be better to give an explicit list of characters that can be encoded?

%3a .. : (colon) %3b .. ; (semicolon) %3d .. = (equal sign) %25 .. % (percent sign) %2c .. , (comma) %0d .. CR %0a .. LF

The percent sign should be always written as %25, I think.

— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/89#issuecomment-114119940.

pd3 commented 9 years ago

Ah, sorry, added TAB to the list. What others think about the literal % sign?

bhandsaker commented 9 years ago

I guess you mean the proposal to fall back to treating % literally (presumably if it is not followed by a digit). I think it's OK, but another possibility is to just trust the file header. If you want the old behavior, set your version to 4.2.

On 6/23/15 4:11 AM, pd3 wrote:

Ah, sorry, added TAB to the list. What others think about the literal % sign?

— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/89#issuecomment-114399893.

cwhelan commented 8 years ago

+1 to the section on Meta-information lines and quote-escaping

jmarshall commented 5 years ago

FYI re https://github.com/samtools/hts-specs/issues/89#issuecomment-111412189 and the following comments on NaNs and infinities: the 4.3 spec was modified to allow them in db372bca317e5576a5e082e63c395ac702054dee later that June, but that was never mentioned in this discussion.

The text added is that Float values are formatted “to match the regular expression ^…numeric…$, NaN, or +/-Inf”. It is not explicit whether NaN and Inf are intended to be case sensitive, but as they are adjacent to a regexp saying …[eE]… the reader's best guess is probably that they are intended to specify that exact capitalisation.

C's printf can output nan/NAN and [-]inf/[-]INF or [-]infinity/[-]INFINITY (there is no programmer control over whether infinity is 3 or 8 letters), while its scanf understands all of those completely case-insensitively. Meanwhile Java's Double.valueOf accepts only [+-]?NaN and [+-]?Infinity spelt and capitalised exactly thus.

Thus the VCF specification's NaN and +/-Inf are difficult to output in C and its +/-Inf is difficult to parse or output in Java. See also samtools/bcftools#755.

IMHO the right way to fix this in the specification is to apply Postel's law and change that text to match what scanf accepts (i.e., {+,-,}{NAN,INF,INFINITY} case-insensitively).

samtools / hts-specs

VCF ambiguities #89

Non-printable ASCII characters:

Meta-information lines

Percent encoding

contig name

end of record/file

"String"

Float parsing

Misc