cyenyxe opened 9 years ago
I think we should revive this effort! Having a formal grammar for the VCF (and BAM/SAM) specs would solve many of the issues we've historically had with ambiguities in the wording of the spec documents, and eliminate the need to update our parsers for each version of the spec (since we could generate our parsers directly from the grammars).
@jmarshall @cyenyxe @mp15 @vadimzalunin @lbergelson @lh3 @akiezun @yfarjoun @tfenne What do you think?
Using grammars to define the file formats would be a huge improvement over our current state. It would make it much easier for different tools to offer compatible support of the spec, would make it easier to support multiple versions of the spec simultaneously, and would avoid a lot of ambiguities. (Although it would be more complicated to make sure that a more human-readable version of the spec actually matches the grammar.)
:+1: I would like to maintain a plaintext version. In the case of any ambiguity we would take the grammar to be correct.
@lbergelson's caveat is worth noting: I am worried that there will be more maintenance work to do, and I'm not sure that the benefits are that great.
The benefits as I see them are:
The drawbacks:
@yfarjoun The benefits of being able to auto-generate a parser from the latest spec, and to ensure consistency across different implementations (e.g., htslib vs. htsjdk), are so great that I think it may be worth the up-front effort (which admittedly would be steep).
Reading Louis's list of drawbacks is scary. I am not at all sure that I agree with @droazen's assessment of the worth...
Consider the status quo, though:
- htsjdk does not support the latest VCF and BCF specs
- backwards compatibility with previous versions of the specs is spotty at best
- htsjdk and htslib parsers behave differently in many cases (especially corner cases)
- legal values for most tokens are poorly defined (or not defined at all), and when rules do exist they are often spottily enforced
- each time a new version of the spec is released, developers across multiple projects must identify the changes (which itself is not always easy, given the textual format of the spec), implement them manually, determine whether older versions of the files can still be parsed correctly, etc.
- those involved in drafting the specs spend a lot of time debating things like should vs. may vs. must, and dealing with wording that could be misinterpreted or does not reflect reality, when there are mechanisms available to define all these things precisely
What about strict versus non-strict validation? Will you have a different grammar for the different strictness levels?
I think strictly defining the allowed characters of each field is a good thing, if this has not been done already. However, a full-blown grammar is overkill and actually not very useful: a grammar will still leave uncertainties. For example, in one version of the 1000g VCF, there were lines like these:
##INFO=<ID=HOMSEQ,Number=.,Type=String>
20 14370 rs6054257 G A 29.1 . NS=3;DP=14;AF=0.5;HOMSEQ;DB;STRVEC=a,b GT:GQ:DP:HQ:CNL 1|0:48:8:49,51:.
Notably, HOMSEQ is declared as a variable-length String array, but on the variant line it appears to be of Flag type. There is no way for a context-free grammar to resolve this ambiguity. And as we can't fully describe VCF with a grammar alone anyway, my preference is to describe VCF purely with a regular language (i.e. the language expressible with regular expressions), which can identify invalid VCFs within the limits of a regular grammar.
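To make the ambiguity concrete: resolving it requires cross-referencing each variant line against the header, which is a context-sensitive check beyond both regular and context-free grammars. A minimal sketch of such a "semantic" layer in Python (hypothetical helper names, and a deliberately simplified header parser that assumes no commas inside quoted Description fields):

```python
import re

def parse_info_meta(header_line):
    """Parse an ##INFO=<...> header line into a dict of its key=value pairs.
    Simplified: assumes no commas inside quoted Description fields."""
    body = re.match(r'##INFO=<(.*)>$', header_line).group(1)
    return dict(kv.split('=', 1) for kv in body.split(','))

def check_info_column(declared, info_column):
    """Report INFO keys whose usage on a variant line contradicts the
    Type declared in the header (e.g. a String key used like a Flag)."""
    errors = []
    for item in info_column.split(';'):
        key, _, value = item.partition('=')
        decl = declared.get(key)
        if decl is None:
            continue  # undeclared key; a pure grammar cannot flag this either
        if decl['Type'] != 'Flag' and value == '':
            errors.append('%s declared Type=%s but used as a Flag'
                          % (key, decl['Type']))
    return errors

# The 1000g example from above: HOMSEQ is declared as a String array...
meta = parse_info_meta('##INFO=<ID=HOMSEQ,Number=.,Type=String>')
declared = {meta['ID']: meta}
# ...but appears valueless (Flag-like) on the variant line.
errors = check_info_column(declared, 'NS=3;DP=14;AF=0.5;HOMSEQ;DB;STRVEC=a,b')
```

No grammar, regular or context-free, can express this cross-reference, which is why it has to live in a separate validation pass.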
At the end of this comment, you can find a lex program that accepts VCF lines. You may save the code as vcf.l, then compile and run it with:
lex vcf.l; gcc lex.yy.c; ./a.out < input.vcf
It will report the errors that the regular grammar can identify. You can probably make a good guess at the lex syntax: a lex program defines sub-expressions and uses them in longer regular expressions later on. The current grammar is too permissive and does not fully conform to the VCF spec. We may consider adapting it to make it stricter, within the limits of a regular grammar.
%{
#include <stdio.h>
int lineno = 0;
%}
sign \+|\-
int [0-9]+
float ({int}?"."{int}([eE]{sign}?{int})?)|({int}[eE]{sign}?{int})
str [A-Za-z0-9_:;/\|\.\-,"']+
sym [A-Za-z_][A-Za-z0-9_\.]*
base [ACGTN]
seq {base}+
allele {seq}|<{sym}>
keyval ({sym}=[A-Za-z0-9_:\|,\.\-"']+)|{sym}
chrom {str}
pos {int}
id {str}
ref {seq}
alt {allele}(,{allele})*
qual {int}|{float}|\.
filter \.|({sym}(;{sym})*)
info \.|({keyval}(;{keyval})*)
format {sym}(:{sym})*
sample {str}
bline {chrom}\t{pos}\t{id}\t{ref}\t{alt}\t{qual}\t{filter}\t{info}(\t{format}(\t{sample})+)?
lline #CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO(\tFORMAT(\t{str})+)?
def <.*>
miscdef [^<>\n]+
dline ##{sym}=({def}|{miscdef})
%%
^{dline}\n ++lineno; return 1;
^{lline}\n ++lineno; return 2;
^{bline}\n ++lineno; return 3;
^.*\n ++lineno; fprintf(stderr, "Parse error at line %d:\n%s", lineno, yytext); return -1;
%%
int yywrap(void) { return 1; }
int main(void)
{
while (yylex() != 0);
return 0;
}
The fact that a grammar/regular language exists doesn't mean that all tools will have to implement it as such. But having it described in the spec will help to:
Those who need to support non-compliant files would at least have the chance to implement some checks in a way that is still compatible with other tools. Removing the differences between htslib/htsjdk that @droazen cites could be really beneficial for the community, even when considered just from a user perspective.
If any of you want to explore alternatives to the traditional lex/yacc combination, my team has been using Ragel for a while to implement a VCF validator (the need for an unambiguous definition is obvious there), and we are quite happy with the result. It doesn't report context-based errors, but a separate "semantic validation" component can handle those.
I think we've gotten pretty far, as you can see in: https://github.com/EBIvariation/vcf-validator/blob/feature/multi-vcf-versions/src/vcf/vcf_v42.ragel#L506 (v4.3 is under construction). It gets a bit messy with all the per-subfield error notifications, but for general purpose tools this level of detail should not be necessary.
We could contribute it in a less implementation-dependent way and use it as a starting point for discussion. During the design process we already detected several ambiguities, which led to the work done for #123 and #130.
Is there any progress or update on the topic?
As mentioned in #88, I have been working on a VCF grammar, tested against 4.1 and 4.2 files. It may be a bit more permissive than what #88 describes, because some important datasets have defined "creative" IDs with characters that were soon widely used as separators (like underscore and dot). This doesn't break anything else in the parsing process.
Apart from that, the most complicated part has been specifying meta values for entries that are almost free text (see "meta_single_value", "meta_field_value" and "meta_field_desc"). That section can probably be vastly improved. Some rules may also look a bit weird because they were written for a tool that requires priorities to be clearly defined. But I still think it can be a nice starting point :smile:
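As an illustration of how small the strict/permissive gap can be, here are two ID character classes as plain regexes (illustrative only, not the grammar being contributed, and `esv3586_alt.1` is a made-up "creative" ID): the permissive variant just widens the character class to admit underscore and dot.

```python
import re

# Two hypothetical ID patterns (illustrative only): a strict alphanumeric
# form, and a permissive one that also admits the underscore and dot
# separators found in real-world datasets.
STRICT_ID = re.compile(r'[A-Za-z][A-Za-z0-9]*\Z')
PERMISSIVE_ID = re.compile(r'[A-Za-z_][A-Za-z0-9_.]*\Z')

plain, creative = 'rs6054257', 'esv3586_alt.1'
strict_ok = [bool(STRICT_ID.match(s)) for s in (plain, creative)]
permissive_ok = [bool(PERMISSIVE_ID.match(s)) for s in (plain, creative)]
```

The strict pattern accepts the plain ID but rejects the "creative" one; the permissive pattern accepts both, which is the trade-off described above.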
Formatting it inside a TeX document is a bit of a nightmare, so I hope this is fine for an initial approach. Any comments are welcome.