VCF grammar prototype - Githubissues

As mentioned in #88, I have been working on a VCF grammar, tested against 4.1 and 4.2 files. It may be a bit more permissive than what #88 describes because some important datasets have defined "creative" IDs, with characters that soon were widely used as separators (like underscore and dot). This doesn't break anything else in the parsing process.

Apart from that, the most complicated part has been specifying meta values for entries that are almost free text (see "meta_single_value", "meta_field_value" and "meta_field_desc"). That section can be probably vastly improved. Some rules may also look a bit weird because they were written for a tool that requires priorities to be clearly defined. But I still think it can be a nice starting point :smile:

Formatting it inside a TeX document is a bit of nightmare, so I hope this is fine for an initial approach. Any comments will be welcome.

#########################################################################################
# Main document structure: fileformat, then optional meta, header, then optional records
#########################################################################################

vcf                     := fileformat
                           (NL meta_entry)*
                            NL header
                           (NL record)* 
                           (NL)*;

#########################################################################################
# Fileformat line: sequence of alphanumeric and/or punctuation characters (like VCFv4.1)
#########################################################################################
fileformat              := '##fileformat=' (alnum | punct)+ ;

#########################################################################################
# Meta-data section: has some predefined entry types, but custom ones are also acceptable
#########################################################################################
meta_entry              := '##' ( 
                            'ALT=<'         meta_alt            '>'     |
                            'FILTER=<'      meta_filter         '>'     |
                            'FORMAT=<'      meta_format         '>'     |
                            'INFO=<'        meta_info           '>'     |
                            'contig=<'      meta_contig         '>'     |
                            'SAMPLE=<'      meta_sample         '>'     |
                            'PEDIGREE=<'    meta_pedigree       '>'     |
                            'pedigreeDB=<'  meta_pedigreeDB     '>'     |
                            'assembly='     meta_assembly               |
                            meta_typeid '=' 
                              (
                                ('<' meta_field (',' meta_field)* '>')  |
                                ('"' meta_field_desc '"')               |
                                ('<' meta_field_desc '>')               |
                                meta_single_value
                              )
                          ) ;

meta_alt                := 'ID=' identifier_alt ',Description="' meta_field_desc '"'
                           (',' identifier '="' meta_field_desc '"')* ;

meta_assembly           := url ;

meta_contig             := 'ID=' chromosome_basic (',' identifier '=' meta_single_values)* ;

meta_filter             := 'ID=' identifier ',Description="' meta_field_desc '"' 
                           (',' identifier '="' meta_field_desc '"')* ;

meta_format             := 'ID=' identifier 
                           ',Number=' meta_number  
                           ',Type=' meta_format_type 
                           ',Description="' meta_field_desc '"' 
                           (',' identifier  '="' meta_field_desc '"')* ;

meta_info               := 'ID=' identifier 
                           ',Number=' meta_number 
                           ',Type=' meta_info_type
                           ',Description="' meta_field_desc '"'
                           (',' identifier '="' meta_field_desc '"')* ;

meta_pedigree           := identifier '=' identifier 
                           (',' identifier '=' identifier)* ;

meta_pedigreeDB         := url ;

meta_sample             := 'ID=' identifier 
                           ',Genomes=' meta_field_value
                           ',Mixture=' meta_field_value
                           ',Description="' meta_field_desc '"' ;

meta_typeid             := (print - '=')+ - 
                           ('ALT' | 'FILTER' | 'FORMAT' | 'INFO' | 'assembly' | 
                            'contig' | 'SAMPLE' | 'PEDIGREE' | 'pedigreeDB') ;
meta_key                := alnum (alnum | '.' | '_' | '-')+ ;

meta_field              := meta_key '=' meta_values ;
meta_values             := ('"' meta_field_desc '"') | meta_field_value ;

# A value in a meta-data entry can be expressed in multiple ways:
# - The line is a single key-value pair (meta_single_value): the value cannot start 
#   with an angle bracket, quote or linebreak, but it is free text thereafter
# - One of multiple key-value pairs, non-quoted (meta_field_value): the value cannot 
#   contain a comma -end of field-, or a closing angle bracket -end of entry-, or quote
# - One of multiple key-value pairs, quoted (meta_field_desc)
meta_single_value       := ((print - ('<' | '"' | NL)) (print)*) ;
meta_field_value        := (print - (',' | '>' | '"'))+ ;
meta_field_desc         := (print)+ ;

meta_number             := ((digit)+ | 'A' | 'R' | 'G' | '.') ;
meta_format_type        := 'Integer' | 'Float' | 'Character' | 'String' ;
meta_info_type          := 'Integer' | 'Float' | 'Flag' | 'Character' | 'String' ;

#########################################################################################
# Header line between meta and records
#########################################################################################
sample_name             := (alnum | punct)+ ;
header                  := '#CHROM' CS 'POS' CS 'ID' CS 'REF' CS 'ALT' CS 'QUAL' 
                           CS 'FILTER' CS 'INFO' (CS 'FORMAT' (CS sample_name)+ )? ;

#########################################################################################
# Records
#########################################################################################
record                  := record_chromosome CS record_position CS record_id 
                           CS record_reference CS record_alt CS record_qual 
                           CS record_filter CS record_info 
                           (CS record_format (CS record_sample)+)? ;

# A chromosome must be a string with no white-spaces or colons, 
# and may be surrounded by < > symbols (for contigs)
chromosome_basic        := alnum (alnum | punct - (':' | '<' | '>' | ','))* ;
chromosome_contig       := '<' chromosome_basic '>' ;
chromosome              := chromosome_basic | chromosome_contig ;

position                := int_number ;

record_chromosome       := chromosome ;

record_position         := position ;

# ID must be a (list of) string with no white-spaces or semi-colons
record_id_value         := (print - (space | ';'))+ ;
record_id               := record_id_value (';' record_id_value)* | '.' ;

record_reference        := bases ;

# Main alternate allele rule
record_alt              := record_alt_value (',' record_alt_value)* | '.' ;
record_alt_value        := record_alt_snv | 
                           record_alt_indel | 
                           record_alt_sv | 
                           record_alt_other ;

record_alt_snv          := (bases)+ ;

# Indel alternates can be represented by standardized prefixes or an asterisk
record_alt_indel        := '<DEL>' | '<INS>' | '<DUP>' | '<INV>' | '<CNV>' | 
                           '<DUP:TANDEM>' | '<DEL:ME:' (alnum)+ '>' | 
                           '<INS:ME:' (alnum)+ '>' | '*';

# Structural variants follow forms like:
# ]1:1234]ATG or ]<contig_1>:1234]ATG : paired breakends
# .AGT, AGT.: single breakends
record_alt_sv           := ']' chromosome ':' position ']' (bases)+ |
                           '[' chromosome ':' position '[' (bases)+ |
                           (bases)+ ']' chromosome ':' position ']' |
                           (bases)+ '[' chromosome ':' position '[' | 
                           '.' bases | 
                           bases '.' ;

# Custom alternates can be any identifier surrounded by < > symbols
record_alt_other        := ('<' identifier_alt '>') - 
                           ('<DEL>' | '<INS>' | '<DUP>' | '<INV>' | '<CNV>' | 
                           '<DUP:TANDEM>' | '<DEL:ME:' (alnum)+ '>' | 
                           '<INS:ME:' (alnum)+ '>');

record_qual             := fp_number | '.' ;

record_filter           := (filter_value (';' filter_value)*) | '.' ;
filter_value            := (alnum | punct - ';')+ - (punct)+ ;

record_info             := info_entry (';' info_entry)* | '.' ;
info_entry              := 'AA='       (bases | '.' | '-')             |
                           'AC='       int_number (',' int_number)*    |
                           'AF='       fp_number (',' fp_number)*      |
                           'AN='       int_number                      |
                           'BQ='       fp_number                       |
                           'CIGAR='    (alnum)+                        |
                           'DB'        ('=' ('1' | '0'))?              |
                           'DP='       int_number                      |
                           'END='      int_number                      |
                           'H2'        ('=' ('1' | '0'))?              | 
                           'H3'        ('=' ('1' | '0'))?              |
                           'MQ='       fp_number                       |
                           'MQ0='      int_number                      |
                           'NS='       int_number                      |
                           'SB='       fp_number                       |
                           'SOMATIC'   ('=' ('1' | '0'))?              |
                           'VALIDATED' ('=' ('1' | '0'))?              |
                           '1000G'     ('=' ('1' | '0'))?              |
                           info_key                                    | 
                           info_key '=' info_value_list ;

# Custom INFO keys may contain punctuation symbols, but not only those
info_key                := alnum (alnum | (punct - (';' | '=')))+ - 
                           ( 'AA' | 'AC' | 'AF' | 'AN' | 'BQ' | 'CIGAR' | 'DB' | 
                           'DP' | 'END' | 'H2' | 'H3' | 'MQ' | 'MQ0' | 'NS' | 'SB' | 
                           'SOMATIC' | 'VALIDATED' | '1000G' ) ;
info_value_list         := info_value (',' info_value)* ;
info_value              := (print - (space | ';'))+ ;

# Accepting non-alphanumeric characters is an addition in v4.3, 
# but widely used in already existing files
record_format           := format_value (':' format_value)* ;
format_value            := alpha (alnum | '_' | '.')* ;

# In a sample, the genotype is mandatory and must be the first field
record_sample           := sample_gt (':' sample_value)* ;
sample_gt               := (sample_allele (('/' | '|') sample_allele)*) ;
sample_allele           := (digit)+ | '.' ;
sample_value            := (alnum | (punct - ':'))+ | '.' ;

#########################################################################################
# Representations for basic types and elements (separators, numbers, etc)
#########################################################################################
NL                      := '\n' | '\n\r' | '\r' ;
CS                      := '\t' ;

alnum                   := (alpha | digit)+ ;

int_number              := ('+'|'-')? (digit)+ ;
fp_number               := ('+'|'-')? (digit)+ ('.' (digit)+)? 
                           (('e'|'E') ('+'|'-')? (digit)+)? ;            

# Bases are case-insensitive
bases                   := ('A' | 'C' | 'G' | 'T' | 'N' | 'a' | 'c' | 't' | 'g' | 'n' )+ ;

# Identifier to use as keys in meta entries
identifier              := (alnum | '.' | '_' | '-' )+ - (punct)+ ;
# Identifier to use in ALT meta entries or in the ALT column
identifier_alt          := (alnum | ':' | '_' )+       - (punct)+ ;
## TODO Identifiers could contain some symbols, but which ones exactly? 
## Some widely used are dot, underscore and dash

# URL inspired in 
# http://stackoverflow.com/questions/8784903/failed-to-convert-url-parser-regular-expression-to-ragel
url                     := (scheme '://' authority path ('?' query)? ('#' fragment)?) ;
scheme                  := (alpha (any - [:/?#] - '\n')+ ) ;
authority               := (alpha (any - [/?#] - '\n')* ) ;
path                    := (any - [?#] - '\n')* ;
query                   := (any - [#] - '\n')* ;
fragment                := (any - '\n')* ;

I think we should revive this effort! Having a formal grammar for the VCF (and BAM/SAM) specs would solve many of the issues we've historically had with ambiguities in the wording of the spec documents, and eliminate the need to update our parsers for each version of the spec (since we could generate our parsers directly from the grammars).

@jmarshall @cyenyxe @mp15 @vadimzalunin @lbergelson @lh3 @akiezun @yfarjoun @tfenne What do you think?

Using grammars to define the file formats would be a huge improvement over our current state. It would make it much easier for different tools to offer compatible support of the spec, would make it easier to support multiple versions of the spec simultaneously, and would avoid a lot of ambiguities. ( Although it would be more complicated to make sure that a more human readable version of the spec actually matches the grammar of the spec.)

:+1: I would like to maintain a plaintext version. In the case of any ambiguity we would take the grammar to be correct.

@lbergelson's note is worth noting since I am worried that there will be more work to do maintaining and I'm not sure that the benefits are that great.

On Fri, Jun 10, 2016 at 1:19 PM, ldgauthier notifications@github.com wrote:

👍 I would like to maintain a plaintext version. In the case of any ambiguity we would take the grammar to be correct.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/93#issuecomment-225242558, or mute the thread https://github.com/notifications/unsubscribe/ACnk0h9-iBa7kAKu0g2YrRyRIv8qrGawks5qKZyUgaJpZM4FK5Q1 .

The benefits as I see them are:

Reduced ambiguitiy in the spec itself
Potentially easier updates as the spec changes
Consistency of parsing across languages: We could automatically generate identical parsers for c, c++, java, python, etc automatically with minimal effort.
Ease of generating separate parsers for different versions of the spec
Identifying non-compliant files would be easier.

The drawbacks:

We have to define an unambiguous spec which may not be trivial
Increased complexity due to the need to understand the grammar, parser generator, etc
Automatically generated parsers may be slower than what we're using now
It may be non-trivial to connect the output of a generated parser to the various internal data structures htsjdk and htslib uses.
Possibility of differences in a human readable spec description and the actual computer readable spec.
Supporting non-compliant files may be harder.
Change is scary and hard.

@yfarjoun The benefits of being able to auto-generate a parser from the latest spec, and ensure consistency across different implementations (eg., htslib vs. htsjdk) are so great that I think it may be worth the up-front effort (which admittedly would be steep).

Reading Louis's list of drawbacks is scary. I am not sure at all that I agree with @droazen's assessment of worth...

On Fri, Jun 10, 2016 at 8:42 PM, droazen notifications@github.com wrote:

@yfarjoun https://github.com/yfarjoun The benefits of being able to auto-generate a parser from the latest spec, and ensure consistency across different implementations (eg., htslib vs. htsjdk) are so great that I think it may be worth the up-front effort (which admittedly would be steep).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/93#issuecomment-225326210, or mute the thread https://github.com/notifications/unsubscribe/ACnk0sSF0ChmdYJJv9QkPip_uxSpA4vmks5qKgRwgaJpZM4FK5Q1 .

Consider the status quo, though:

-htsjdk does not support the latest VCF and BCF specs -backwards compatibility with previous versions of the specs is spotty at best -htsjdk and htslib parsers behave differently in many cases (esp. corner cases) -legal values for most tokens are poorly defined (or not defined at all), and when rules do exist they are often spottily enforced -each time a new version of the spec is released, developers across multiple projects must identify the changes (which itself is not always easy, given the textual format of the spec), implement them manually, determine whether older versions of the files can still be parsed correctly, etc. -those involved in drafting the specs spend a lot of time debating things like should vs. may vs. must, dealing with wording that could be misinterpreted or does not reflect reality, etc., when there are mechanisms available to define all these things precisely

What about strict versus non-strict validation? will you have a different grammer for the different strictness?

On Fri, Jun 10, 2016 at 10:16 PM, droazen notifications@github.com wrote:

Consider the status quo, though:

-htsjdk does not support the latest VCF and BCF specs -backwards compatibility with previous versions of the specs is spotty at best -htsjdk and htslib parsers behave differently in many cases (esp. corner cases) -legal values for most tokens are poorly defined (or not defined at all), and when rules do exist they are often spottily enforced -each time a new version of the spec is released, developers across multiple projects must identify the changes (which itself is not always easy, given the textual format of the spec), implement them manually, determine whether older versions of the files can still be parsed correctly, etc. -those involved in drafting the specs spend a lot of time debating things like should vs. may vs. must, dealing with wording that could be misinterpreted or does not reflect reality, etc., when there are mechanisms available to define all these things precisely

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/samtools/hts-specs/issues/93#issuecomment-225331403, or mute the thread https://github.com/notifications/unsubscribe/ACnk0or7Ulorz58r36z7xT6CvWEMOsWjks5qKhqJgaJpZM4FK5Q1 .

I think strictly defining the allowed characters of each field is a good thing, if this has not been done. However, a full-blown grammar is overkilling and actually not very useful. A grammar will still leave uncertainties. For example, in one version of 1000g VCF, there were lines like these:

##INFO=<ID=HOMSEQ,Number=.,Type=String>
20  14370   rs6054257   G   A   29.1    .   NS=3;DP=14;AF=0.5;HOMSEQ;DB;STRVEC=a,b  GT:GQ:DP:HQ:CNL     1|0:48:8:49,51:.

Notably, HOMSEQ is a variable length array, but on the variant line, it appears to be a Flag type. There is no way for a context-free grammar to resolve this ambiguity. And as we can't fully describe VCF with grammar alone anyway, my preference is to describe VCF purely with regular language (i.e. the language expressed using regular expressions) which can identify invalid VCFs within the limit of regular grammar.

At the end of this comment, you can find a lex program that accepts VCF lines. You may save the code as vcf.l, compile and run it with:

lex vcf.l; gcc lex.yy.c; ./a.out < input.vcf

It will report errors the regular grammar can identify. You can probably make a good guess about the lex syntax: a lex program defines sub-expressions and uses them in longer regular expressions later on. The current grammar is too permissive and does not conform to the VCF spec. We may consider to adapt it and to make it stricter, within the limit of regular grammar.

%{
#include <stdio.h>
int lineno = 0;
%}

sign    \+|\-
int     [0-9]+
float   ({int}?"."{int}([eE]{sign}?{int})?)|({int}[eE]{sign}?{int})
str     [A-Za-z0-9_:;/\|\.\-,"']+
sym     [A-Za-z_][A-Za-z0-9_\.]*
base    [ACGTN]
seq     {base}+
allele  {seq}|<{sym}>
keyval  ({sym}=[A-Za-z0-9_:\|,\.\-"']+)|{sym}

chrom   {str}
pos     {int}
id      {str}
ref     {seq}
alt     {allele}(,{allele})*
qual    {int}|{float}|\.
filter  \.|({sym}(;{sym})*)
info    \.|({keyval}(;{keyval})*)
format  {sym}(:{sym})*
sample  {str}

bline   {chrom}\t{pos}\t{id}\t{ref}\t{alt}\t{qual}\t{filter}\t{info}(\t{format}(\t{sample})+)?
lline   #CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO(\tFORMAT(\t{str})+)?

def     <.*>
miscdef [^<>\n]+

dline   ##{sym}=({def}|{miscdef})

%%

^{dline}\n  ++lineno; return 1;
^{lline}\n  ++lineno; return 2;
^{bline}\n  ++lineno; return 3;
^.*\n       ++lineno; fprintf(stderr, "Parse error at line %d:\n%s", lineno, yytext); return -1;

%%

int yywrap(void) { return 1; }

int main(void)
{
    while (yylex() != 0);
    return 0;
}

The fact that a grammar/regular language exists doesn't mean that all tools will have to implement it as such. But having it described in the spec will help to:

Define the allowed characters for each field, as @lh3 mentions
Spot cases that could break any type of parser due to ambiguity, including existing ones
Improve compatibility at full compliance level.

Those who need to support non-compliant files would at least have the chance to implement some checks in a way that is still compatible with other tools. Removing the differences between htslib/htsjdk that @droazen cites could be really beneficial for the community, even when considered just from a user perspective.

If any of you want to explore alternatives to the traditional lex/yacc combination, my team has been using Ragel for a while to implement a VCF validator (the need for a non-ambiguous definition is obvious here), and we are quite happy with the result. This doesn't report context-based errors, but a "semantic validation" component can do it.

I think we've gotten pretty far, as you can see in: https://github.com/EBIvariation/vcf-validator/blob/feature/multi-vcf-versions/src/vcf/vcf_v42.ragel#L506 (v4.3 is under construction). It gets a bit messy with all the per-subfield error notifications, but for general purpose tools this level of detail should not be necessary.

We could contribute it in a less implementation-dependant way, and use it as a starting point for discussion. During the design process we already detected several ambiguities that led to the work done for #123 and #130.

Is there any progress or update on the topic?

samtools / hts-specs

VCF grammar prototype #93