kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io

Formal expression language description? #775

Open Mingun opened 4 years ago

Mingun commented 4 years ago

I do not see one on https://kaitai.io. Do I understand correctly that the expression parser is defined in https://github.com/kaitai-io/kaitai_struct_compiler/blob/0f32f3734dad0039dffb2275d38612eb779689ec/shared/src/main/scala/io/kaitai/struct/exprlang/Expressions.scala?

dgelessus commented 4 years ago

The main documentation for the expression language is in the user guide. It's not a full formal specification (there's no syntax grammar for example), but it's quite detailed and explains almost every feature of the expression language.

Mingun commented 4 years ago

I inferred a formal PEG specification from the Scala parser, but I could not find how whitespace is handled. The corresponding rule in the parser is commented out, yet the parser accepts spaces. How does it do that?

This specification follows the syntax of my fork of the pegjs project (it uses a range syntax that is missing from the original project).

It differs slightly from the Scala parser for better readability.
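
For readers unfamiliar with the range syntax: `item|min..max, sep|` matches between `min` and `max` occurrences of `item` separated by `sep`, so `name|1.., "::"|` is equivalent to `name ("::" name)*`. The Python sketch below models that behaviour, with regular expressions standing in for grammar rules. `repeat_sep`, `NAME` and `SEP` are hypothetical names used only for illustration; they are not part of pegjs or the Kaitai Struct compiler.

```python
import re

# Stand-ins for the `name` rule and the "::" separator of the grammar.
NAME = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")
SEP = re.compile(r"::")

def repeat_sep(item, sep, text, pos=0, min_count=1, max_count=None):
    """Model of `item|min..max, sep|`: items separated by `sep`.

    Returns (matched_items, new_pos), or None if fewer than
    `min_count` items could be matched.
    """
    results = []
    m = item.match(text, pos)
    while m is not None:
        results.append(m.group())
        pos = m.end()
        if max_count is not None and len(results) == max_count:
            break
        s = sep.match(text, pos)
        if s is None:
            break
        # The separator is only consumed if another item follows it.
        m = item.match(text, s.end())
    if len(results) < min_count:
        return None
    return results, pos
```

With these definitions, `repeat_sep(NAME, SEP, "a::b::c")` collects all three names, while `repeat_sep(NAME, SEP, "abc", min_count=2)` fails, which mirrors the `name|2.., "::"|` requirement of `enumName`.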

// Entry point for
// - `size`, `if`, `parent`, `value`, `pos`, `io`, `repeat-expr`, `repeat-until`, `switch-on`, `process: xor|zlib|rol|ror`
// - `type.cases` keys (O_o)
// - `min`, `max`, `expr` in new `valid` key
topExpr = expr EOF;

// Entry point for `process: custom(arg1, arg2, ...)` and user parametrized types (with `params`) keys.
// Parses arguments of function/user type
topExprList = expr|1.., ","| EOF;

// Whitespaces
//_ = ([ \n]+ / "\\\n")*

EOF = !.;

string
  = "'" [^']* "'"
  / '"' ([^\\"]+ / escaped)* '"'
  ;
escaped = "\\" (quotedChar / quotedOct / quotedHex);
quotedChar = [abtnvfre'"\\];// characters that can be escaped by backslash
quotedOct  = oct+;
quotedHex  = "u" hex|4|;

digit = [0-9];

integer
  = [1-9] (digit / "_")*
  / "0" [oO] oct+
  / "0" [xX] hex+
  / "0" [bB] bin+
  / "0"
  ;
oct = "_" / [0-7];
bin = "_" / [01];
hex = "_" / digit / [a-fA-F];

float
  = digit+ exponent   // Ex.: 4E2, 4E+2, 4e-2
  / fixed exponent?   // Ex.: 4.E2, .4e+2, 4.2e-0
  ;
fixed
  = digit* "." digit+ // Ex.: 4.2, .42
  / digit+ "."        // Ex.: 42.
  ;
exponent = [eE] [+-]? digit+;

//-------------------------------------------------------------------------------------------------

name = nameStart namePart*;
nameStart = [a-zA-Z_];
namePart  = nameStart / digit;

typeName = "::"? name|1.., "::"| ("[" "]")?;// Ex.: xyz, ::abc::def, array[]
enumName = "::"? name|2.., "::"|;           // Ex.: enum::value, ::root::type::enum::value

//-------------------------------------------------------------------------------------------------

OR  = "or"  !namePart;
AND = "and" !namePart;
NOT = "not" !namePart;

expr     = or_test ("?" expr ":" expr)?;
or_test  = and_test|1.., OR |;
and_test = not_test|1.., AND|;

not_test
  = NOT not_test
  / or_expr (comp_op or_expr)?
  ;

comp_op
  = "=="
  / "!="
  / "<>"
  / "<="
  / ">="
  / "<"
  / ">"
  ;

or_expr    = xor_expr  |1.., "|"          |;
xor_expr   = and_expr  |1.., "^"          |;
and_expr   = shift_expr|1.., "&"          |;
shift_expr = arith_expr|1.., ("<<" / ">>")|;
arith_expr = term      |1.., [+-]         |;
term       = factor    |1.., [*/%]        |;

factor
  = "+" factor
  / "-" factor
  / "~" factor  // bitwise negation
  / atom postfix*
  ;

atom
  = "(" expr ")"
  / "[" list? "]"
  / "sizeof" "<" typeName ">"
  / "bitsizeof" "<" typeName ">"
  / enumName
  / name
  / string+ // multiple adjacent strings are concatenated
  / float
  / integer
  ;

postfix
  = "(" args ")"              // call
  / "[" expr "]"              // indexing
  / "." "as" "<" typeName ">" // type cast
  / "." name                  // attribute access
  ;

list = expr|1.., ","| ","?;
args = expr| .., ","|;
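
As a quick sanity check of the `integer` and `float` rules above, they can be transcribed into Python regular expressions. This is only a sketch of the grammar as written here, not of the Scala parser itself; `classify` and the constant names are hypothetical. It matches whole strings with `fullmatch`, whereas a PEG rule would match a prefix and leave the rest to the caller.

```python
import re

# integer: decimal / octal / hex / binary / bare "0", underscores allowed
INTEGER = re.compile(
    r"[1-9][0-9_]*|0[oO][0-7_]+|0[xX][0-9a-fA-F_]+|0[bB][01_]+|0"
)

# exponent = [eE] [+-]? digit+
EXPONENT = r"[eE][+-]?[0-9]+"
# fixed = digit* "." digit+ / digit+ "."
FIXED = r"(?:[0-9]*\.[0-9]+|[0-9]+\.)"
# float = digit+ exponent / fixed exponent?
FLOAT = re.compile(rf"[0-9]+{EXPONENT}|{FIXED}(?:{EXPONENT})?")

def classify(literal):
    """Return "float", "integer" or None for a whole literal string.

    `float` is tried first, in the same order as the `atom` rule.
    """
    if FLOAT.fullmatch(literal):
        return "float"
    if INTEGER.fullmatch(literal):
        return "integer"
    return None
```

All the examples in the grammar comments (`4E2`, `4.E2`, `.4e+2`, `42.` and so on) classify as floats, and `42` falls through to `integer`. The sketch also shows that, as transcribed, `0x_` is accepted as an integer, because `hex` includes `"_"`.
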
GreyCat commented 4 years ago

I agree that the only formal specification available right now is the reference compiler's source code, which is obviously bad. Kudos for the effort of transcribing this to PEG!

For whitespace, as far as I can tell, it is handled by magic in FastParse: https://www.lihaoyi.com/fastparse/#WhitespaceHandling

Essentially, it injects the possibility of whitespace between every two consecutive literals. This "whitespace" also consumes Python-style comments, if my memory serves.
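
The behaviour described above can be modelled with a small Python sketch in which the sequence combinator skips whitespace between its elements. This is not FastParse's API or implementation; `lit`, `regex`, `seq` and `skip_ws` are hypothetical helpers. The definition of whitespace is an assumption pieced together from the commented-out `_` rule and the recollection about `#` comments.

```python
import re

# Assumption: whitespace = spaces/newlines, backslash-newline continuations
# (from the commented-out `_` rule) and `#` comments (unverified recollection).
WS = re.compile(r"(?:[ \n]+|\\\n|#[^\n]*)*")

def skip_ws(text, pos):
    """Advance past any whitespace starting at `pos`."""
    return WS.match(text, pos).end()

def lit(s):
    """Parser for an exact literal; returns the new position or None."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def regex(pattern):
    """Parser for a regular expression; returns the new position or None."""
    compiled = re.compile(pattern)
    def parse(text, pos):
        m = compiled.match(text, pos)
        return m.end() if m else None
    return parse

def seq(*parsers):
    """Sequence that silently skips whitespace between consecutive parsers."""
    def parse(text, pos):
        for i, p in enumerate(parsers):
            if i > 0:
                pos = skip_ws(text, pos)
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

# Model of the `"sizeof" "<" typeName ">"` alternative (simplified typeName).
sizeof_expr = seq(lit("sizeof"), lit("<"), regex(r"[a-zA-Z_][a-zA-Z_0-9]*"), lit(">"))
```

Under this model, `sizeof<u4>` and `sizeof < u4 >` are both accepted without any explicit whitespace rule in the grammar, while `size of<u4>` is rejected because nothing is skipped inside a single literal.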