Update the terminals in grammar file with lower cases to workaround the bug of `lark-js`

jillyj commented 1 year ago

Is your feature request related to a problem? Please describe. When I generated the JS parser for the Kestrel grammar and then used it, the parser threw an "invalid regular expression" error. Please see the bug I opened in the lark-js repo: https://github.com/lark-parser/Lark.js/issues/21

Describe the solution you'd like The workaround is to change some terminals from upper case to lower case, which makes the generated parser work.
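For illustration only, the renaming could be done mechanically along these lines. This is a minimal sketch: `lowercase_terminals` is a hypothetical helper, and the grammar fragment and the names `DATE_STAMP` and `OTHER_TERMINAL` are made up, since the thread does not list which terminals were changed.

```python
import re
from typing import Iterable


def lowercase_terminals(grammar: str, names: Iterable[str]) -> str:
    """Lower-case each listed name wherever it appears as a whole word.

    Limitation: this is plain text substitution, so it would also touch a
    matching word inside a string literal or regex in the grammar.
    """
    names = list(names)
    if not names:
        return grammar
    # \b keeps DATE_STAMP from matching inside a longer name like DATE_STAMP_X.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: m.group(0).lower(), grammar)


# Made-up fragment; DATE_STAMP and OTHER_TERMINAL are illustrative names only.
SAMPLE = r"""
stamp: DATE_STAMP
DATE_STAMP: /\d{4}-\d{2}-\d{2}/
OTHER_TERMINAL: "x"
"""

converted = lowercase_terminals(SAMPLE, ["DATE_STAMP"])
print(converted)
```

Note that in Lark a lower-case name defines a rule rather than a terminal, so a rename like this also changes what the parse tree contains.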


jillyj commented 1 year ago

Here is the updated grammar, from which lark-js generates a correct parser.

//
// Kestrel Grammar (version 1.7.0 - 2023/06/15)
//

//
// A huntflow is a sequence of statements
//

start: statement*

statement: assignment
         | command_no_result

// If no VARIABLE is given, default to _ in post-parsing
// For assign or merge, the result variable is required
// This eliminates meaningless huntflows like `var1 var2 var3`
assignment: VARIABLE "=" expression
          | VARIABLE "=" VARIABLE ("+" VARIABLE)+
          | (VARIABLE "=")? command_with_result

// "?" at the beginning will inline command
?command_with_result: find
                    | get
                    | group
                    | join
                    | load
                    | new
                    | sort

?command_no_result: apply
                  | disp
                  | info
                  | save

//
// All commands
//

find: "FIND"i ENTITY_TYPE RELATION (REVERSED)? VARIABLE where_clause? timespan?

get: "GET"i ENTITY_TYPE ("FROM"i datasource)? where_clause timespan?

group: "GROUP"i VARIABLE BY grp_spec ("WITH"i agg_list)?

join: "JOIN"i VARIABLE "," VARIABLE (BY ATTRIBUTE "," ATTRIBUTE)?

load: "LOAD"i stdpath ("AS"i ENTITY_TYPE)?

new: "NEW"i ENTITY_TYPE? var_data

sort: "SORT"i VARIABLE BY ATTRIBUTE (ASC|DESC)?

apply: "APPLY"i analytics_uri "ON"i variables ("WITH"i args)?

disp: "DISP"i expression

info: "INFO"i VARIABLE

save: "SAVE"i VARIABLE "TO"i stdpath

//
// Variable definition
//

variables: VARIABLE ("," VARIABLE)*

VARIABLE: CNAME

//
// Expression
//

expression: vtrans where_clause? attr_clause? sort_clause? limit_clause? offset_clause?

// do not use the rule name `transform`, since it is a special function in Lark;
// a transformer function with that name would malfunction in `merge_transformers()`
vtrans: transformer "(" VARIABLE ")"
      | VARIABLE

transformer: TIMESTAMPED
           | ADDOBSID

TIMESTAMPED: "TIMESTAMPED"i

ADDOBSID: "ADDOBSID"i

where_clause: "WHERE"i ecg_pattern
attr_clause: "ATTR"i ATTRIBUTES
sort_clause: "SORT"i BY ATTRIBUTE (ASC|DESC)?
limit_clause: "LIMIT"i INT
offset_clause: "OFFSET"i INT

?ecg_pattern: disjunction
            | "[" disjunction "]" // STIX compatible

?disjunction: conjunction
            | disjunction "OR"i conjunction -> expression_or

?conjunction: comparison
            | conjunction "AND"i comparison -> expression_and

?comparison: comparison_std
           | comparison_null
           | "(" disjunction ")"

comparison_std:  ENTITY_ATTRIBUTE_PATH op      value
comparison_null: ENTITY_ATTRIBUTE_PATH null_op NULL

//
// Timespan
//

?timespan: "start"i timestamp "stop"i timestamp -> timespan_absolute
         | "last"i INT timeunit                 -> timespan_relative

?timeunit: day
         | hour
         | minute
         | second

day: "days"i | "day"i | "d"i
hour: "hours"i | "hour"i | "h"i
minute: "minutes"i | "minute"i | "m"i
second: "seconds"i | "second"i | "s"i

timestamp:       isotimestamp
         | "\""  isotimestamp "\""
         | "'"   isotimestamp "'"
         | "t\"" isotimestamp "\""
         | "t'"  isotimestamp "'"

isotimestamp: /\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d(\.\d+)?Z/

//
// FIND command constructs
//

RELATION: WORD

//
// GROUP command constructs
//

grp_spec: grp_expr ("," grp_expr)*

grp_expr: ATTRIBUTE
        | bin_func

bin_func: "BIN"i "(" ATTRIBUTE "," INT timeunit? ")"
// No other scalar funcs are supported yet

agg_list: agg ("," agg)*

agg: funcname "(" ATTRIBUTE ")" ("AS"i alias)?

?funcname: (MIN|MAX|SUM|AVG|COUNT|NUNIQUE)
MIN: "MIN"i
MAX: "MAX"i
SUM: "SUM"i
AVG: "AVG"i
COUNT: "COUNT"i
NUNIQUE: "NUNIQUE"i

?alias: ECNAME

//
// GET command constructs
//

datasource: DATASRC_SIMPLE
          | DATASRC_ESCAPED
          | VARIABLE

DATASRC_SIMPLE: PATH_SIMPLE ("," PATH_SIMPLE)*
DATASRC_ESCAPED: PATH_ESCAPED

//
// APPLY command constructs
//

analytics_uri: ANALYTICS_SIMPLE
             | ANALYTICS_ESCAPED

ANALYTICS_SIMPLE: PATH_SIMPLE
ANALYTICS_ESCAPED: PATH_ESCAPED

//
// Two-level JSON in command NEW
//

// use terminal to load the entire var_data without parsing into it
// add `WS*` since `%ignore WS` doesn't apply to spaces inside terminals
// https://github.com/lark-parser/lark/issues/99
var_data: "[" (RAW_VALUES | json_objs) "]"

RAW_VALUES: ESCAPED_STRING_WS ("," ESCAPED_STRING_WS)*

json_objs: json_obj ("," json_obj)*
json_obj: WS* "{" json_pair ("," json_pair)* "}" WS*
json_pair: ESCAPED_STRING_WS ":" json_value
json_value: WS* (NUMBER|ESCAPED_STRING|TRUE|FALSE|NULL) WS*

//
// Arguments
//

args: arg_kv_pair ("," arg_kv_pair)*

arg_kv_pair: ECNAME "=" value

//
// Shared keywords
//

BY: "BY"i
ASC: "ASC"i
DESC: "DESC"i
REVERSED: "BY"i
TRUE: "TRUE"i
FALSE: "FALSE"i
NULL: "NULL"i
IN: "IN"i
LIKE: "LIKE"i
MATCHES: "MATCHES"i
IS: "IS"i
NOT: "NOT"i
ISSUBSET: "ISSUBSET"i
ISSUPERSET: "ISSUPERSET"i

op: op_sign
  | (NOT WS+)? op_keyword

op_sign: "="
       | "=="
       | "!="
       | ">"
       | "<"
       | ">="
       | "<="

op_keyword: IN
          | LIKE
          | MATCHES
          | ISSUBSET
          | ISSUPERSET

null_op: IS (WS+ NOT)?

//
// Common language constructs
//

value: literal_list
     | literal

literal: reference_or_simple_string
       | string
       | number

literal_list: "(" literal ("," literal)* ")"
            | "[" literal ("," literal)* "]"

reference_or_simple_string: ECNAME ("." ATTRIBUTE)?

string: advanced_string

number: NUMBER

ENTITY_ATTRIBUTE_PATH: (ENTITY_TYPE ":")? ATTRIBUTE

ENTITY_TYPE: ECNAME

stdpath: PATH_SIMPLE
       | PATH_ESCAPED

// TODO: support attributes without quote for dash
//       x.hash.SHA-256 instead of x.hash.'SHA-256'
ATTRIBUTE: ECNAME "[*]"? ("." ECNAME_W_QUOTE)*
ATTRIBUTES: ATTRIBUTE (WS* "," WS* ATTRIBUTE)*

ECNAME: (LETTER|"_") (LETTER|DIGIT|"_"|"-")*
ECNAME_W_QUOTE: (LETTER|DIGIT|"_"|"-"|"'")+

PATH_SIMPLE: (ECNAME "://")? (LETTER|DIGIT|"_"|"-"|"."|"/")+

PATH_ESCAPED: "\"" (ECNAME "://")? _STRING_ESC_INNER "\""
            | "'"  (ECNAME "://")? _STRING_ESC_INNER "'"

ESCAPED_STRING: "\"" _STRING_ESC_INNER "\""
              | "'"  _STRING_ESC_INNER "'"
ESCAPED_STRING_WS: WS* ESCAPED_STRING WS*

SIMPLE_STRING: ECNAME

// nearly Python string, but no [ubf]? as prefix options
// check Lark example of Python parser for reference
advanced_string: /(r?)("(?!"").*?(?<!\\)(\\\\)*?"|'(?!'').*?(?<!\\)(\\\\)*?')/

%import common (LETTER, DIGIT, WS, INT, WORD, NUMBER, CNAME, _STRING_ESC_INNER)
%import common.SH_COMMENT -> COMMENT

%ignore WS
%ignore COMMENT

subbyte commented 1 year ago

Thanks for reporting this!

When a name is changed from upper case to lower case, the terminal becomes a rule (per Lark's definitions), so we probably need a little code in kestrel/syntax/parser.py to handle these new rules.

Could you open a PR on it with the updates?
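To illustrate the terminal-versus-rule point above: in Lark, an upper-case name defines a terminal, which appears in the parse tree as a token, while a lower-case name defines a rule, which appears as a subtree wrapping its tokens. The sketch below uses simplified stand-ins for Lark's `Token` and `Tree` (they are not the library's classes) and a hypothetical helper `text_of`. It is not the code in kestrel/syntax/parser.py, only a picture of the kind of handling that might be needed.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Token:
    """Stand-in for a parser token: a terminal name plus the matched text."""
    type: str
    value: str


@dataclass
class Tree:
    """Stand-in for a parse subtree: a rule name plus its children."""
    data: str
    children: List[Union["Tree", Token]] = field(default_factory=list)


def text_of(node: Union[Tree, Token]) -> str:
    """Return the matched text, whether the node is a token or a subtree."""
    if isinstance(node, Token):
        return node.value
    # A rule wraps its tokens, so gather the text of every child.
    return "".join(text_of(child) for child in node.children)


# The same input seen before and after a terminal is renamed to a rule.
# The token type "ANON" is a placeholder, not a name Lark is claimed to emit.
as_terminal = Token("ISOTIMESTAMP", "2023-06-15T00:00:00Z")
as_rule = Tree("isotimestamp", [Token("ANON", "2023-06-15T00:00:00Z")])

print(text_of(as_terminal) == text_of(as_rule))
```

Code that previously read the token directly would need a step like `text_of` once the name is lower-cased.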

jillyj commented 1 year ago

Here is the PR: https://github.com/opencybersecurityalliance/kestrel-lang/pull/372. Thanks~

pcoccoli commented 1 year ago

@jillyj please have a look at PR #378

opencybersecurityalliance / kestrel-lang

Update the terminals in grammar file with lower cases to workaround the bug of `lark-js` #371