opencybersecurityalliance / kestrel-lang

Kestrel threat hunting language: building reusable, composable, and shareable huntflows across different data sources and threat intel.
Apache License 2.0

Lowercase the terminals in the grammar file to work around a bug in `lark-js` #371

Closed jillyj closed 1 year ago

jillyj commented 1 year ago

Is your feature request related to a problem? Please describe.

The JS parser that lark-js generates from the Kestrel grammar throws an "invalid regular expression" error when I use it. Please refer to the bug I opened in the lark-js repo: https://github.com/lark-parser/Lark.js/issues/21

Describe the solution you'd like

The workaround is to change some terminals from upper case to lower case, which makes the generated parser work.
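
A minimal sketch of what such a renaming involves, in Python with only the standard library. The helper `lowercase_terminal`, the sample grammar fragment, and the terminal name `ISOTIMESTAMP` are assumptions for illustration and are not taken from the Kestrel code base; the sketch only rewrites whole-word occurrences of a name in grammar text.

```python
import re

def lowercase_terminal(grammar: str, name: str) -> str:
    """Rename every whole-word occurrence of `name` to its lower-case form."""
    # \b prevents touching longer names that merely contain `name`.
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    return pattern.sub(name.lower(), grammar)

# Hypothetical two-line grammar fragment, used only for illustration.
sample = r"""timestamp: ISOTIMESTAMP
ISOTIMESTAMP: /\d{4}-\d{2}-\d{2}/
"""

converted = lowercase_terminal(sample, "ISOTIMESTAMP")
print(converted)
```

In Lark, a terminal cannot reference a rule, so any terminal defined in terms of a renamed one would need to be converted as well.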


jillyj commented 1 year ago

Here is the updated grammar, from which lark-js can generate a correct parser.

//
// Kestrel Grammar (version 1.7.0 - 2023/06/15)
//

//
// A huntflow is a sequence of statements
//

start: statement*

statement: assignment
         | command_no_result

// If no VARIABLE is given, default to _ in post-parsing
// For assign or merge, the result variable is required
// This eliminates meaningless huntflows like `var1 var2 var3`
assignment: VARIABLE "=" expression
          | VARIABLE "=" VARIABLE ("+" VARIABLE)+
          | (VARIABLE "=")? command_with_result

// "?" at the beginning will inline command
?command_with_result: find
                    | get
                    | group
                    | join
                    | load
                    | new
                    | sort

?command_no_result: apply
                  | disp
                  | info
                  | save

//
// All commands
//

find: "FIND"i ENTITY_TYPE RELATION (REVERSED)? VARIABLE where_clause? timespan?

get: "GET"i ENTITY_TYPE ("FROM"i datasource)? where_clause timespan?

group: "GROUP"i VARIABLE BY grp_spec ("WITH"i agg_list)?

join: "JOIN"i VARIABLE "," VARIABLE (BY ATTRIBUTE "," ATTRIBUTE)?

load: "LOAD"i stdpath ("AS"i ENTITY_TYPE)?

new: "NEW"i ENTITY_TYPE? var_data

sort: "SORT"i VARIABLE BY ATTRIBUTE (ASC|DESC)?

apply: "APPLY"i analytics_uri "ON"i variables ("WITH"i args)?

disp: "DISP"i expression

info: "INFO"i VARIABLE

save: "SAVE"i VARIABLE "TO"i stdpath

//
// Variable definition
//

variables: VARIABLE ("," VARIABLE)*

VARIABLE: CNAME

//
// Expression
//

expression: vtrans where_clause? attr_clause? sort_clause? limit_clause? offset_clause?

// not use rule name `transform` since it is a special function in Lark
// the function in transformer will mal-function in `merge_transformers()`
vtrans: transformer "(" VARIABLE ")"
      | VARIABLE

transformer: TIMESTAMPED
           | ADDOBSID

TIMESTAMPED: "TIMESTAMPED"i

ADDOBSID: "ADDOBSID"i

where_clause: "WHERE"i ecg_pattern
attr_clause: "ATTR"i ATTRIBUTES
sort_clause: "SORT"i BY ATTRIBUTE (ASC|DESC)?
limit_clause: "LIMIT"i INT
offset_clause: "OFFSET"i INT

?ecg_pattern: disjunction
            | "[" disjunction "]" // STIX compatible

?disjunction: conjunction
            | disjunction "OR"i conjunction -> expression_or

?conjunction: comparison
            | conjunction "AND"i comparison -> expression_and

?comparison: comparison_std
           | comparison_null
           | "(" disjunction ")"

comparison_std:  ENTITY_ATTRIBUTE_PATH op      value
comparison_null: ENTITY_ATTRIBUTE_PATH null_op NULL

//
// Timespan
//

?timespan: "start"i timestamp "stop"i timestamp -> timespan_absolute
         | "last"i INT timeunit                 -> timespan_relative

?timeunit: day
         | hour
         | minute
         | second

day: "days"i | "day"i | "d"i
hour: "hours"i | "hour"i | "h"i
minute: "minutes"i | "minute"i | "m"i
second: "seconds"i | "second"i | "s"i

timestamp:       isotimestamp
         | "\""  isotimestamp "\""
         | "'"   isotimestamp "'"
         | "t\"" isotimestamp "\""
         | "t'"  isotimestamp "'"

isotimestamp: /\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d(\.\d+)?Z/

//
// FIND command constructs
//

RELATION: WORD

//
// GROUP command constructs
//

grp_spec: grp_expr ("," grp_expr)*

grp_expr: ATTRIBUTE
        | bin_func

bin_func: "BIN"i "(" ATTRIBUTE "," INT timeunit? ")"
// No other scalar funcs are supported yet

agg_list: agg ("," agg)*

agg: funcname "(" ATTRIBUTE ")" ("AS"i alias)?

?funcname: (MIN|MAX|SUM|AVG|COUNT|NUNIQUE)
MIN: "MIN"i
MAX: "MAX"i
SUM: "SUM"i
AVG: "AVG"i
COUNT: "COUNT"i
NUNIQUE: "NUNIQUE"i

?alias: ECNAME

//
// GET command constructs
//

datasource: DATASRC_SIMPLE
          | DATASRC_ESCAPED
          | VARIABLE

DATASRC_SIMPLE: PATH_SIMPLE ("," PATH_SIMPLE)*
DATASRC_ESCAPED: PATH_ESCAPED

//
// APPLY command constructs
//

analytics_uri: ANALYTICS_SIMPLE
             | ANALYTICS_ESCAPED

ANALYTICS_SIMPLE: PATH_SIMPLE
ANALYTICS_ESCAPED: PATH_ESCAPED

//
// Two-level JSON in command NEW
//

// use terminal to load the entire var_data without parsing into it
// add `WS*` since `%ignore WS` doesn't apply to spaces inside terminals
// https://github.com/lark-parser/lark/issues/99
var_data: "[" (RAW_VALUES | json_objs) "]"

RAW_VALUES: ESCAPED_STRING_WS ("," ESCAPED_STRING_WS)*

json_objs: json_obj ("," json_obj)*
json_obj: WS* "{" json_pair ("," json_pair)* "}" WS*
json_pair: ESCAPED_STRING_WS ":" json_value
json_value: WS* (NUMBER|ESCAPED_STRING|TRUE|FALSE|NULL) WS*

//
// Arguments
//

args: arg_kv_pair ("," arg_kv_pair)*

arg_kv_pair: ECNAME "=" value

//
// Shared keywords
//

BY: "BY"i
ASC: "ASC"i
DESC: "DESC"i
REVERSED: "BY"i
TRUE: "TRUE"i
FALSE: "FALSE"i
NULL: "NULL"i
IN: "IN"i
LIKE: "LIKE"i
MATCHES: "MATCHES"i
IS: "IS"i
NOT: "NOT"i
ISSUBSET: "ISSUBSET"i
ISSUPERSET: "ISSUPERSET"i

op: op_sign
  | (NOT WS+)? op_keyword

op_sign: "="
       | "=="
       | "!="
       | ">"
       | "<"
       | ">="
       | "<="

op_keyword: IN
          | LIKE
          | MATCHES
          | ISSUBSET
          | ISSUPERSET

null_op: IS (WS+ NOT)?

//
// Common language constructs
//

value: literal_list
     | literal

literal: reference_or_simple_string
       | string
       | number

literal_list: "(" literal ("," literal)* ")"
            | "[" literal ("," literal)* "]"

reference_or_simple_string: ECNAME ("." ATTRIBUTE)?

string: advanced_string

number: NUMBER

ENTITY_ATTRIBUTE_PATH: (ENTITY_TYPE ":")? ATTRIBUTE

ENTITY_TYPE: ECNAME

stdpath: PATH_SIMPLE
       | PATH_ESCAPED

// TODO: support attributes without quote for dash
//       x.hash.SHA-256 instead of x.hash.'SHA-256'
ATTRIBUTE: ECNAME "[*]"? ("." ECNAME_W_QUOTE)*
ATTRIBUTES: ATTRIBUTE (WS* "," WS* ATTRIBUTE)*

ECNAME: (LETTER|"_") (LETTER|DIGIT|"_"|"-")*
ECNAME_W_QUOTE: (LETTER|DIGIT|"_"|"-"|"'")+

PATH_SIMPLE: (ECNAME "://")? (LETTER|DIGIT|"_"|"-"|"."|"/")+

PATH_ESCAPED: "\"" (ECNAME "://")? _STRING_ESC_INNER "\""
            | "'"  (ECNAME "://")? _STRING_ESC_INNER "'"

ESCAPED_STRING: "\"" _STRING_ESC_INNER "\""
              | "'"  _STRING_ESC_INNER "'"
ESCAPED_STRING_WS: WS* ESCAPED_STRING WS*

SIMPLE_STRING: ECNAME

// nearly Python string, but no [ubf]? as prefix options
// check Lark example of Python parser for reference
advanced_string: /(r?)("(?!"").*?(?<!\\)(\\\\)*?"|'(?!'').*?(?<!\\)(\\\\)*?')/

%import common (LETTER, DIGIT, WS, INT, WORD, NUMBER, CNAME, _STRING_ESC_INNER)
%import common.SH_COMMENT -> COMMENT

%ignore WS
%ignore COMMENT

subbyte commented 1 year ago

Thanks for reporting this!

Changing a name from upper case to lower case turns a terminal into a rule (per Lark's definitions), so we will probably need a little code in kestrel/syntax/parser.py to handle the new rules.
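
To make the terminal/rule distinction concrete: Lark treats an upper-case definition name as a terminal and a lower-case one as a rule. The sketch below, using only the Python standard library, classifies the definitions in a grammar string by that convention, which may help list the names whose handling needs a second look. The function name `classify_definitions` and the sample grammar are assumptions for illustration; nothing here reproduces the code in kestrel/syntax/parser.py.

```python
import re

# A definition starts at the beginning of a line: an optional "?" or "!"
# prefix, a name, an optional ".<priority>", then ":".
DEFINITION = re.compile(
    r"^[?!]?([A-Za-z_][A-Za-z0-9_]*)(?:\.\d+)?[ \t]*:", re.MULTILINE
)

def classify_definitions(grammar: str) -> dict:
    """Split definition names into terminals (upper case) and rules (lower case)."""
    result = {"terminals": [], "rules": []}
    for name in DEFINITION.findall(grammar):
        kind = "terminals" if name.isupper() else "rules"
        result[kind].append(name)
    return result

# Small excerpt in the style of the grammar posted above (illustrative only).
sample = r'''
?timespan: "start"i timestamp "stop"i timestamp
timestamp: isotimestamp
isotimestamp: /\d{4}-[01]\d-[0-3]\dT[0-2]\d:[0-5]\d:[0-5]\d(\.\d+)?Z/
VARIABLE: CNAME
BY: "BY"i
'''

classified = classify_definitions(sample)
print(classified)
```

Running the same function on the old and new grammar files and comparing the two results would show which names moved from the terminal list to the rule list.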

Could you open a PR on it with the updates?

jillyj commented 1 year ago

Here is the PR: https://github.com/opencybersecurityalliance/kestrel-lang/pull/372. Thanks~

pcoccoli commented 1 year ago

@jillyj please have a look at PR #378