opencybersecurityalliance / kestrel-lang

Kestrel threat hunting language: building reusable, composable, and shareable huntflows across different data sources and threat intel.
Apache License 2.0

Attribute auto-complete #79

Open subbyte opened 3 years ago

subbyte commented 3 years ago

Is your feature request related to a problem? Please describe. The current auto-complete function scans variable names (in the session), data source names, and analytics interface names. It would be great if it could also auto-complete entity attributes (and entity names).

Advanced version: we may cache entity attributes in VarStruct for fast access.

vereimyst commented 1 year ago

Hello! I've been working on this issue for a bit now and thought it would be good to post an update on where I'm currently stuck. You can find what I have so far for the attribute autocompletion feature here.


Firstly, I've somewhat broken down the issue to try and simplify the steps of work. The plan would be to submit separate pull requests, each addressing one of the following cases:

For the time being, I've been focusing on Case 1.


Some questions I had whilst working on this issue thus far:

For Case 1 in particular, it is possible for a user to declare a variable in, for example, a block in Jupyter Notebook and write a statement to display certain attribute information of said variable in the same block. This means that the variable cannot be assumed to be in scope automatically. I was planning to address this case by not offering suggestions entirely, unless other behavior is preferred instead.

# Jupyter Notebook example above
tmp = NEW process [ {"name": "potato"} ]
DISP tmp ATTR <autocomplete?>

I was also working on writing a function to parse the code and find the variable that the user is trying to get attributes for, but I wasn't sure how to limit the parsing to the beginning of the current statement (e.g. DISP would be the beginning in the code block above). I was thinking that I could scan back to the last command keyword, but I wasn't sure whether that was the most efficient method. I also haven't put much thought into how I would go about doing that yet.


The biggest problem I've run into is a (possible) bug. I thought it would be best to confirm my understanding before creating an issue for it. Assuming there are attributes that start with "n", when I hit tab with DISP browsers ATTR n in my code block (Jupyter Notebook), the parser appears to treat "n" as a complete attribute value and moves on to the next field for autocompletion suggestions, while still using "n" as the starting letter for that next field. These are the lines in the debug log pertaining to this occurrence:

14:46:47 DEBUG kestrel.session code="# display the information (attributes name, pid) of the entities in variable `browsers`
DISP browsers ATTR n" prefix="# display the information (attributes name, pid) of the entities in variable `browsers`
DISP browsers ATTR n" last_word="n"
14:46:47 DEBUG kestrel.session standard auto-complete
14:46:47 DEBUG kestrel.session first parse: [{'command': 'disp', 'input': 'browsers', 'transform': None, 'attrs': 'n'}]
14:46:47 DEBUG kestrel.session exception: [ERROR] KestrelSyntaxError: invalid character "@" at line 2 column 21, expects one of ['APPLY', 'SORT', 'GET', 'JOIN', 'NEW', 'TRANSFORM', 'LOAD', 'GROUP', 'INFO', 'LIMIT', 'OFFSET', 'FIND', 'SAVE', 'DISP', 'VARIABLE']
rewrite the failed statement.
14:46:48 DEBUG kestrel.session keywords: {'last', 'NEW', 'TO', 'START', 'CONTAINED', 'TIMESTAMPED', 'FROM', 'MAX', 'OR', 'ASC', 'SORT', 'null', 'count', 'BIN', 'info', 'disp', 'attr', 'sum', 'accepted', 'FIND', 'find', 'limit', 'bin', 'OWNED', 'OFFSET', 'or', 'linked', 'stop', 'save', 'as', 'DESC', 'LOADED', 'where', 'STOP', 'with', 'NULL', 'owned', 'GROUP', 'get', 'LIMIT', 'WITH', 'sort', 'max', 'ATTR', 'JOIN', 'INFO', 'ACCEPTED', 'to', 'new', 'ON', 'GET', 'COUNT', 'WHERE', 'SUM', 'group', 'LINKED', 'AND', 'contained', 'APPLY', 'on', 'created', 'timestamped', 'and', 'AS', 'min', 'offset', 'load', 'from', 'asc', 'apply', 'MIN', 'desc', 'DISP', 'by', 'AVG', 'BY', 'start', 'LOAD', 'CREATED', 'nunique', 'avg', 'loaded', 'LAST', 'NUNIQUE', 'SAVE', 'join'}
14:46:48 DEBUG kestrel.session token: APPLY
14:46:48 DEBUG kestrel.session token: SORT
14:46:48 DEBUG kestrel.session token: GET
14:46:48 DEBUG kestrel.session token: JOIN
14:46:48 DEBUG kestrel.session token: NEW
14:46:48 DEBUG kestrel.session token: TRANSFORM
14:46:48 DEBUG kestrel.session token: LOAD
14:46:48 DEBUG kestrel.session token: GROUP
14:46:48 DEBUG kestrel.session token: INFO
14:46:48 DEBUG kestrel.session token: LIMIT
14:46:48 DEBUG kestrel.session token: OFFSET
14:46:48 DEBUG kestrel.session token: FIND
14:46:48 DEBUG kestrel.session token: SAVE
14:46:48 DEBUG kestrel.session token: DISP
14:46:48 DEBUG kestrel.session token: VARIABLE
14:46:48 DEBUG kestrel.session ['TIMESTAMPED', '_', 'apply', 'browsers', 'disp', 'find', 'get', 'group', 'info', 'join', 'limit', 'load', 'new', 'offset', 'proclist', 'save', 'sort'] -> ['ew']

I am assuming that this isn't intended behavior. I thought a possible cause of the issue was the space before @autocompletions@ in self.parse(prefix + " @autocompletions@"), but removing it didn't appear to change anything.


I believe that concludes my update for the time being. Please let me know if more details are required for anything I've mentioned!

subbyte commented 1 year ago

Very good problem description, @vereimyst ! Now you are 1/3 down the road---a full procedure of solving a problem has three phases: describing/formalizing the problem, figuring out a solution, and implementing it.

  1. You helped us understand the new feature better in your problem statement. Let's reformat it according to the recent syntax update:
  2. You are right that before a command gets executed (that is, before the variable is established in a Kestrel session), the auto-complete function cannot query session.symtable to get the Kestrel variable. That means if the following is in one Jupyter cell, the auto-complete function cannot obtain information about the variable tmp, since it has not been executed/established yet.

    tmp = NEW process [ {"name": "potato"} ]
    DISP tmp ATTR <autocomplete?>

    I agree with your choice: in this case, auto-complete should give 0 results, since looking up tmp in session.symtable returns nothing.

  3. There are two approaches to locate the last statement in a multi-statement Jupyter cell (so you can get the variable from the last statement):

    1. The hack: you can enumerate the cases. For example, there are only two possible commands, DISP and assign, in Kestrel that could lead to auto-completion of token == ATTRIBUTES. You can walk back through words in do_complete() to find the nearest DISP keyword or = sign (some logic is needed here, depending on whether DISP is found, to decide whether it is a DISP command or an assign command). Then go forward in words to locate the variable name according to the grammar defined for the two commands in syntax/Kestrel.lark.
    2. The systematic way: use the parser to help you recognize the previous statements and the last one. You may want to play with Lark a little bit to see how it behaves and what information you can get when giving it multiple statements with the last one incomplete. To do it, create a Python Notebook/script with the following toy parser code to understand how much information you can pull from the parser/Lark with previous statements (maybe you can try some Lark API that we haven't tried like InteractiveParser):
      
      from lark import Lark

      grammar = r"""
      ?start: (disp|cmd)+
      disp: "disp"i VARIABLE ("attr"i ATTRIBUTES)?
      cmd: "cmd"i "from"i PATHS "where"i PATHS
      ATTRIBUTE: CNAME
      ATTRIBUTES: ATTRIBUTE ("," WS+ ATTRIBUTE)*
      VARIABLE: CNAME
      PATHS: (LETTER|DIGIT|/[-_.:,]/)+
      %import common (CNAME, LETTER, DIGIT, WS)
      %ignore WS
      """
      g = Lark(grammar, parser='lalr')

      # 2-statements program
      pattern = r"DISP xyz ATTR pid, name, command_line CMD FROM xxx WHERE yyy"

      # 3-statements program with the last one incomplete
      pattern = r"DISP xyz ATTR pid, name, command_line CMD FROM xxx WHERE yyy CMD FROM"

      ast = g.parse(pattern)
      print(ast)
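
The "hack" approach described above could look roughly like the sketch below. This is only an illustration and not the code in do_complete(): find_attr_target is a made-up helper name, the words are assumed to be whitespace-split, and neither transforms such as TIMESTAMPED(...) nor an = sign inside a WHERE clause are handled.

```python
from typing import List, Optional


def find_attr_target(words: List[str]) -> Optional[str]:
    """Return the variable whose attributes are being completed, or None.

    Hypothetical helper: walks back from the end of `words` to the nearest
    DISP keyword or `=` sign and returns the word that follows it.
    """
    for i in range(len(words) - 1, -1, -1):
        if words[i].upper() == "DISP" or words[i] == "=":
            # the variable name is assumed to come right after DISP or `=`
            return words[i + 1] if i + 1 < len(words) else None
    return None


words = 'tmp = NEW process [ {"name": "potato"} ] DISP tmp ATTR'.split()
print(find_attr_target(words))  # tmp
```

Walking backwards means the DISP of the last statement is found before the = of the earlier assignment, so earlier statements in the same cell do not interfere.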


4. Good catch of the problem. I just did an experiment with the toy code above with

pattern = r"DISP xyz ATTR n@auto@"

And the parser thinks `n` is the end of the first command `DISP`, and `@auto@` is the beginning of the second command:

DISP xyz ATTR n@auto@
               ^
Expected one of:

This issue affects more than the new feature you are adding. For any auto-completion, our current logic in do_complete() fails when the user is completing the last token of a statement and the part already typed is itself a valid token, e.g., n is a valid ATTRIBUTES token.

Please report/describe this in a new issue (bug issue) and you may want to work on this new issue first. Thanks again for finding the issue. As I mentioned, this is a more general issue than ATTRIBUTES auto completion. That means fixing it will likely upgrade our logic in do_complete(). You can think about possible solutions when writing the issue, and a naive idea is to run the parse() function twice to better understand what needs to be auto completed: one on prefix + " @autocompletions@", and one on prefix - last word. Maybe you come up with a more elegant solution than our hack of using prefix + " @autocompletions@" :-)

  5. For case 2, @pcoccoli has a smart idea on where to get the attribute info. Besides getting the attributes, it requires you to parse the last GET or FIND statement to get the entity type (similar to the variable parsing you are doing for case 1), so let's work on a separate PR for it after the case 1 PR, which in turn comes after the new issue PR.
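
For the naive "run parse() twice" idea above, the second parse needs the prefix with the last word removed. A minimal sketch of that split, assuming words are delimited by whitespace (split_last_word is a made-up name, not an existing Kestrel function):

```python
from typing import Tuple


def split_last_word(prefix: str) -> Tuple[str, str]:
    """Split `prefix` into (text before the word under the cursor, that word)."""
    i = len(prefix)
    # move left until whitespace or the start of the text is reached
    while i > 0 and not prefix[i - 1].isspace():
        i -= 1
    return prefix[:i], prefix[i:]


print(split_last_word("DISP browsers ATTR n"))  # ('DISP browsers ATTR ', 'n')
print(split_last_word("DISP conns ATTR "))      # ('DISP conns ATTR ', '')
```

The first element could then be handed to the parser to learn which token type is expected at the cursor, while the second element filters the suggestions.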
pcoccoli commented 1 year ago

stix-shifter has a mapping command (so presumably an API call) that returns a connector's mapping. We can parse the from_stix_map portion of that to help autocomplete STIX patterns. If we can parse the partial statement (I think this is the main issue - I'm not sure how to get the AST of an incomplete statement with lark) and get the return type and data source, then we can get the complete list of STIX object paths that can be used in a pattern for that datasource. Note that this is completely different than autocompleting properties for a variable. In that case, we need to get the variable name, then find its type, and check the corresponding database table.

pcoccoli commented 1 year ago

I'm not sure if this is helpful, but maybe instead of adding @autocompletions@ and re-parsing, we could start looking at the parsed statements and completing from that. E.g.:

In [7]: result = parse('tmp = NEW process [ {"name": "potato"} ]\nDISP tmp ATTR n')

In [8]: result
Out[8]: 
[{'command': 'new',
  'type': 'process',
  'data': '[ {"name": "potato"} ]',
  'output': 'tmp'},
 {'command': 'disp', 'input': 'tmp', 'transform': None, 'attrs': 'n'}]

We can see here that the last statement is a disp, and I think we know the last_word is "n". From that can we infer that we're looking for an attribute of tmp that starts with "n"? That means stepping back to the previous statement (in this case), recognizing that tmp is being defined there, so we need to json.loads data and look for a key starting with "n".
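
As a rough sketch of that inference, assuming the parse() output has exactly the shape shown above (attrs_from_new_statement is a hypothetical helper, and only variables defined by NEW in the same cell are covered):

```python
import json
from typing import Dict, List


def attrs_from_new_statement(statements: List[Dict], last_word: str) -> List[str]:
    """Suggest attribute names for the variable in a trailing DISP statement.

    `statements` is assumed to look like the parse() output shown above.
    """
    if not statements or statements[-1].get("command") != "disp":
        return []
    variable = statements[-1]["input"]
    # step back through earlier statements to find where the variable is defined
    for stmt in reversed(statements[:-1]):
        if stmt.get("command") == "new" and stmt.get("output") == variable:
            records = json.loads(stmt["data"])
            keys = {k for rec in records if isinstance(rec, dict) for k in rec}
            return sorted(k for k in keys if k.startswith(last_word))
    return []


result = [
    {"command": "new", "type": "process",
     "data": '[ {"name": "potato"} ]', "output": "tmp"},
    {"command": "disp", "input": "tmp", "transform": None, "attrs": "n"},
]
print(attrs_from_new_statement(result, "n"))  # ['name']
```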

pcoccoli commented 1 year ago

Also, a friendly reminder that there's a unit test for autocomplete in tests/test_completion.py - you should add these cases first and see them fail, then implement the change until the new cases (and all the old cases) pass.

vereimyst commented 1 year ago

It actually just occurred to me that while running the auto-complete function with the current bugged aspect of treating the "n" as a completed ATTRIBUTES field, there is this line in the debug log that says this:

16:48:51 DEBUG kestrel.session standard auto-complete
16:48:51 DEBUG kestrel.session first parse: [{'command': 'disp', 'input': 'conns', 'transform': None, 'attrs': 'n'}]
16:48:51 DEBUG kestrel.session exception: [ERROR] KestrelSyntaxError: invalid character "@" at line 1 column 18, expects one of ['APPLY', 'JOIN', 'OFFSET', 'TRANSFORM', 'SAVE', 'GROUP', 'LOAD', 'NEW', 'VARIABLE', 'DISP', 'INFO', 'GET', 'SORT', 'LIMIT', 'FIND']

which I believe fits the output format you included above. I didn't see the first parse portion in the call where the autocompletion works correctly (as shown below). I was wondering why it wasn't going through the try block, and whether there might be some way to force it to? I thought that would be easiest, since I could just refer to the input of the last statement parsed to get the variable in question, assuming the statement follows correct Kestrel syntax (is it necessary to add a check that the value is a variable?).

16:26:17 DEBUG kestrel.session standard auto-complete
16:26:17 DEBUG kestrel.session exception: [ERROR] KestrelSyntaxError: invalid token "" at line 1 column 12, expects "ATTRIBUTES"

If not, should I just add code to run the same sequence under the token == ATTRIBUTES case? For context, this is what I was looking at in the try block specifically:

stmt = self.parse(prefix)
_logger.debug("first parse: %s", stmt)
last_stmt = stmt[-1]
vereimyst commented 1 year ago

Regarding reporting the new issue I found, it actually appears that when I try to do autocompletion of a variable (example below), it works properly (mostly?). This is the first block that I have in my Jupyter Notebook file:

conns = GET network-traffic
       FROM file:///home/myst/kestrel-lang/tests/test_bundle.json
       where [network-traffic:dst_port < 10000]

I was pulling this information to replicate the test_completions.py information previously, so I could check what the expected column values would be when I ran `DISP conns`. I ended up getting the following results for different autocompletion calls. (I put the DISP command call in a subsequent block and ran the initial variable assignment separately.)

DISP <autocomplete>       # returns ['TIMESTAMPED', '_', 'conns']
DISP c<autocomplete>      # returns 'onns'
DISP conns<autocomplete>  # returns ''
DISP conns <autocomplete> # returns ['APPLY', 'ATTR', 'DISP', 'FIND', 'GET', 'GROUP', 'INFO', 'JOIN', 'LIMIT', 'LOAD', 'NEW', 'OFFSET', 'SAVE', 'SORT', 'TIMESTAMPED', 'WHERE', '_', 'conns']

I don't quite understand why 'conns' is offered as a suggestion for the last case (or 'TIMESTAMPED' for the first case), but I think the other results are mostly right. (?) It makes most sense to me that the first 3 cases should be parsed the same way (for attributes vs variables as well). I don't quite understand why they are each being treated differently; is there some reason for that? (EDIT: More specifically, I was thinking that the parser shouldn't be looking for the next possible value until there is a space between the cursor position and the last term, right? Also, why do the partial/complete value autocompletion cases work properly for variables when the same function doesn't work for the same cases for attributes?)


On a separate note, when I was running DISP conns ATTR <autocomplete>, it was only returning ['start', 'end', 'src_ref', 'dst_ref', 'src_port', 'dst_port', 'protocols', 'id'] as the autofill options. When you print out the full table by just running DISP conns, we end up getting separate columns for src_ref.id, src_ref.value, dst_ref.id, and dst_ref.value. I was wondering why it wasn't offering the .id/value options from the self.store.columns() function call? Or maybe just directly how to include the .id/value options?

pcoccoli commented 1 year ago

The first 3 cases look correct to me.

  1. DISP <autocomplete> should suggest all variables, which in this case is conns and the built-in _. It also suggests TIMESTAMPED because that's a function/transform you can use with a variable to see the (partial) records instead of the entities of a variable.
  2. DISP c<autocomplete> suggests the same as case 1 but filtered for suggestions that start with c.
  3. DISP conns<autocomplete> is complete, so there shouldn't be any suggestions.
  4. DISP conns <autocomplete> is tricky. DISP conns is complete, so it suggests all the optional things like ATTR, WHERE, etc., but also anything that you could start a new statement with (it would be nice to not suggest those, but I'm not sure how to make that happen).

The problem with autocompleting attributes is that src_ref.value is technically not a column of network-traffic. src_ref is, and the values of that column are id values from either the ipv4-addr or ipv6-addr tables. So in order to add the columns from those tables, it would have to know that and basically SQL JOIN those tables. There are some functions in firepit to help with that, but it gets complicated rather quickly.
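
To make the reference-column point concrete, here is a toy sqlite3 sketch. The two tables and their columns are invented for illustration and are much simpler than what firepit actually stores. It shows that src_ref only holds ids, and that the dotted names have to come from the referenced table via a JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# toy schema, assumed for illustration only
conn.executescript("""
CREATE TABLE "ipv4-addr" (id TEXT PRIMARY KEY, value TEXT);
CREATE TABLE "network-traffic" (id TEXT PRIMARY KEY, src_ref TEXT, dst_port INTEGER);
INSERT INTO "ipv4-addr" VALUES ('ipv4-addr--1', '192.168.0.1');
INSERT INTO "network-traffic" VALUES ('network-traffic--1', 'ipv4-addr--1', 443);
""")


def columns(table: str):
    """List the column names of a table in the toy database."""
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]


own_columns = columns("network-traffic")
# dotted names only exist once the referenced table is known
dotted = [f"src_ref.{c}" for c in columns("ipv4-addr")]
row = conn.execute(
    'SELECT nt.dst_port, a.value FROM "network-traffic" AS nt '
    'JOIN "ipv4-addr" AS a ON nt.src_ref = a.id'
).fetchone()

print(own_columns)  # ['id', 'src_ref', 'dst_port']
print(dotted)       # ['src_ref.id', 'src_ref.value']
print(row)          # (443, '192.168.0.1')
```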

subbyte commented 1 year ago

Some information to explain:

16:26:17 DEBUG kestrel.session standard auto-complete
16:26:17 DEBUG kestrel.session exception: [ERROR] KestrelSyntaxError: invalid token "" at line 1 column 12, expects "ATTRIBUTES"

This actually triggers the try section. However, an exception was thrown in the first line of the section (indicating a syntax error that the parser cannot parse), which is

stmt = self.parse(prefix)

So the second line

_logger.debug("first parse: %s", stmt)

does not execute to add the debug log, and of course, the last line in this section

self.parse(prefix + " @autocompletions@")

does not execute.
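
The difference between the two logs can be imitated with a toy stand-in for self.parse(). Everything below is made up for illustration (toy_parse, its two rules, and ToySyntaxError are not Kestrel code). It only shows why the "first parse" line appears in one log and not in the other.

```python
class ToySyntaxError(Exception):
    """Stand-in for KestrelSyntaxError."""


def toy_parse(text: str):
    # made-up rules that only imitate the two cases in the logs
    words = text.split()
    if "@" in text:
        raise ToySyntaxError('invalid character "@"')
    if words and words[-1].upper() == "ATTR":
        raise ToySyntaxError('expects "ATTRIBUTES"')
    return [{"command": "disp"}]


def trace_complete(prefix: str):
    steps = []
    try:
        stmt = toy_parse(prefix)                  # may raise right here
        steps.append(f"first parse: {stmt}")      # skipped if it raised
        toy_parse(prefix + " @autocompletions@")  # always fails in this toy
    except ToySyntaxError as err:
        steps.append(f"exception: {err}")
    return steps


print(trace_complete("DISP conns ATTR "))   # only the exception line
print(trace_complete("DISP conns ATTR n"))  # first parse, then the exception
```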

May I ask what the input was producing this?

subbyte commented 1 year ago

@pcoccoli how about adding an API in firepit to simplify the implementation of auto-complete for dotted attributes (so there is no need to explicitly use JOIN)?

import typing

def list_ref_columns(base_type: str, ref_name: str) -> typing.List[str]:
    # if `ref_name` does not end with `_ref` or `_refs`, return empty
    # check the base_type table to get the ref type of the ref_name
    # check the ref type table to get the list of columns to return
    ...
pcoccoli commented 1 year ago

That's a possibility. Most of the work is probably done already in module firepit.deref function auto_deref. That will return the list of joins and a projection. I think you could collect all the resulting columns from the projection. It's a somewhat expensive operation, so we should think about how to cache the result.
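
One way the proposed function and the caching concern could fit together is sketched below. The lookup tables are hard-coded, hypothetical stand-ins for what the store would know. The real work would go through firepit, and this sketch does not call it. A tuple is returned instead of a list so the cached value cannot be mutated by callers.

```python
from functools import lru_cache
from typing import Tuple

# Hypothetical, hard-coded stand-ins for what the store would know.
TABLE_COLUMNS = {
    "network-traffic": ("id", "src_ref", "dst_ref", "src_port", "dst_port"),
    "ipv4-addr": ("id", "value"),
}
REF_TARGETS = {
    ("network-traffic", "src_ref"): "ipv4-addr",
    ("network-traffic", "dst_ref"): "ipv4-addr",
}


@lru_cache(maxsize=None)
def list_ref_columns(base_type: str, ref_name: str) -> Tuple[str, ...]:
    """Return the columns of the table that `ref_name` points to."""
    if not ref_name.endswith(("_ref", "_refs")):
        return ()
    ref_type = REF_TARGETS.get((base_type, ref_name))
    if ref_type is None:
        return ()
    return TABLE_COLUMNS.get(ref_type, ())


print(list_ref_columns("network-traffic", "src_ref"))   # ('id', 'value')
print(list_ref_columns("network-traffic", "dst_port"))  # ()
```

A cache like this would need to be cleared whenever new data changes the tables, which is not shown here.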

vereimyst commented 1 year ago

From this comment:

May I ask what the input was producing this?

The input that was producing this was DISP conns ATTR <autocomplete>.


And this comment:

The first 3 cases look correct to me.

  1. DISP <autocomplete> should suggest all variables, which in this case is conns and the built-in _. It also suggests TIMESTAMPED because that's a function/transform you can use with a variable to see the (partial) records instead of the entities of a variable.
  2. DISP c<autocomplete> suggests the same as case 1 but filtered for suggestions that start with c.
  3. DISP conns<autocomplete> is complete, so there shouldn't be any suggestions.
  4. DISP conns <autocomplete> is tricky. DISP conns is complete, so it suggests all the optional things like ATTR, WHERE, etc., but also anything that you could start a new statement with (it would be nice to not suggest those, but I'm not sure how to make that happen).

For this part, I was mainly comparing the differences in addressing autocomplete for a variable vs an attribute. I don't quite understand why a partial completion case isn't working for attributes, but works perfectly fine for a variable. I assume it is because the line is being parsed differently (or something along those lines)... Is there a reason for this difference in treatment or would removing it resolve the bug that the attribute partial autocompletion is running into?


For the src_ref autocompletion, would it be possible to just offer the src_ref part as a suggestion, and treat it as a separate case if the user presses tab again after that? Or would it be preferable to address src_ref.id/value in this issue directly? Just wondering if this is also a possibility, since it appears that directly requesting DISP conns ATTR src_ref also returns information (it should be the .id values, if I remember correctly).

subbyte commented 1 year ago

I did a test to re-execute the auto-completion case you created (you can run it in the python venv where kestrel-lang is installed):

#!/usr/bin/env python

import logging

from kestrel.session import Session

logging.basicConfig(level=logging.DEBUG)

stmt = """
conns = get network-traffic
        from file:///tmp/test_bundle.json
        where dst_port < 10000
"""

code = "DISP conns ATTR "

with Session() as session:
    session.execute(stmt)
    result = session.do_complete(code, len(code))
    print(result)

The output is:

DEBUG:kestrel.session:standard auto-complete
DEBUG:kestrel.session:exception: [ERROR] KestrelSyntaxError: invalid token "" at line 1 column 12, expects "ATTRIBUTES"
rewrite the failed statement.
DEBUG:kestrel.session:keywords: {'BY', 'start', 'LOAD', 'LINKED', 'bin', 'OWNED', 'SAVE', 'loaded', 'APPLY', 'disp', 'SORT', 'COUNT', 'DESC', 'join', 'ATTR', 'TIMESTAMPED', 'SUM', 'ASC', 'WITH', 'START', 'created', 'offset', 'sum', 'GET', 'new', 'NULL', 'OFFSET', 'CREATED', 'contained', 'CONTAINED', 'or', 'FROM', 'count', 'accepted', 'asc', 'and', 'WHERE', 'to', 'BIN', 'JOIN', 'MIN', 'LIMIT', 'group', 'get', 'max', 'sort', 'desc', 'timestamped', 'save', 'load', 'OR', 'AVG', 'apply', 'by', 'LOADED', 'on', 'owned', 'NEW', 'INFO', 'null', 'STOP', 'stop', 'attr', 'TO', 'AND', 'last', 'ON', 'AS', 'linked', 'limit', 'MAX', 'NUNIQUE', 'info', 'nunique', 'ACCEPTED', 'GROUP', 'from', 'avg', 'LAST', 'find', 'DISP', 'with', 'FIND', 'where', 'as', 'min'}
DEBUG:kestrel.session:token: ATTRIBUTES
DEBUG:kestrel.session:['ATTRIBUTES'] -> ['ATTRIBUTES']
['ATTRIBUTES']

In short, the auto-completion code is executed as expected. And if you add an elif for ATTRIBUTES, it should be triggered.
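
A simplified illustration of what such a branch might do is below. expand_tokens is a made-up function and not the actual body of do_complete(). It assumes the expected token names and the candidate variable/attribute names are already available, and it returns suffixes in the style seen in the debug logs (e.g. 'ew' for 'new').

```python
from typing import List


def expand_tokens(tokens: List[str], last_word: str,
                  variables: List[str], attributes: List[str]) -> List[str]:
    """Turn expected token names into completion suffixes."""
    names: List[str] = []
    for token in tokens:
        if token == "VARIABLE":
            names.extend(variables)
        elif token == "ATTRIBUTES":
            # the new branch: offer column names instead of the literal token
            names.extend(attributes)
        else:
            names.append(token)
    # keep candidates matching what was typed, return only the missing part
    return sorted({n[len(last_word):] for n in names if n.startswith(last_word)})


attrs = ["start", "end", "src_ref", "dst_ref", "src_port", "dst_port", "protocols", "id"]
print(expand_tokens(["ATTRIBUTES"], "", ["conns"], attrs))
print(expand_tokens(["ATTRIBUTES"], "s", ["conns"], attrs))  # ['rc_port', 'rc_ref', 'tart']
```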

subbyte commented 1 year ago

For now, @vereimyst may want to go ahead and use src_ref as the suggestion but not src_ref.value. We can beef it up in a later version.

A possible future upgrade is to suggest src_ref when the user presses tab after ATTR, and suggest .value when the user presses tab after src_ref (two-step auto-completion).
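
A minimal sketch of how the two-step behaviour could work, assuming the top-level columns and the columns behind each reference are already known (two_step_suggestions and the ref_columns mapping are hypothetical and not part of Kestrel or firepit):

```python
from typing import Dict, List


def two_step_suggestions(typed: str, columns: List[str],
                         ref_columns: Dict[str, List[str]]) -> List[str]:
    """Step 1 completes top-level columns; step 2 extends a finished *_ref name."""
    if typed in ref_columns:
        # the user already has a full reference column: offer its sub-columns
        return [f".{c}" for c in ref_columns[typed]]
    return [c for c in columns if c.startswith(typed)]


columns = ["start", "end", "src_ref", "dst_ref", "src_port", "dst_port", "protocols", "id"]
ref_columns = {"src_ref": ["id", "value"], "dst_ref": ["id", "value"]}

print(two_step_suggestions("src", columns, ref_columns))      # ['src_ref', 'src_port']
print(two_step_suggestions("src_ref", columns, ref_columns))  # ['.id', '.value']
```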