Open subbyte opened 3 years ago
Hello! I've been working on this issue for a while and thought it would be good to post an update on where I'm currently stuck. You can find what I have so far for the attribute autocompletion feature here.
Firstly, I've somewhat broken down the issue to try and simplify the steps of work. The plan would be to submit separate pull requests, each addressing one of the following cases:
1. token == ATTRIBUTES (old STIXPATHS): DISP <variable> ATTR <autocomplete>
2. token == STIXPATTERNBODY: WHERE <autocomplete>
For the time being, I've been focusing on Case 1.
Some questions I had whilst working on this issue thus far:
For Case 1 in particular, it is possible for a user to declare a variable in, for example, a block in Jupyter Notebook and write a statement to display certain attribute information of said variable in the same block. This means that the variable cannot be assumed to be in scope automatically. I was planning to address this case by not offering suggestions entirely, unless other behavior is preferred instead.
# Jupyter Notebook example above
tmp = NEW process [ {"name": "potato"} ]
DISP tmp ATTR <autocomplete?>
-- I was also working on writing a function to parse the code and find the variable that the user is trying to get attributes for, but I wasn't sure how to limit the parsing to the beginning of a statement of code (e.g. DISP would be the beginning in the code block above). I was thinking that I could check until the last command called, but I wasn't sure if that was the most efficient method. I also haven't put much thought into how I would go about doing that quite yet.
The biggest problem I've run into is a (possible) bug. I thought it would be best to confirm my understanding before creating an issue for it. Assuming there are attributes that start with "n", when I hit tab with DISP browsers ATTR n in my code block (Jupyter Notebook), it appears that the parser treats "n" as a complete attribute value and moves on to the next field for autocompletion suggestions, while retaining "n" as the starting letter for that next field. These are the lines in the debug log pertaining to this occurrence:
14:46:47 DEBUG kestrel.session code="# display the information (attributes name, pid) of the entities in variable `browsers`
DISP browsers ATTR n" prefix="# display the information (attributes name, pid) of the entities in variable `browsers`
DISP browsers ATTR n" last_word="n"
14:46:47 DEBUG kestrel.session standard auto-complete
14:46:47 DEBUG kestrel.session first parse: [{'command': 'disp', 'input': 'browsers', 'transform': None, 'attrs': 'n'}]
14:46:47 DEBUG kestrel.session exception: [ERROR] KestrelSyntaxError: invalid character "@" at line 2 column 21, expects one of ['APPLY', 'SORT', 'GET', 'JOIN', 'NEW', 'TRANSFORM', 'LOAD', 'GROUP', 'INFO', 'LIMIT', 'OFFSET', 'FIND', 'SAVE', 'DISP', 'VARIABLE']
rewrite the failed statement.
14:46:48 DEBUG kestrel.session keywords: {'last', 'NEW', 'TO', 'START', 'CONTAINED', 'TIMESTAMPED', 'FROM', 'MAX', 'OR', 'ASC', 'SORT', 'null', 'count', 'BIN', 'info', 'disp', 'attr', 'sum', 'accepted', 'FIND', 'find', 'limit', 'bin', 'OWNED', 'OFFSET', 'or', 'linked', 'stop', 'save', 'as', 'DESC', 'LOADED', 'where', 'STOP', 'with', 'NULL', 'owned', 'GROUP', 'get', 'LIMIT', 'WITH', 'sort', 'max', 'ATTR', 'JOIN', 'INFO', 'ACCEPTED', 'to', 'new', 'ON', 'GET', 'COUNT', 'WHERE', 'SUM', 'group', 'LINKED', 'AND', 'contained', 'APPLY', 'on', 'created', 'timestamped', 'and', 'AS', 'min', 'offset', 'load', 'from', 'asc', 'apply', 'MIN', 'desc', 'DISP', 'by', 'AVG', 'BY', 'start', 'LOAD', 'CREATED', 'nunique', 'avg', 'loaded', 'LAST', 'NUNIQUE', 'SAVE', 'join'}
14:46:48 DEBUG kestrel.session token: APPLY
14:46:48 DEBUG kestrel.session token: SORT
14:46:48 DEBUG kestrel.session token: GET
14:46:48 DEBUG kestrel.session token: JOIN
14:46:48 DEBUG kestrel.session token: NEW
14:46:48 DEBUG kestrel.session token: TRANSFORM
14:46:48 DEBUG kestrel.session token: LOAD
14:46:48 DEBUG kestrel.session token: GROUP
14:46:48 DEBUG kestrel.session token: INFO
14:46:48 DEBUG kestrel.session token: LIMIT
14:46:48 DEBUG kestrel.session token: OFFSET
14:46:48 DEBUG kestrel.session token: FIND
14:46:48 DEBUG kestrel.session token: SAVE
14:46:48 DEBUG kestrel.session token: DISP
14:46:48 DEBUG kestrel.session token: VARIABLE
14:46:48 DEBUG kestrel.session ['TIMESTAMPED', '_', 'apply', 'browsers', 'disp', 'find', 'get', 'group', 'info', 'join', 'limit', 'load', 'new', 'offset', 'proclist', 'save', 'sort'] -> ['ew']
I am assuming that this isn't intended behavior. I thought a possible cause of the issue was the space before @autocompletions@ in self.parse(prefix + " @autocompletions@"), but removing it didn't appear to change anything.
I believe that concludes my update for the time being. Please let me know if more details are required for anything I've mentioned!
Very good problem description, @vereimyst ! Now you are 1/3 down the road---a full procedure of solving a problem has three phases: describing/formalizing the problem, figuring out a solution, and implementing it.
case 1: token == ATTRIBUTES
- commands: DISP and assign
- grammar path: expression -> attr_clause -> ATTRIBUTES
case 2: token == ENTITY_ATTRIBUTE_PATH
- commands: GET and FIND
- grammar path: where_clause -> ecg_pattern -...-> comparison -> ENTITY_ATTRIBUTE_PATH
You are right that before a command gets executed (that is, before the variable is established in a Kestrel session), the auto complete function cannot query session.symtable to get the Kestrel variable. That means if the following is in one Jupyter cell, the auto complete function cannot obtain information about the variable tmp, since the statement defining it has not been executed yet.
tmp = NEW process [ {"name": "potato"} ]
DISP tmp ATTR <autocomplete?>
I agree with your choice: in this case, auto complete should give 0 results, since checking tmp in session.symtable comes back empty.
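A minimal sketch of this "no suggestions when the variable is not established yet" behavior. Here `symtable` is modeled as a plain dict and `columns_of` is a hypothetical callback standing in for however the session looks up a variable's columns; neither is Kestrel's actual API.

```python
from typing import Callable, Dict, List


def suggest_attributes(
    symtable: Dict[str, object],
    var_name: str,
    last_word: str,
    columns_of: Callable[[object], List[str]],
) -> List[str]:
    """Return attribute names of `var_name` that start with `last_word`."""
    # Variable defined in the same, not-yet-executed cell: offer nothing.
    if var_name not in symtable:
        return []
    columns = columns_of(symtable[var_name])
    return sorted(c for c in columns if c.startswith(last_word))


# Toy symbol table: each variable maps directly to its column names.
symtable = {"browsers": ["name", "pid", "command_line"]}
known = suggest_attributes(symtable, "browsers", "n", lambda cols: list(cols))
unknown = suggest_attributes(symtable, "tmp", "", lambda cols: list(cols))
```

With this toy table, `known` holds the single match for `browsers` and `unknown` is empty because `tmp` was never executed into the session.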
There are two approaches to locate the last statement in a multi-statement Jupyter cell (so you can get the variable from the last statement).
DISP and assign are the commands in Kestrel that could lead to auto completion of token == ATTRIBUTES. You can walk words in do_complete() backwards to find the first keyword DISP or = (you need some logic here, depending on whether DISP is found, to decide whether it is a DISP command or an assign command). Then go forward in words to locate the variable name according to the grammar defined for the two commands in syntax/Kestrel.lark.
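The walk-back described above could look roughly like this. It is a naive sketch: `find_target_variable` is a hypothetical helper, it assumes `words` is the whitespace-split cell content, and it assumes the variable is the word right after DISP or `=`. It does not handle transforms such as TIMESTAMPED(...) or an `=` that appears inside a WHERE clause.

```python
from typing import List, Optional


def find_target_variable(words: List[str]) -> Optional[str]:
    """Walk `words` backwards to the nearest DISP or '=' and return the next word."""
    for i in range(len(words) - 1, -1, -1):
        if words[i].upper() == "DISP" or words[i] == "=":
            # The variable is assumed to directly follow the keyword.
            if i + 1 < len(words):
                return words[i + 1]
            return None
    return None


code = 'tmp = NEW process [ {"name": "potato"} ]\nDISP tmp ATTR n'
disp_var = find_target_variable(code.split())            # DISP command
assign_var = find_target_variable("b = a ATTR".split())  # assign command
```

Because the walk starts from the end of the cell, the earlier `tmp = NEW ...` statement is never reached when the last statement is a DISP.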
from lark import Lark

grammar = r"""
?start: (disp|cmd)+
disp: "disp"i VARIABLE ("attr"i ATTRIBUTES)?
cmd: "cmd"i "from"i PATHS "where"i PATHS
ATTRIBUTE: CNAME
ATTRIBUTES: ATTRIBUTE ("," WS+ ATTRIBUTE)*
VARIABLE: CNAME
PATHS: (LETTER|DIGIT|/[-_.:,]/)+
%import common (CNAME, LETTER, DIGIT, WS)
%ignore WS
"""

g = Lark(grammar, parser='lalr')
pattern = r"DISP xyz ATTR pid, name, command_line CMD FROM xxx WHERE yyy"
ast = g.parse(pattern)
print(ast)
4. Good catch of the problem. I just did an experiment with the toy code above with
pattern = r"DISP xyz ATTR n@auto@"
The parser thinks `n` is the end of the first command `DISP`, and `@auto@` is the beginning of the second command:
DISP xyz ATTR n@auto@
               ^
Expected one of:
Which results in the problem you encountered.
This issue does not only affect the new feature you are adding. For any auto complete, our current logic in do_complete() will not work when the user is completing the last token in a statement and the prefix typed so far is already a valid token, e.g., n is a valid ATTRIBUTES token.
Please describe this in a new issue (bug issue); you may want to work on that issue first. Thanks again for finding it. As I mentioned, this is a more general issue than ATTRIBUTES auto completion, which means fixing it will likely upgrade our logic in do_complete(). You can think about possible solutions when writing the issue. A naive idea is to run the parse() function twice to better understand what needs to be auto completed: once on prefix + " @autocompletions@", and once on prefix - last word. Maybe you will come up with a more elegant solution than our hack of using prefix + " @autocompletions@" :-)
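To make the "parse twice" idea concrete, here is a small sketch that only builds the two strings to feed to parse(). `completion_probes` is a hypothetical helper, and taking the trailing run of word characters as the last word is an assumption, not necessarily how do_complete() computes last_word.

```python
import re
from typing import Tuple

MARKER = "@autocompletions@"


def completion_probes(prefix: str) -> Tuple[str, str]:
    """Build the two probe strings for the naive two-parse idea."""
    # Trailing word characters typed so far; empty if prefix ends with a space.
    last_word = re.search(r"\w*$", prefix).group(0)
    head = prefix[: len(prefix) - len(last_word)]
    # Probe 1: marker after the last word (the current hack).
    # Probe 2: marker in place of the last word (prefix - last word).
    return prefix + " " + MARKER, head + MARKER


after_word, in_place_of_word = completion_probes("DISP browsers ATTR n")
```

If the first probe fails expecting the start of a new statement while the second fails expecting ATTRIBUTES, that would suggest the last word is a partially typed ATTRIBUTES token rather than a finished one.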
Case 2 requires parsing GET or FIND to get the entity type (similar to the variable parsing you are doing for case 1), so let's work on a separate PR for it after the case 1 PR, which itself comes after the new issue PR.
stix-shifter has a mapping command (so presumably an API call) that returns a connector's mapping. We can parse the from_stix_map portion of that to help autocomplete STIX patterns. If we can parse the partial statement (I think this is the main issue: I'm not sure how to get the AST of an incomplete statement with lark) and get the return type and data source, then we can get the complete list of STIX object paths that can be used in a pattern for that data source.
Note that this is completely different from autocompleting properties for a variable. In that case, we need to get the variable name, then find its type, and check the corresponding database table.
I'm not sure if this is helpful, but maybe instead of adding @autocompletions@ and re-parsing, we could start from the parsed statements and complete from that. E.g.:
In [7]: result = parse('tmp = NEW process [ {"name": "potato"} ]\nDISP tmp ATTR n')
In [8]: result
Out[8]:
[{'command': 'new',
'type': 'process',
'data': '[ {"name": "potato"} ]',
'output': 'tmp'},
{'command': 'disp', 'input': 'tmp', 'transform': None, 'attrs': 'n'}]
We can see here that the last statement is a disp, and I think we know the last_word is "n". From that, can we infer that we're looking for an attribute of tmp that starts with "n"? That means stepping back to the previous statement (in this case), recognizing that tmp is being defined there, so we need to json.loads the data and look for a key starting with "n".
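A rough sketch of that inference, working only on the list of dicts shown in the parse output above. `attrs_from_new_statement` is a hypothetical helper, and it only covers the case where the variable's most recent definition is a NEW statement with inline JSON data.

```python
import json
from typing import Dict, List


def attrs_from_new_statement(stmts: List[Dict], last_word: str) -> List[str]:
    """Suggest keys from an earlier NEW statement for a trailing DISP."""
    last = stmts[-1]
    if last.get("command") != "disp":
        return []
    target = last.get("input")
    # Step back to the most recent statement that defines the target variable.
    for stmt in reversed(stmts[:-1]):
        if stmt.get("output") != target:
            continue
        if stmt.get("command") != "new":
            return []  # defined some other way; not handled in this sketch
        records = json.loads(stmt["data"])
        keys = {key for record in records for key in record}
        return sorted(k for k in keys if k.startswith(last_word))
    return []


parsed = [
    {"command": "new", "type": "process",
     "data": '[ {"name": "potato"} ]', "output": "tmp"},
    {"command": "disp", "input": "tmp", "transform": None, "attrs": "n"},
]
suggestions = attrs_from_new_statement(parsed, "n")
```

This avoids the symtable entirely for the same-cell case, at the cost of only understanding NEW.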
Also, a friendly reminder that there's a unit test for autocomplete in tests/test_completion.py - you should add these cases first and see them fail, then implement the change until the new cases (and all the old cases) pass.
It just occurred to me that when the auto-complete function runs into the current bug of treating "n" as a completed ATTRIBUTES field, the debug log contains these lines:
16:48:51 DEBUG kestrel.session standard auto-complete
16:48:51 DEBUG kestrel.session first parse: [{'command': 'disp', 'input': 'conns', 'transform': None, 'attrs': 'n'}]
16:48:51 DEBUG kestrel.session exception: [ERROR] KestrelSyntaxError: invalid character "@" at line 1 column 18, expects one of ['APPLY', 'JOIN', 'OFFSET', 'TRANSFORM', 'SAVE', 'GROUP', 'LOAD', 'NEW', 'VARIABLE', 'DISP', 'INFO', 'GET', 'SORT', 'LIMIT', 'FIND']
which I believe fits the output format you included above. I didn't see the first parse portion in the call where the autocompletion works correctly (as shown below). I was wondering why it wasn't going through the try block, and whether there might be some way to force it to? I thought that would be easiest, since I could just refer to the input of the last parsed statement to get the variable in question, assuming the statement is valid Kestrel (is it necessary to add a check that the value is a variable?).
16:26:17 DEBUG kestrel.session standard auto-complete
16:26:17 DEBUG kestrel.session exception: [ERROR] KestrelSyntaxError: invalid token "" at line 1 column 12, expects "ATTRIBUTES"
If not, should I just add code to run the same sequence under the token == ATTRIBUTES case? For context, this is what I was looking at in the try block specifically:
stmt = self.parse(prefix)
_logger.debug("first parse: %s", stmt)
last_stmt = stmt[-1]
Regarding reporting the new issue I found: it actually appears that autocompletion of a variable (example below) works properly (mostly?). This is the first block that I have in my Jupyter Notebook file:
conns = GET network-traffic
FROM file:///home/myst/kestrel-lang/tests/test_bundle.json
where [network-traffic:dst_port < 10000]
I was pulling this information to replicate the tests/test_completion.py setup, so I could check what the expected column values would be when I ran `DISP conns`. I got the following results for different autocompletion calls. (I put the DISP command in a subsequent block and ran the initial variable assignment separately.)
DISP <autocomplete> # returns ['TIMESTAMPED', '_', 'conns']
DISP c<autocomplete> # returns 'onns'
DISP conns<autocomplete> # returns ''
DISP conns <autocomplete> # returns ['APPLY', 'ATTR', 'DISP', 'FIND', 'GET', 'GROUP', 'INFO', 'JOIN', 'LIMIT', 'LOAD', 'NEW', 'OFFSET', 'SAVE', 'SORT', 'TIMESTAMPED', 'WHERE', '_', 'conns']
I don't quite understand why 'conns' is offered as a suggestion for the last case (or 'TIMESTAMPED' for the first case), but I think the other results are mostly right. (?) It makes most sense to me that the first 3 cases should be parsed the same way (for attributes as well as variables). I don't quite understand why they are each being treated differently; is there some reason for that? (EDIT: More specifically, I was thinking that the parser shouldn't be looking for the next possible value until there is a space between the cursor position and the last term, right? Also, why do the partial/complete value autocompletion cases work properly for variables when the same function doesn't work for the same cases for attributes?)
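The suffix-style results above ('onns' for DISP c, '' for DISP conns) and the `-> ['ew']` line in the earlier debug log are consistent with a filter-then-strip step like the one below. This is only a sketch of that observed behavior; `to_completions` is a hypothetical name and is not taken from Kestrel's source.

```python
from typing import Iterable, List


def to_completions(suggestions: Iterable[str], last_word: str) -> List[str]:
    """Keep suggestions starting with `last_word`; return only the missing suffix."""
    return [s[len(last_word):] for s in suggestions if s.startswith(last_word)]


variables = ["TIMESTAMPED", "_", "conns"]
partial = to_completions(variables, "c")       # user typed: DISP c
complete = to_completions(variables, "conns")  # user typed: DISP conns
```

Under this reading, variables and attributes would go through the same last step; the difference in the buggy case is which suggestion list gets built before it.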
On a separate note, when I ran DISP conns ATTR <autocomplete>, it only returned ['start', 'end', 'src_ref', 'dst_ref', 'src_port', 'dst_port', 'protocols', 'id'] as the autofill options. When you print out the full table by just running DISP conns, you get separate columns for src_ref.id, src_ref.value, dst_ref.id, and dst_ref.value. I was wondering why the .id/.value options aren't offered from the self.store.columns() function call? Or, more directly, how can I include the .id/.value options?
The first 3 cases look correct to me.
1. DISP <autocomplete> should suggest all variables, which in this case are conns and the built-in _. It also suggests TIMESTAMPED because that's a function/transform you can use with a variable to see the (partial) records instead of the entities of a variable.
2. DISP c<autocomplete> suggests the same as case 1 but filtered for suggestions that start with c.
3. DISP conns<autocomplete> is complete, so there shouldn't be any suggestions.
4. DISP conns <autocomplete> is tricky. DISP conns is complete, but it suggests all the optional things like ATTR, WHERE, etc., and also anything that you could start a new statement with (it would be nice to not suggest those, but I'm not sure how to make that happen).
The problem with autocompleting attributes is that src_ref.value is technically not a column of network-traffic. src_ref is, and the values of that column are id values from either the ipv4-addr or ipv6-addr table. So in order to add the columns from those tables, it would have to know that and basically SQL JOIN those tables. There are some functions in firepit to help with that, but it gets complicated rather quickly.
Some information to explain:
16:26:17 DEBUG kestrel.session standard auto-complete
16:26:17 DEBUG kestrel.session exception: [ERROR] KestrelSyntaxError: invalid token "" at line 1 column 12, expects "ATTRIBUTES"
This actually triggers the try section. However, an exception was thrown in the first line of the section (indicating a syntax error: the parser cannot parse the input), which is
stmt = self.parse(prefix)
So the second line
_logger.debug("first parse: %s", stmt)
does not execute to add the debug log and, of course, the last line in this section
self.parse(prefix + " @autocompletions@")
does not execute either.
May I ask what the input was producing this?
@pcoccoli how about adding an API in firepit to simplify the implementation of auto-complete for dotted attributes (so there is no need to explicitly use JOIN)?
import typing


def list_ref_columns(base_type: str, ref_name: str) -> typing.List[str]:
    # if `ref_name` does not end with `_ref` or `_refs`, return empty
    # check the base_type table to get the ref type of the ref_name
    # check the ref type table to get the list of columns to return
    ...
That's a possibility. Most of the work is probably done already in the function auto_deref in module firepit.deref. That will return the list of joins and a projection. I think you could collect all the resulting columns from the projection.
It's a somewhat expensive operation, so we should think about how to cache the result.
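One simple way to cache such a lookup is functools.lru_cache keyed on the two arguments. In this sketch `_expensive_lookup` is a hypothetical stand-in for the real dereferencing work (it is not a firepit function) and the columns it returns are placeholders.

```python
from functools import lru_cache
from typing import Tuple

CALLS = {"count": 0}  # only here to show how often the slow path runs


def _expensive_lookup(base_type: str, ref_name: str) -> Tuple[str, ...]:
    """Hypothetical stand-in for the costly join/projection work."""
    CALLS["count"] += 1
    if not ref_name.endswith(("_ref", "_refs")):
        return ()
    return ("id", "value")  # placeholder columns for illustration


@lru_cache(maxsize=None)
def cached_ref_columns(base_type: str, ref_name: str) -> Tuple[str, ...]:
    # Tuples are returned so the cached value cannot be mutated by callers.
    return _expensive_lookup(base_type, ref_name)


first = cached_ref_columns("network-traffic", "src_ref")
second = cached_ref_columns("network-traffic", "src_ref")  # served from cache
```

The cache would need to be dropped (`cached_ref_columns.cache_clear()`) whenever new data is loaded into the store, since the set of columns can change.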
From this comment:
May I ask what the input was producing this?
The input that was producing this was DISP conns ATTR <autocomplete>.
And this comment:
The first 3 cases look correct to me.
1. DISP <autocomplete> should suggest all variables, which in this case are conns and the built-in _. It also suggests TIMESTAMPED because that's a function/transform you can use with a variable to see the (partial) records instead of the entities of a variable.
2. DISP c<autocomplete> suggests the same as case 1 but filtered for suggestions that start with c.
3. DISP conns<autocomplete> is complete, so there shouldn't be any suggestions.
4. DISP conns <autocomplete> is tricky. DISP conns is complete, but it suggests all the optional things like ATTR, WHERE, etc., and also anything that you could start a new statement with (it would be nice to not suggest those, but I'm not sure how to make that happen).
For this part, I was mainly comparing the differences in addressing autocomplete for a variable vs an attribute. I don't quite understand why a partial completion case isn't working for attributes, but works perfectly fine for a variable. I assume it is because the line is being parsed differently (or something along those lines)... Is there a reason for this difference in treatment or would removing it resolve the bug that the attribute partial autocompletion is running into?
For the src_ref autocompletion, would it be possible to just offer src_ref as a suggestion, and treat it as a separate case if the user presses tab again after that? Or would it be preferable to address src_ref.id/value in this issue directly? Just wondering if this is also a possibility, since directly requesting DISP conns ATTR src_ref also returns information (it should be the .id values, if I remember correctly).
I did a test to re-execute the auto-completion case you created (you can run it in the Python venv where kestrel-lang is installed):
#!/usr/bin/env python
import logging

from kestrel.session import Session

logging.basicConfig(level=logging.DEBUG)

# Statement executed first so that `conns` exists in the session.
stmt = """
conns = get network-traffic
from file:///tmp/test_bundle.json
where dst_port < 10000
"""

# Code to auto-complete; the cursor sits at the end of the string.
code = "DISP conns ATTR "

with Session() as session:
    session.execute(stmt)
    result = session.do_complete(code, len(code))
    print(result)
The output is:
DEBUG:kestrel.session:standard auto-complete
DEBUG:kestrel.session:exception: [ERROR] KestrelSyntaxError: invalid token "" at line 1 column 12, expects "ATTRIBUTES"
rewrite the failed statement.
DEBUG:kestrel.session:keywords: {'BY', 'start', 'LOAD', 'LINKED', 'bin', 'OWNED', 'SAVE', 'loaded', 'APPLY', 'disp', 'SORT', 'COUNT', 'DESC', 'join', 'ATTR', 'TIMESTAMPED', 'SUM', 'ASC', 'WITH', 'START', 'created', 'offset', 'sum', 'GET', 'new', 'NULL', 'OFFSET', 'CREATED', 'contained', 'CONTAINED', 'or', 'FROM', 'count', 'accepted', 'asc', 'and', 'WHERE', 'to', 'BIN', 'JOIN', 'MIN', 'LIMIT', 'group', 'get', 'max', 'sort', 'desc', 'timestamped', 'save', 'load', 'OR', 'AVG', 'apply', 'by', 'LOADED', 'on', 'owned', 'NEW', 'INFO', 'null', 'STOP', 'stop', 'attr', 'TO', 'AND', 'last', 'ON', 'AS', 'linked', 'limit', 'MAX', 'NUNIQUE', 'info', 'nunique', 'ACCEPTED', 'GROUP', 'from', 'avg', 'LAST', 'find', 'DISP', 'with', 'FIND', 'where', 'as', 'min'}
DEBUG:kestrel.session:token: ATTRIBUTES
DEBUG:kestrel.session:['ATTRIBUTES'] -> ['ATTRIBUTES']
['ATTRIBUTES']
The final line, ['ATTRIBUTES'], comes from print(result), which is correct. The suggestion is the token ATTRIBUTES, added in tmp.append(token).
In short, the auto-completion code is executed as expected. If you add an elif for ATTRIBUTES, it should be triggered.
For now, @vereimyst may want to go ahead and use src_ref as the suggestion but not src_ref.value. We can beef it up in a later version.
A possible future upgrade is to suggest src_ref when the user presses tab after ATTR, and to suggest .value when the user presses tab after src_ref (two-step auto completion).
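The two-step idea could be sketched as follows, assuming a flat list of column names in which dereferenced columns appear in dotted form (as in the DISP conns table). `two_step_suggestions` is a hypothetical helper, not existing Kestrel code.

```python
from typing import Iterable, List


def two_step_suggestions(columns: Iterable[str], typed: str) -> List[str]:
    """Step 1 suggests top-level names; step 2 suggests dotted tails."""
    columns = list(columns)
    base, dot, partial = typed.partition(".")
    # Step 2: a dot was typed, or `typed` is exactly a reference column name.
    if dot or any(c.startswith(typed + ".") for c in columns):
        tails = sorted({c[len(base):] for c in columns if c.startswith(base + ".")})
        return [t for t in tails if t.startswith(dot + partial)]
    # Step 1: only the part before the first dot is offered.
    heads = {c.split(".", 1)[0] for c in columns}
    return sorted(h for h in heads if h.startswith(typed))


columns = ["start", "end", "src_ref.id", "src_ref.value",
           "dst_ref.id", "dst_ref.value", "src_port", "dst_port"]
step_one = two_step_suggestions(columns, "src")
step_two = two_step_suggestions(columns, "src_ref")
```

Here the first tab after typing `src` offers the top-level names, and a second tab once `src_ref` is complete offers the dotted tails.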
Is your feature request related to a problem? Please describe. The current auto-complete function scans variable names (in the session), data source names, and analytics interface names. It would be awesome if we could also auto-complete entity attributes (and entity names).
Advanced version: we may cache entity attributes in VarStruct for fast access.