Open import-pandas-as-numpy opened 1 year ago
@Robin5605 @AbooMinister25 @jonathan-d-zhang @Recursive-Error Review/eyes requested.
For this question:
Will this be something we can easily extend to other languages? If we ever elect to scan another ecosystem such as NPM, using an AST might be useful there too. We could avoid footgunning ourselves by abstracting this in a way that allows drop-in support for other languages.
Considering that the semantics of the languages differ, I imagine that which nodes we apply which rules to would change as well. Depending on how we structure the API, something like a per-language mapping from node types to sets of rules might be feasible.
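A rough sketch of what such a per-language mapping could look like. Everything here is hypothetical and only for illustration: the LanguageProfile type, the rule names, and the rules_for helper are not part of the existing scanner. The node type names are borrowed from Python's ast module and from ESTree for JavaScript.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageProfile:
    """Hypothetical per-language config: file extensions plus node-type-to-rules map."""
    extensions: tuple[str, ...]
    # Node type name -> names of the rules to run against nodes of that type.
    rules_by_node: dict[str, list[str]] = field(default_factory=dict)


# Assumed rule names; a real ruleset would define its own.
PROFILES = {
    "python": LanguageProfile(
        (".py",),
        {"Call": ["shell_command"], "Import": ["suspicious_import"]},
    ),
    "javascript": LanguageProfile(
        (".js", ".mjs"),
        {"CallExpression": ["shell_command"]},
    ),
}


def rules_for(filename: str, node_type: str) -> list[str]:
    """Return the rule names that apply to a node type in a given file."""
    for profile in PROFILES.values():
        # str.endswith accepts a tuple of suffixes.
        if filename.endswith(profile.extensions):
            return profile.rules_by_node.get(node_type, [])
    return []


print(rules_for("setup.py", "Call"))
print(rules_for("index.js", "CallExpression"))
```

With this shape, adding an ecosystem means adding a profile rather than touching the matching logic, though each language would still need its own parser behind it.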
Will this be something we can easily extend to other languages? If we ever elect to scan another ecosystem such as NPM, using an AST might be useful there too. We could avoid footgunning ourselves by abstracting this in a way that allows drop-in support for other languages.
Superficially, technically yes. If we go with something like tree-sitter, for instance, it supports parsing a whole bunch of languages.
Specification
Implement a feature that lets us match our current YARA rules against Python's abstract syntax tree (AST).
This feature should selectively run on files ending in ".py".
This feature should parse each ".py" file into an AST and visit every node.
This feature should treat AST nodes as separate files ending in ".py" for the purposes of our current ruleset.
This feature should provide a way to target specific rules at specific node types.
This feature should report which AST node a specific rule matched in, the type of that node, and the underlying code of that node.
This feature should compile this information to indicate in the original code which nodes matched which rules.
This feature might contain functionality to selectively ignore or regard constants and arguments as a default behavior specified in a rule.
This feature might spawn a separate folder of YARA rules that are only compiled and invoked against AST nodes to prevent contamination of current rulesets.
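A minimal sketch of the flow described above, using Python's own ast module. Plain regular expressions stand in for compiled YARA rules here so the example runs without extra dependencies; the NodeRule and NodeMatch types, the scan_source function, and the "destructive_rm" rule are all hypothetical names invented for illustration.

```python
import ast
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRule:
    """Stand-in for a YARA rule that targets specific AST node types."""
    name: str
    node_types: tuple[type, ...]
    pattern: re.Pattern


@dataclass(frozen=True)
class NodeMatch:
    rule: str
    node_type: str
    lineno: int
    code: str


# Assumed example rule: only inspect call expressions.
RULES = [
    NodeRule("destructive_rm", (ast.Call,), re.compile(r"rm\s+-rf\s+/")),
]


def scan_source(source: str, rules=RULES) -> list[NodeMatch]:
    """Treat each AST node's source text as its own small file to match against."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        # Returns None for nodes without position info (e.g. the Module node).
        segment = ast.get_source_segment(source, node)
        if segment is None:
            continue
        for rule in rules:
            if isinstance(node, rule.node_types) and rule.pattern.search(segment):
                matches.append(
                    NodeMatch(rule.name, type(node).__name__, node.lineno, segment)
                )
    return matches


SAMPLE = '''\
def cleanup():
    """Never run rm -rf / on a live host."""
    import subprocess
    subprocess.run("rm -rf /", shell=True)
'''

for m in scan_source(SAMPLE):
    print(f"{m.rule}: {m.node_type} at line {m.lineno}: {m.code}")
```

Because the rule is scoped to call nodes, the docstring on line 2 is never matched, and the reported line number and source segment point back at the offending call in the original file.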
Motivation
The abstract syntax tree offers much more context about what Python understands a function to be doing. The use of a string in one context doesn't make that string an indicator across the entire program. Passing something like "rm -rf /" in a subprocess or system command is far riskier than finding that string in a docstring. Current YARA conventions leave us two options: either we check that the string isn't in a docstring (currently impossible, no lookaheads/lookbehinds), or we spell out in the regex itself the specific contexts in which the command must flag. (In this case, we would have to look for subprocess calls with those arguments passed as a list.)
Additionally, this would be a significant quality of life enhancement for PyPI staff, who would now be pointed to a specific line of malicious behavior.
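To make the docstring versus command distinction concrete, here is a small sketch that collects only the string constants passed to command-executing calls. The COMMAND_SINKS set and both helper functions are assumptions made for this example, not an existing API, and the sketch does not resolve import aliases.

```python
import ast

# Hypothetical list of call targets treated as command sinks.
COMMAND_SINKS = {"os.system", "subprocess.run", "subprocess.Popen", "subprocess.call"}


def dotted_name(node: ast.AST) -> str:
    """Return 'a.b.c' for Name/Attribute chains, or '' for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


def command_strings(source: str) -> list[tuple[int, str]]:
    """Collect (line, string) pairs for string constants passed to command sinks."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and dotted_name(node.func) in COMMAND_SINKS:
            for arg in node.args:
                # Walk the argument so list forms like ["rm", "-rf", "/"] are covered.
                for sub in ast.walk(arg):
                    if isinstance(sub, ast.Constant) and isinstance(sub.value, str):
                        found.append((sub.lineno, sub.value))
    return found


SAMPLE = '''\
import os

def wipe():
    """Documentation that mentions rm -rf / is harmless."""
    os.system("rm -rf /")
'''

print(command_strings(SAMPLE))  # [(5, 'rm -rf /')]
```

The docstring on line 4 contains the same text but is never reported, because it is not an argument to any of the listed calls.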
Precedent for this exists in two forms, Semgrep and YARA itself. Semgrep understands far more of a language's semantics and uses them to derive the context in which something is used. YARA has PE file section features that let you reference specific sections of a PE file and judge behavior in the context where it appears. (For instance, ".rsrc" containing a "malware.dll" is something that YARA currently supports detection for.)
Open Questions
Are we reinventing Semgrep? Semgrep does not currently exist in Rust, but contains many of the same functionalities that we're aiming to replicate here.
Will we need to spin up a new ruleset for this? Otherwise, we stand to pollute our current rules with needless metadata fields that specify the behavior of these AST parsers.
Should we carve out functionality for the deobfuscators that Stickie and IlluminatiFish are working on while we're writing this feature? They make heavy use of the AST, and we'll likely want to bake this into the scanner at some point.
Will this be something we can easily extend to other languages? If we ever elect to scan another ecosystem such as NPM, using an AST might be useful there too. We could avoid footgunning ourselves by abstracting this in a way that allows drop-in support for other languages.
Requirements