oasis-open / cti-pattern-validator

OASIS TC Open Repository: Validate patterns used to express cyber observable content in STIX Indicators
https://stix2-patterns.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Add more helper functions to inspection #37

Open johnwunder opened 6 years ago

johnwunder commented 6 years ago

It would be nice to get summary stats about a pattern:

This would help you determine whether a pattern was parsable or usable by your tool.

chisholm commented 6 years ago

The inspector gives you a lot of information, including:

Here is an example pattern, modified slightly from one of the test cases:

[foo:bar.xyz=1 and bar:foo not > 33] repeats 12 times
    or ([baz:bar issubset '1234'] followedby [baz:quux not like 'a_cd'])

The inspector returns a 3-tuple consisting of comparison data, observation operators, and qualifiers. The most complex data is the comparison data. Below is the comparison data which would currently be returned for the above pattern:

{'bar': [(['foo'], 'NOT >', '33')],
 'baz': [(['bar'], 'ISSUBSET', "'1234'"), (['quux'], 'NOT LIKE', "'a_cd'")],
 'foo': [(['bar', 'xyz'], '=', '1')]}

The outermost structure is: {observable_type: list_of_comparisons}. So all comparisons for a particular observable type are grouped together. In the above pattern, three observable types are in use, so the map has three entries: bar, baz, and foo. baz is used in two comparisons, so there are two elements in its list. The rest have only one.

The structure of each element of each list is: (path_components, operator, value). path_components is itself a list, since the inspector splits up each path into its components, to make it easier for users to identify them. Most paths in this example have one component, but the first path has two (bar.xyz), so its corresponding list has length two. The operator is a string copied from the comparison expression (but uppercased, for uniformity), and likewise for the value. Quotes and the like are not stripped from a value, since they can assist in determining its type.
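To make that shape concrete, here is a minimal sketch that flattens the comparison data into rows. It works on the literal dictionary copied from this comment rather than calling the library, and the helper name `iter_comparisons` is made up for illustration.

```python
# Comparison data copied from the example above (not produced by a library call here).
comparisons = {
    "bar": [(["foo"], "NOT >", "33")],
    "baz": [(["bar"], "ISSUBSET", "'1234'"), (["quux"], "NOT LIKE", "'a_cd'")],
    "foo": [(["bar", "xyz"], "=", "1")],
}


def iter_comparisons(comparisons):
    """Yield (observable_type, dotted_path, operator, value) for every comparison."""
    for observable_type, entries in sorted(comparisons.items()):
        for path_components, operator, value in entries:
            # str() guards against path components that are not plain strings.
            dotted_path = ".".join(str(c) for c in path_components)
            yield observable_type, dotted_path, operator, value


flat = list(iter_comparisons(comparisons))
for row in flat:
    print(row)
```

Joining the components with a dot is only a display convenience; the list form is what you would normally test against.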

Observation operators are given as a set of strings. Each operator in use is included:

{'OR', 'FOLLOWEDBY'}

Finally, each qualifier in use is also given as a set of strings:

{'REPEATS 12 TIMES'}

So I think the information described in your first bullet is there. The information from the second bullet might be there, depending on what you're looking for. "Observation expression" encompasses a lot of inner structure. I think the components of that structure are basically there.

Here is a usage example I wrote, for a different task. It is intended to determine whether a pattern consists of a single ipv4 address equality comparison:

def is_ipv4_equals_pattern(pattern):
    """
    Determines whether the given pattern is of the form
        [ipv4-addr:value = <some_address>]
    """
    results = pattern.inspect()
    return not results.observation_ops \
        and not results.qualifiers \
        and len(results.comparisons) == 1 \
        and "ipv4-addr" in results.comparisons \
        and len(results.comparisons["ipv4-addr"]) == 1 \
        and results.comparisons["ipv4-addr"][0][0] == ["value"] \
        and results.comparisons["ipv4-addr"][0][1] in ("=", "==")

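The things it checks are, in order:

1. No observation operators are used.
2. No qualifiers are used.
3. Exactly one observable type is used.
4. That observable type is ipv4-addr.
5. There is exactly one comparison on ipv4-addr.
6. The path of that comparison is the single component "value".
7. The operator is "=" or "==".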

So that's a lot of checks, but I think it shows that you can get a lot of specificity from the data provided. You could probably think of even more that could be included, but I think the more structure and complexity there is, the more complicated the resulting checks can be. I just wanted to determine whether the pattern was a simple equality comparison of an IP address, but that took seven separate tests. It could grow even larger. A balance probably needs to be struck, between flexibility and usability.
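One way to strike that balance might be to fold the checks into a reusable helper. The sketch below is only an illustration: `is_single_comparison` and the `Inspection` stand-in are hypothetical names, and the stand-in assumes nothing beyond the three attributes used in the function above (`comparisons`, `observation_ops`, `qualifiers`).

```python
from collections import namedtuple

# Hypothetical stand-in for the inspector's result, built by hand for this sketch.
Inspection = namedtuple("Inspection", ["comparisons", "observation_ops", "qualifiers"])


def is_single_comparison(results, object_type, path, operators):
    """Return True if the inspected pattern is exactly one comparison of the given form."""
    if results.observation_ops or results.qualifiers:
        return False
    # Exactly one observable type, and it must be the requested one.
    if set(results.comparisons) != {object_type}:
        return False
    entries = results.comparisons[object_type]
    if len(entries) != 1:
        return False
    found_path, found_operator, _value = entries[0]
    return found_path == path and found_operator in operators


# Hand-built inspection data; the addresses are documentation-range placeholders.
simple = Inspection({"ipv4-addr": [(["value"], "=", "'198.51.100.7'")]}, set(), set())
compound = Inspection(
    {"ipv4-addr": [(["value"], "=", "'198.51.100.7'"),
                   (["value"], "=", "'198.51.100.8'")]},
    set(),
    set(),
)

print(is_single_comparison(simple, "ipv4-addr", ["value"], ("=", "==")))
print(is_single_comparison(compound, "ipv4-addr", ["value"], ("=", "==")))
```

The same helper would cover other "single simple comparison" questions by changing the arguments, at the cost of hiding which individual check failed.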

Does this address your needs?

johnwunder commented 6 years ago

Thanks. Yes, I saw that the inspector can provide that structure already. This issue was kind of about the opposite...not the information that you can get, but how easy it is to get and pull it out.

For example, to get the list of objects you would collect the unique set of keys in the dictionary. To get the operators you would get the (unique) second value of the tuple for each of the values. That all is fine and workable, but it seemed to me like something that many of the users of the library would want...so if those convenience functions were provided as part of the library it would save everyone from having to rewrite them.

E.g. if the inspect function returned an intermediate object you could call functions like get_objects, get_object_paths, get_operators and have it more immediately usable.
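As a rough sketch of what such helpers could look like: the three function names come from the suggestion above, but their signatures and bodies are assumptions, not anything the library provides. They operate on comparison data in the shape chisholm described.

```python
def get_objects(comparisons):
    """Return the set of observable types used in the pattern."""
    return set(comparisons)


def get_object_paths(comparisons):
    """Return the set of 'type:path' strings used in the pattern."""
    return {
        "{}:{}".format(obj_type, ".".join(str(c) for c in path))
        for obj_type, entries in comparisons.items()
        for path, _operator, _value in entries
    }


def get_operators(comparisons):
    """Return the set of comparison operators used in the pattern."""
    return {
        operator
        for entries in comparisons.values()
        for _path, operator, _value in entries
    }


# Comparison data in the shape shown earlier in this thread.
example = {
    "bar": [(["foo"], "NOT >", "33")],
    "baz": [(["bar"], "ISSUBSET", "'1234'"), (["quux"], "NOT LIKE", "'a_cd'")],
    "foo": [(["bar", "xyz"], "=", "1")],
}

print(sorted(get_objects(example)))
print(sorted(get_object_paths(example)))
print(sorted(get_operators(example)))
```

Joining path components with a dot is a simplification for display; it is not meant to reproduce the exact path syntax of the pattern language.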

gtback commented 6 years ago

@infosec-alchemist and I were talking about this before I saw this issue. I agree we need to figure out how the data from the inspector is likely to be used, and make sure that data is easily accessible.

infosec-alchemist commented 6 years ago

I created a function in pattern.py which returns a JSON object containing a list of the objects and properties that are part of the pattern.

I'm basically taking the pattern object and extracting parts. This is needed by Unfetter, which makes the call.

However, maybe I should have a different Python program that takes the Pattern object and formats it for the outside program, letting pattern.py be more of the workhorse.

I think that's more about how you want to architect your code's interaction with other programs.

gtback commented 6 years ago

@johnwunder @infosec-alchemist, just wanted to bump this issue. Is there anything that's needed from this library (for the pattern translator perhaps)?

gtback commented 6 years ago

I removed this from the 1.0 milestone, since we're trying to get a 1.0 version out pretty soon, and I'm still not sure what additional helper functions would be useful.

theY4Kman commented 5 years ago

I've been working on a pattern expression parser at Perch (@usePF) that spits out a dict-based tree representation of a Pattern. I figure a tool/language is only as effective as one's ability to troubleshoot/debug it, so along with the dict-based tree is a YAML-based DSL for human consumption.

The working title has been "Pattern Tree". I'd like to open-source the thing, and had a hunch it might fit in this repo. I can work on a PR in my spare time if y'all agree.

So, I, uhhhhh... might've written something resembling an entire spec for the thing... but sparing y'all that, here's an excerpt from the original PR. If you really wanna subject yourself to the torture, I can reproduce the spec :P

Excerpt

This PR proposes handling STIX2 Pattern Expressions with a new class, intel.pattern.Pattern. This class has a method, to_dict_tree(), which converts the ANTLR parse tree to a new dict-based tree structure, intended to be more easily consumable.

from intel.pattern import Pattern
pattern = Pattern("[domain-name:value = 'http://xyz.com/download']")

assert pattern.to_dict_tree() == {
    'pattern': {
        'observation': {
            'objects': {'domain-name'},
            'join': None,
            'qualifiers': None,
            'expressions': [
                {'comparison': {
                    'object': 'domain-name',
                    'path': ['value'],
                    'negated': None,
                    'operator': '=',
                    'value': 'http://xyz.com/download',
                }}
            ]
        }
    }
}
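A tree like this can be consumed with ordinary dict and list handling. The sketch below is not part of the proposed API: `find_comparisons` is a hypothetical helper, and it assumes only that comparison nodes are dicts stored under a `'comparison'` key, as in the example above.

```python
def find_comparisons(node):
    """Recursively collect every dict stored under a 'comparison' key."""
    found = []
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "comparison" and isinstance(child, dict):
                found.append(child)
            else:
                found.extend(find_comparisons(child))
    elif isinstance(node, (list, tuple)):
        for child in node:
            found.extend(find_comparisons(child))
    # Anything else (strings, sets, None) holds no nested comparisons.
    return found


# Literal copy of the tree from the example above.
tree = {
    'pattern': {
        'observation': {
            'objects': {'domain-name'},
            'join': None,
            'qualifiers': None,
            'expressions': [
                {'comparison': {
                    'object': 'domain-name',
                    'path': ['value'],
                    'negated': None,
                    'operator': '=',
                    'value': 'http://xyz.com/download',
                }}
            ]
        }
    }
}

for comparison in find_comparisons(tree):
    print(comparison['object'], comparison['path'],
          comparison['operator'], comparison['value'])
```

Because the walker does not depend on the names of the intermediate nodes, it should tolerate deeper nesting, though that has only been checked here against this one tree.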

A specialized YAML representation is also proposed, to make visualization of this data a little less cumbersome:

from intel.pattern import Pattern
pattern = Pattern("[domain-name:value = 'http://xyz.com/download']")

assert str(pattern.to_dict_tree()) == '''\
pattern:
  observation:
    objects: {domain-name}
    join:
    qualifiers:
    expressions:
      - comparison:
          object: domain-name
          path: [value]
          negated:
          operator: '='
          value: http://xyz.com/download
'''

chisholm commented 5 years ago

I am unclear on your goal for this: is it intended to represent the complete original pattern semantics, or just some selected details (like the pattern inspector which is the subject of this issue)? The name "Pattern Tree" makes me think of an AST, which is intended to capture full semantics, but it's not clear from these examples that your trees do that.

theY4Kman commented 5 years ago

Aye, the intention is to retain the complete original pattern semantics. After reading your comment, I figured the pattern inspector is doing exactly what it's meant to, so I've packaged this PatternTree thing as a standalone library: https://github.com/usePF/dendrol

chisholm commented 5 years ago

OK. FYI, some AST functionality has been written, although it is currently not centralized in one place. There are AST node classes in the stix2 project, but the only AST-building code is in the slider as far as I know, since I guess that's the only place an AST has been needed so far.

We have talked about other uses for it, e.g. using pattern structure to determine semantic equivalence. Maybe if there were enough of a need for it outside of the slider, the AST builder would receive a "promotion" of some sort, to the main stix2 project.

theY4Kman commented 5 years ago

Ah, maybe "the intention" is too strong of a posit — it's one of the intentions, for sure; the real impetus was wanting to extract from a Pattern expression more than just what was in it, but how those things fit together.

I may've missed some functionality (no, I definitely missed a bunch of shit) while searching the OASIS STIX libraries, but if nonstandard evaluation of expressions is desired (like, if the whole STIX2 Observed Data bundle isn't available at once to be matched against), it seemed there were only two options: pass an ANTLR Listener to Pattern.walk(), building up a custom representation (essentially, a whole 'nother parser); or deal directly with the grammar through ANTLR ('cause the parse tree isn't exposed to adults).

Not many devs understand how to work with ANTLR, and STIX2 Patterns are too fuckin cool to hide behind that wall. Working with STIX2 objects and relationships between them feels pretty natural in Python, but the Pattern expressions have felt pretty opaque. As a dev, one might encounter Patterns in Indicators, see the possibilities, and want them to be more than just strings. One might naturally end up in this repo, thinking Pattern.inspect() is right up that alley — but it just gives "yes, an equal sign is used". Like, how the hell do I get to the meat!?

If the community at large is to do cool shit with STIX2, like I'm excited to do, I can't ask them to understand formal grammars or ANTLR. That's the impetus of the tool, and why I thought it made sense here.

(reading all that back, I'm realizing the magnitude of my passion. I won't ask you to forgive it, but I acknowledge my bluntness)