oasis-open / cti-python-stix2

OASIS TC Open Repository: Python APIs for STIX 2
https://stix2.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
356 stars 113 forks source link

Parser is not robust!! #540

Open dynamic-modeller opened 2 years ago

dynamic-modeller commented 2 years ago

Hi,

I have been playing with your library and found that the parser is not robust, and seems to break easily in two sitatuions:

  1. Tring to parse threat reports from your own website, specifically the 3 at the bottom of this page (https://oasis-open.github.io/cti-documentation/stix/examples)
  2. Trying to parse srtix outputs that include extra properties, the Mitre Attack Demo (https://github.com/mitre-attack/attack-stix-data)

In both situations, the parsing function breaks but for two different reasons.

When considering the example threat reports included on your examples page, the parser can successfully import poisonivy.json, but fails on APT1.json and IMDDOS.json. All files were retrieved from the links on your website. These threat reports should be able to be parsed, but fail because of a problem with the codec

When considering the Mitre Attack example, it is acknowledged that there are additional properties included,however the content is still in Stix format. None of the 3 bundles at the Mitre github can be parsed, and all fail with an error [ Unexpected properties for Bundle: (spec_version)..]

the behaviour one expects is: a. all valid stix files can be imported, particularly the demos from your site b. stix files with extensions should still be imported, with either the additional fields in place, or removed, depending on some setting

if the parser was more robust, it could easily form the core of a really useful stix importer, and the existing stix objects could then be expanded to report their own typeql import statements (e.g. develop a to_typedb function for every object).

However the parser is currently not robust to even the slightest variations,a nd thereby we might have to write our own parser, which seems a waste. Can you advise please?

Thanks

chisholm commented 2 years ago

When considering the example threat reports included on your examples page, the parser can successfully import poisonivy.json, but fails on APT1.json and IMDDOS.json. All files were retrieved from the links on your website. These threat reports should be able to be parsed, but fail because of a problem with the codec

APT1 and Poison Ivy work for me:

>>> with open("apt1.json", encoding="utf-8") as fp:
...    bundle = stix2.parse(fp)
...
>>> len(bundle.objects)
76
>>> with open("poisonivy.json", encoding="utf-8") as fp:
...    bundle = stix2.parse(fp)
...
>>> len(bundle.objects)
155

IMDDOS may have a problem: it uses a STIX 2.0 bundle which declares its spec version as "2.0", but includes a location object, which was new in STIX 2.1. Or, maybe they consider it custom with respect to STIX 2.0? Not sure about that. You can get it to parse if you allow custom content:

>>> with open("imddos.json", encoding="utf-8") as fp:
...     bundle = stix2.parse(fp, allow_custom=True)
...
>>> len(bundle.objects)
15

The location object will come out as a plain dict.

When considering the Mitre Attack example, it is acknowledged that there are additional properties included,however the content is still in Stix format. None of the 3 bundles at the Mitre github can be parsed, and all fail with an error [ Unexpected properties for Bundle: (spec_version)..]

These are kind of oddball STIX 2.0 bundles: they actually declare their spec_version as "2.1". I am not sure that's non-compliant, since the STIX 2.0 spec says "The value of this property MUST be 2.0 for bundles containing STIX Objects defined in this specification", which leaves open the possibility that you could use other values for objects defined in other specs, e.g. "2.1" for objects from the STIX 2.1 spec. But regardless, the stix2 library does not support objects from other STIX versions in a STIX 2.0 bundle (those bundles all seem to have STIX 2.1 objects). That's a limitation of the library.

The attack-stix-data repo bills itself as containing STIX 2.1 content. So I think the real question here, is why are they using STIX 2.0 bundles?

If I remove the "spec_version" property from one of those bundles, which effectively converts it to a STIX 2.1 bundle, then it parses:

>>> with open("ics-attack.json", encoding="utf-8") as fp:
...      bundle = stix2.parse(fp, allow_custom=True)
...
>>> len(bundle.objects)
830

the behaviour one expects is: a. all valid stix files can be imported, particularly the demos from your site b. stix files with extensions should still be imported, with either the additional fields in place, or removed, depending on some setting

Well, it seems not all of that content is valid. :\ Or at least, may not be as its creator intended it to be.

I believe unregistered STIX 2.1 extensions should be accepted (if that's what you mean). If you find an error regarding extensions, please create an issue and include some details.

if the parser was more robust, it could easily form the core of a really useful stix importer, and the existing stix objects could then be expanded to report their own typeql import statements (e.g. develop a to_typedb function for every object).

However the parser is currently not robust to even the slightest variations,a nd thereby we might have to write our own parser, which seems a waste. Can you advise please?

My impression from looking at this, is that these are pretty much all content issues.

I did find a potential small issue with STIX spec version detection with bundles: presence of spec_version in the bundle should indicate version "2.0", regardless of its value (in STIX 2.0, the value indicates the spec version of the content of the bundle, not the bundle itself). I'll create a separate issue.

dynamic-modeller commented 2 years ago

Hey Chisholm,

Thanks for your speedy reply. Your were so right, and my opinion of the parser has switched 180 degrees. Its exactly what i was hoping for, I just needed two small changes.

Firstly, i had to set the encoding in the file open, as some of the files came from non windows source. Trap for young players, indeed. Instead of for file in file_paths: with open(file) as f: json_text = json.load(f)

I have changed it to specifiy encoding, which fixes up heaps of problems

for file in file_paths: with open(file, encoding='utf-8') as f: json_text = json.load(f)

Further, I have found out more about the wonderful parsing flag to allow custom, like

local_obj = parse(obj, allow_custom=True)

This then made the parser amazingly robust. It even handled the MITRE ATT&ACK stuff. On reading the docs its apparent that it even makes new properties for you, and one presumes even new classes. So, now i think your parser is amang, and of huge benefit for the typedb open source stix integration we are planning. Great Stuff.

thanks for that, more on my typedb integration into the default stix database soon