oasis-open / cti-python-stix2

OASIS TC Open Repository: Python APIs for STIX 2
https://stix2.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
367 stars 120 forks source link

SCO references using objects vs IDs cause different deterministic ID generation #396

Closed maybe-sybr closed 4 years ago

maybe-sybr commented 4 years ago

As shown in the following example:

import stix2.v21

pd = stix2.v21.Directory(path="/path/to")
f_obj = stix2.v21.File(name="foo", parent_directory_ref=pd)
f_ref = stix2.v21.File(name="foo", parent_directory_ref=pd.id)

print(f_obj.serialize(pretty=True))
print(f_ref.serialize(pretty=True))

outputs:

{
    "type": "file",
    "id": "file--bc403dd7-af2c-588b-a24e-d0f5150202e1",
    "name": "foo",
    "parent_directory_ref": "directory--b2d85bf6-7b9e-5de3-af38-0f856e7009ef",
    "spec_version": "2.1"
}
{
    "type": "file",
    "id": "file--f7de0951-c047-5f7d-bcc1-4075a44821f1",
    "name": "foo",
    "parent_directory_ref": "directory--b2d85bf6-7b9e-5de3-af38-0f856e7009ef",
    "spec_version": "2.1"
}

Note that the supposedly deterministic IDs for the two serialised File SCOs are different. This is caused by the handling of instances of _STIXBase in the kwargs passed to _STIXBase._generate_id(). When (in this example), the parent directory SCO is passed to _generate_id(), it makes a deep copy and adds that to the streamlined_obj_vals list which is canonicalised on a later line (which is also problematic, see #395). The deep copy of the STIX object being referenced is obviously not equivalent to the serialised reference itself, resulting in the variance in the generated ID for each object.

This can could presumably be naively fixed by making _generate_id() aware of ReferencePropertys (which I'd suggest be done since it's an easy win), but that would still leave open a potential issue where a ReferenceProperty is nested within some other ID contributing property. As a concrete example, since File SCOs include their extensions as an ID contributing property, a ReferenceProperty nested within an extension (e.g. a custom extension constructed similarly to my example code in #395) would be similarly mishandled.

maybe-sybr commented 4 years ago

Here's an example of the nested identifiers using archive-ext:

import stix2.v21

ad = stix2.v21.Directory(path="archived/path")
f_obj = stix2.v21.File(name="foo", extensions={
    "archive-ext": {
        "contains_refs": [ad, ],
    }
})
f_ref = stix2.v21.File(name="foo", extensions={
    "archive-ext": {
        "contains_refs": [ad.id, ],
    }
})

print(f_obj.serialize(pretty=True))
print(f_ref.serialize(pretty=True))

Outputs:

{
    "type": "file",
    "id": "file--7e26ad71-b261-5289-a8e3-3d7126321233",
    "name": "foo",
    "extensions": {
        "archive-ext": {
            "contains_refs": [
                "{\n    \"type\": \"directory\",\n    \"id\": \"directory--c1af83bc-164d-546d-8a73-207f7ad87324\",\n    \"path\": \"archived/path\",\n    \"spec_version\": \"2.1\"\n}"
            ]
        }
    },
    "spec_version": "2.1"
}
{
    "type": "file",
    "id": "file--916b26ba-65fd-5d95-95d7-866d24c9f98e",
    "name": "foo",
    "extensions": {
        "archive-ext": {
            "contains_refs": [
                "directory--c1af83bc-164d-546d-8a73-207f7ad87324"
            ]
        }
    },
    "spec_version": "2.1"
}

Interestingly, this also exposes another handling bug where the referenced Directory isn't rendered by its ID! So the deterministic IDs for the Files are different, and the serialised File which referenced the Directory object was incorrectly serialised.

clenk commented 4 years ago

Thank you @maybe-sybr; both of these issues you found should now be fixed in master and will be included in the next release.

maybe-sybr commented 4 years ago

Thanks for all these fixes, @clenk and @chisholm . Much appreciated!