oasis-open / cti-python-stix2

OASIS TC Open Repository: Python APIs for STIX 2
https://stix2.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
364 stars 119 forks source link

Properties ending in `_ref`/`_refs` should not be required to have a specific MRO #511

Open maybe-sybr opened 3 years ago

maybe-sybr commented 3 years ago

At the moment, we check the MRO of a property instance used in a custom SCO definition in order to determine if it is valid for use as a property with a name ending in either _ref or _refs. The relevant code snippets are:

https://github.com/oasis-open/cti-python-stix2/blob/e9d417de2592c0c7367c312ca0fd25dc8f8a9818/stix2/registration.py#L119-L128 https://github.com/oasis-open/cti-python-stix2/blob/e9d417de2592c0c7367c312ca0fd25dc8f8a9818/stix2/utils.py#L305-L310

I'm hitting issues with this because I've created and am using an annotated reference type which is set up as a DictionaryProperty itself containing a ref property which is a ReferenceProperty. Since any foo_ref annotated reference property I define is a subclass of a DictionaryProperty (and bar_refs would be a ListProperty containing DictionaryPropertys), the MRO check fails and the custom observable logic barfs on my definition. This is a problem for me since I've already used _ref(s) naming for some custom SDOs (which have no such check on properties ending in _ref(s)) and to have to use something like _refs_annotated for observables would be weird and a pain in the butt.

The section of the spec which I think this logic has come from is the first two dot points in 3.1:

Certain names of properties MUST have specific suffixes.

  • If the value of the property contains an ID reference for embedded relationships it MUST end in _ref
  • If the value of the property contains a list of embedded relationships it MUST end in _refs.

but IMO the implemeted logic is kind of backward. It states that properties containing (lists of) identifier(s) must be called something ending in _ref(s), but doesn't say that anything ending in _ref(s) has to be a (list of) identifier(s) which is what we're checking. If my custom schema states that foo_refs is a StringProperty and that doesn't contradict the spec, I figure I should be able to codify that in the implementation.

I could accept an interpretation where any property name ending in _ref(s) must contain a (list of) identifier(s), and maybe ask the TC to clarify that in the spec, if it was acceptable to say that they must either be or contain an identifier at some well-defined sub-(sub-...)property.

That leads me to the point about checking the MRO. In a previous life, that check lives in the CustomObservable decorator's validation function. When it got refactored into stix2.custom.py in b76888c6, we started using get_class_hierarchy_names() which checks the MRO rather than using an isinstance check. This seems like an odd choice and it's not really documented why it's done that way. I had planned to use abc.ABCMeta to register my annotated reference property class as a virtual subclass of ReferenceProperty as a hint to the stix2 lib that I was using something which was an "acceptable substitute" for a ReferenceProperty, but was foiled because the MRO doesn't take into account virtual subclasses which are handled through the ABCMeta.__subclasshook__ implementation.

There's also a theoretical use case for a custom of a property class which contains an identifier but doesn't inherit from ReferenceProperty, but I can't imagine why anyone would do that. In any case, virtual subclassing could also make that work if the MRO checks are changed.

I'd appreciate any comments on:

clenk commented 3 years ago

Good eye, @maybe-sybr. We should have added the check to the other object types too, not just SCOs.

Looking at the spec again, I think your interpretation is correct. I had assumed the logic went both ways (and that may have been the spec's intent, but a strict reading aligns with your interpretation). Enforcing the logic from the correct direction is a little more onerous. We'd have to examine the value of every property instead of filtering by the names.

I don't think you're trying to do anything unreasonable, since the spec doesn't say anything about dictionary keys with regards to embedded relationships. If the dictionaries in your DictionaryProperty contain ReferenceProperty and other properties, would it make sense to make them custom objects or extensions instead of dictionaries?

I don't recall the exact reason for using MRO in this specific instance, but I know I've used something similar to get around circular import errors in other places. There are other ways to get around such errors though, so I think it would be reasonable to go back to the isinstance checks.

maybe-sybr commented 3 years ago

I don't think you're trying to do anything unreasonable, since the spec doesn't say anything about dictionary keys with regards to embedded relationships. If the dictionaries in your DictionaryProperty contain ReferenceProperty and other properties, would it make sense to make them custom objects or extensions instead of dictionaries?

So a rough sanitisation of the Python code for my custom observable "event" type follows:

class AnnotatedReferenceProperty(stix.v21.DictionaryProperty):
    __KEY_TYPE_MAP = {
        "ref": stix2.properties.ReferenceProperty(
            spec_version="2.1", required=True, invalid_types=[],
        ),
    }

    def __init__(self, *args, annotation_types=None, **kwargs):
        super(AnnotatedReferenceProperty, self).__init__(*args, **kwargs)
        if annotation_types is None:
            self._type_map = self.__KEY_TYPE_MAP
        else:
            self._type_map = self.__KEY_TYPE_MAP.copy()
            self._type_map.update(annotation_types)

    def clean(self, value, allow_custom):
        d, has_custom = super(
            AnnotatedReferenceProperty, self
        ).clean(value, allow_custom)
        mk = set(k for k, v in self._type_map.items() if v.required) - set(d)
        if mk:
            raise KeyError("Missing reference annotations: {}".format(mk))
        uk = set(d) - set(self._type_map)
        has_custom |= bool(uk)
        if uk and not allow_custom:
            raise ValueError("Unknown reference annotations: {}".format(uk))
        # Clean the values while keeping track of `has_custom`
        ret = dict()
        for k, v in d.items():
            ret[k], val_is_custom = self._type_map[k].clean(v, allow_custom)
            has_custom |= val_is_custom
        return ret, has_custom

@stix2.v21.CustomObservable(
    "x-myorg-observed-event", properties=(
        # A single event SCO may describe more than one individual distinct
        # event type if those event types are coupled in some way
        ('event_types', ListProperty(
            EnumProperty(allowed=[e.value for e in EventTypes]),
            required=True,
        )),
        # Observables which are the subject of this event, e.g. a file SCO
        ('subject_refs', ListProperty(
            ReferenceProperty(
                spec_version='2.1',
                valid_types=["observed-data", "SCO"],
            ),
        )),
        # Observables which are associated with this event, e.g. a user account
        # which created or accessed a file described in `subject_refs`
        ('associated_object_refs', ListProperty(
            AnnotatedReferenceProperty(
                annotation_types={
                    "ref": ReferenceProperty(
                        spec_version="2.1", required=True,
                        valid_types=["observed-data", "SCO", ],
                    ),
                    "association_types": ListProperty(
                        EnumProperty(allowed=[
                            e.value for e in EventAssociationTypes
                        ]),
                    ),
                    "note": StringProperty(),
                },
            )
        )),
        # Raw content of the event may be stored if necessary
        ('content', StringProperty()),
        ('content_ref', ReferenceProperty(
            spec_version='2.1', valid_types='artifact',
        )),
        # When the event was generated, as opposed to when it was observed
        ('generated', TimestampProperty()),
        # We might only know a lower or upper bound for when an event occurred
        ('generated_after', TimestampProperty()),
        ('generated_before', TimestampProperty()),
        ('generated_by_ref', ReferenceProperty(
            spec_version='2.1',
            valid_types=['software', 'process']
        )),
        # When the event was observed in its described form, and by/from whom.
        # This has no default value so as to ensure that it can be omitted if
        # we either don't care when it was observed or if we can simply infer
        # that the observation was made when the STIX object was created.
        ('observed', TimestampProperty()),
        # Observables which are contextually correlated with this event, e.g.
        # preceding or subsequent events, network traffic, etc.
        ('correlates_with_refs', ListProperty(
            ReferenceProperty(
                spec_version='2.1',
                valid_types=["observed-data", "SCO"],
            )),
        ),
    ), id_contrib_props=(
        "event_types", "subject_refs",
        "content", "content_ref", "generated", "generated_by_ref",
    ),
)
class ObservedEvent(object):
    pass

Hopefully that makes it clear what I'm trying to accomplish - the associated_object_refs property is the interesting one in this case. The should be some property mutexes and whatnot in a clean() method, don't worry about that stuff :)

I don't fully understand what you mean by using a custom object or extension, unless I didn't make it clear that I was already making a custom observable and that's what you meant. Is there some way I can use a custom object in a way that would help me define the associated_object_refs more nicely?

maybe-sybr commented 3 years ago

I've just put up #513 as a draft which enforces the current requirement across all custom objects, and uses isinstance() rather than the MRO helper.

rpiazza commented 3 years ago

I agree that the spec's normative language says "id value(s) MUST use ref(s) suffix in property name", and not the other way around. However, given this fact, if the editors made any change to the spec it would probably say something "the ref(s) suffix SHOULD NOT be used unless the values are id(s).

maybe-sybr commented 3 years ago

I left a couple of comments on #513 about how I might like to solve/work around the issue with defining a "ref-like" things and having the stix2 library not barf on it: https://github.com/oasis-open/cti-python-stix2/pull/513#pullrequestreview-697757608

tl;dr My idea of using a virtual subclass described above is predicated on ReferenceProperty (or some superclass thereof) having the ABCMeta metaclass, which might not be palatable. Happy to discuss here or in response to the comments I put on the PR.

maybe-sybr commented 3 years ago

Bumping this now that it seems much of the backlog related to extensions is merged. Has anyone like @chisholm or @rpiazza had a chance to look over #513? Is there anything which I should raise on the TCS repo to get an authoritative answer on to help progress this?

chisholm commented 3 years ago

Some of my initial reaction:

The STIX generator was written with an assumption that ref/refs properties would contain STIX IDs. It essentially is looking for "connection points" within an object. Being able to following the links in a graph structure is important for checking graph-oriented things like connectedness and cycles. There needs to be a way to recognize a link when you see one. I guess auto-detecting it based on what the values "look like" may work, but it seems ugly. Doing regex and/or UUID checks would certainly not perform as well as just checking for a suffix on a property name. If you wanted to write a generator spec for your custom object and use the stix generator to randomly generate content which includes your custom object and correctly handles its references, this design would not work.

More generally, if anyone wants to treat STIX content in a graph-oriented way and have a nice simple way to identify the connection points in an object so that object graphs can be traversed, you need to enforce the naming convention. So I would prefer the spec to say if property name ends in _ref, its value MUST be a STIX ID, and similarly for _refs. This gives you a simple reliable way to find the connections.

If you want to add metadata about an SRO-based relationship, you could always add an extension to an SRO. Can't really do that with embedded relationships though, which seems like what you're trying to work around. Could you link your objects with SROs instead, and implement annotations as SRO extensions?

maybe-sybr commented 3 years ago

More generally, if anyone wants to treat STIX content in a graph-oriented way and have a nice simple way to identify the connection points in an object so that object graphs can be traversed, you need to enforce the naming convention. So I would prefer the spec to say if property name ends in _ref, its value MUST be a STIX ID, and similarly for _refs. This gives you a simple reliable way to find the connections.

In my constructed "reference-like" properties, I'm creating a required sub-property called ref which is what would be followed, but that's entirely down to my implementation and I certainly see how introducing this kind of ambiguity in the data model (even if it is resolved in an implementation) is problematic. wrt the generator, it could be taught about these "value added" references, but that's nasty and still only patches over the data model level issue.

If you want to add metadata about an SRO-based relationship, you could always add an extension to an SRO. Can't really do that with embedded relationships though, which seems like what you're trying to work around. Could you link your objects with SROs instead, and implement annotations as SRO extensions?

I've steered clear of SROs so far because having to go hunting for an external relationship which might not exist in large datasets has cratered my performance previously. In theory they're a nice solution but in practice I got bitten and learned to avoid them. When I was using them though, that was for core spec defined SCOs; since I'm in control of these objects, I could embed a reference to an SRO in the SCO as a "there is a relationship with some external object which is described using this SRO". It do have a gut feeling that is pretty sketchy from a data model perspective though.

Edit: As an extra note, the reason why I've brought this up is because I have active (although non-critical) datasets which contain these annotated reference properties already which is why I'm running on top of a staging branch with #513 merged at the moment. I'm thinking about how I can introduce a layer where I translate them as they're pulled from a datastore in the event that I need to vacate the _ref suffix as we move forward. My current idea is to implement this as part of the instantiation for the small number of objects where I've used the current AnnotatedReference property recipe, to simply rename the property to something like _aref. My reference graph traversal code descends until it finds ReferencePropertys so I think it'll all kind of work after that point. <- That idea looks hard because I'd have to do something disgusting to get access to the kwargs passed in before the base class :(

maybe-sybr commented 3 years ago

I've ended up implementing my idea from the comment above today and have only had to break in to access _STIXBase's metaclass, which isn't great but I can accept that'll likely be stable enough to move forward with. Since I operate purely in a closed Python ecosystem (which is why I'm a bit cavalier about bending STIX to my will), I'm not too worried about the fact that my older data might have busted property names, as long as I can migrate toward doing "the right thing".

So that's a potential way forward for me if it's we end up accepting the premise that _ref/_refs properties must be identifiers, codified by the checks for ReferenceProperty in this library. #513 would still be a decent change but perhaps of less immediate value since ReferenceProperty is simple enough that making subclasses wouldn't provide much value to a downstream code author. That said, it probably can't hurt.

My suggestions about minor changes on top of #513 to make it possible to add other valid "reference property types" via appending to the mapping added in #513 or via virtual subclassing could be disregarded.

I'm going to finish up with this code which should let me paper over my existing data model abuses, and depending on what we land on here or if I have to make a decision to release on top of vanilla stix2 v3.0, I'll have it ready to go.