oasis-open / cti-python-stix2

OASIS TC Open Repository: Python APIs for STIX 2
https://stix2.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
364 stars 119 forks source link

Allow optional properties with defaults to be explicitly unset #507

Closed maybe-sybr closed 3 years ago

maybe-sybr commented 3 years ago

When constructing an object with an optional property with some default value, it would be useful to be able to explicitly omit the property in exceptional circumstances.

As a concrete example, I have a custom observable type to describe "events" on a system of interest which has a property called observed. Typically I'd like that property to default to NOW to make my life easier elsewhere in the codebase, and I have a default= for the property to codify that. In exceptional circumstances, I may be sure that there is no meaningful value for when an observation was made, or I might wish to strip that attribute for the sake of comparing the data of two observed events.

I acknowledge that leaking information about the measuring system into the description of the observation itself isn't ideal but I haven't really had a better way to do this without introducing more indirection or lying about creation/modification datestamps sometimes (sometimes observations are done at different times to STIX object creation/publishing).

I'm currently being foiled by the following check which removes property values of None and empty lists from the kwargs passed to StixBase.__init__(): https://github.com/oasis-open/cti-python-stix2/blob/646aaa39a0c4f23bea0cc65b99aedb84017d4775/stix2/base.py#L151-L155

I'd like to be able to pass in a specific observed=UNSET which would cause StixBase._check_property() to skip setting a default (since it's present in the kwargs) but then pop it from the kwargs and skip cleaning the property (new behaviour) with a diff like the following:

diff --git a/stix2/base.py b/stix2/base.py
index c0e52e6..1d51e20 100644
--- a/stix2/base.py
+++ b/stix2/base.py
@@ -37,6 +37,8 @@ __all__ = ['STIXJSONEncoder', 'STIXJSONIncludeOptionalDefaultsEncoder', '_STIXBa
 DEFAULT_ERROR = "{type} must have {property}='{expected}'."
 SCO_DET_ID_NAMESPACE = uuid.UUID("00abedb4-aa42-466c-9c01-fed23315a9b7")

+UNSET = object()
+

 def get_required_properties(properties):
     return (k for k, v in properties.items() if v.required)
@@ -64,10 +66,11 @@ class _STIXBase(Mapping):
                 kwargs[prop_name] = value

         has_custom = False
-        if prop_name in kwargs:
+        unclean_prop_val = kwargs.pop(prop_name, UNSET)
+        if unclean_prop_val is not UNSET:
             try:
                 kwargs[prop_name], has_custom = prop.clean(
-                    kwargs[prop_name], allow_custom,
+                    unclean_prop_val, allow_custom,
                 )
             except InvalidValueError:
                 # No point in wrapping InvalidValueError in another

At the moment, this is mostly annoying because I'd like to make temporary new_version()s of some objects for dict-like comparision. I'm working around it by just dictifying the STIX objects, stripping the values I don't care about and then doing the comparisons. Happy to also hear any other suggestions for how I could approach the problem.

chisholm commented 3 years ago

Anyone can correct me if I'm wrong, but I think omitting a defaulted property is exactly when the default value is supposed to kick in. The net result is that a defaulted property can never truly be missing. E.g. API functions which tell you whether a property is present must always return True for a defaulted property. A defaulted property may be present because it was explicitly given or because it is implicit due to the default. But either way, it is present. A "delete" operation on a defaulted property must behave as "restore default value". (Which the stix2 library doesn't support since its objects are immutable, but one can achieve the effect by creating a new version.)

I don't think the spec is clear on this, but I think it applies the "default" concept to the abstract data model itself; it is not restricted to the JSON serialization. So a defaulted "optional" property is effectively not optional, no matter the representation. Labeling it as optional in data model documentation seems misleading. For what it's worth, I don't agree with that aspect of the spec. (If there is someplace in the spec which clarifies whether the "default" concept applies to the data model or JSON serialization, or if there is some place which clarifies how defaults are to be applied, I'd like to know. E.g. if the spec describes or implies whether it's okay to truly omit an optional defaulted property, not applying the default.)

So the stix2 library is written to ensure optional defaulted properties are always present and have at least the default value. The assumption is that to do otherwise would be noncompliant.

(Update: actually, I think the STIX spec doesn't define a defaulted optional property as often as I thought. So, not so bad. It's still there though, e.g. Sighting.summary.)

maybe-sybr commented 3 years ago

Thanks for the insight here and on #508 , @chisholm . I'll have a bit more of a think about this and whether I'm (ab)using defaults in a way that doesn't necessarily match how the spec thinks about them. :)

clenk commented 3 years ago

Depending on your codebase this might not be feasible or convenient, but would it be possible to handle defaulting observed in your code that instantiates the custom object instead of having a default on the property? In other words your instantiating code would decide whether to pass in datetime.now() or omit that property.

maybe-sybr commented 3 years ago

It sounds like my expectations are rooted in my assuming that defaults are for convenience rather than for data reduction for common cases, which feels like it might be the point from your comment above, @chisholm. I can't really see anything that seems authoritative in the STIX 2.1 spec regarding how default values should be treated, which is a bit unfortunate. As you pointed out in your edit, Sighting.summary has a default value of false, as does the common revoked property but I can't really see anywhere which describes what that actually means for implementers. As an interesting note, on master Sighting doesn't appear to define that property correctly anyway :o

>>> import uuid, stix.v21
>>> s = stix2.v21.Sighting(sighting_of_ref=f"campaign--{uuid.uuid4()}")                                                                      
>>> print(s.serialize(pretty=True, include_optional_defaults=True))                                                                          
{
    "type": "sighting",
    "spec_version": "2.1",
    "id": "sighting--c01378cb-0660-49ac-9632-04628ac14e99",
    "created": "2021-06-21T00:11:42.873611Z",
    "modified": "2021-06-21T00:11:42.873611Z",
    "sighting_of_ref": "campaign--ba57b043-e06c-423d-bc9f-52b16e278f8a",
    "revoked": false
}

I think part of the issue is that I'm also using a dynamic default value, and then being surprised when a downstream system is choosing to re-generate that, even though (for the case of observation time) it can't possibly know a sensible value for that property. By attempting to unset the property as described in the summary above, I was assuming that the system making that change had no right to do anything other than remove it, whereas your understanding of defaults suggests that it has every right to generate the default value as programmed.

Per @clenk's suggestion, I'm going to remove the default value for these properties and just deal with specifying the observation time when I generate these objects. That'll basically negate this issue and accept that the behaviour in #508 is also working as intended. Happy to close them now or continue discussion if either of you think it's worthwhile.

Since IIRC the 2.1 spec got voted on recently, perhaps the main action we could take away is that an implementers' note about default values could be written as an addendum or something to make the expectations more clear? Or if I've missed the fact that something like that already exists, let me know :) I think with the extra work done on custom definitions in 7.3 (which I'm wrapping my head around now... I have a bunch of definitions to potentially port on top of #468), having clear data model level expectations about these kind of subtleties will be really helpful for people like me.

Thanks again for the quick replies and input!

clenk commented 3 years ago

In Section 3.6 of the spec there is this:

If a property has the same value as the default, it MAY be omitted from a representation, and this does not represent a change to the object.

I believe we extrapolated our interpretation from that. There aren't any implementers' notes as far as I'm aware but I could see it being helpful. Maybe something to bring up with the TC in general if you wanted to.

Feel free to close the issues.

Thanks for pointing out the missing behavior for Sighting.summary! We'll get that fixed.

maybe-sybr commented 3 years ago

I believe we extrapolated our interpretation from that.

That makes a lot of sense. I can see how things flow logically from that statement. I'm still not entirely convinced that optionals with defaults are well enough described at the moment.

The bit in section 3.6 suggests that they should be omitted for serialisation with the expectation that any consumer MUST reconstruct the object with that same default value. That seems sensible and the schema based extensions described in CS02/3 section 7.3 would make the extension ecosystem a lot more reliable wrt that requirement.

Section 2.9 suggests that if no optional ID contributing properties are present then a UUIDv4 should be used, but it doesn't really differentiate between properties which are present by virtue of having a default value vs being truly absent. That ambiguity could lead to implementations having different deterministic ID generation behaviours depending on the interpretation I suppose. The UUIDv5s are described as SHOULD behaviours though so it's probably not a huge issue.

  • The value of the name portion SHOULD be the list of "ID Contributing Properties" (property-name and property value pairs) as defined on each SCO object and SHOULD be represented as a JSON object that is then serialized / stringified according to [RFC8785] to ensure a canonical representation of the JSON data.

  • If the contributing properties are all optional, and none are present on the SCO, then a UUIDv4 MUST be used.

I could definitely see a subsection under section 3 (general concepts), some content added to subsection 12.1 (producer/consumer conformance), and a few other bits and pieces to clarify the expectations. I'll make an issue on the TC github repo and xref back here for a bit of context.

In the meantime, removing the default from my Event definition and only setting when I have cause to has gone swimmingly because it turns out I rarely use that data anyway :) I'll close this and #508 now, but feel free to reply on any of the bits above if you care to.

chisholm commented 3 years ago

Section 2.9 suggests that if no optional ID contributing properties are present then a UUIDv4 should be used, but it doesn't really differentiate between properties which are present by virtue of having a default value vs being truly absent. That ambiguity could lead to implementations having different deterministic ID generation behaviours depending on the interpretation I suppose. The UUIDv5s are described as SHOULD behaviours though so it's probably not a huge issue.

I think of defaulting as pertaining to usage convenience. The thing being used in this context is the data model, and its usages include APIs, representations/serializations, etc. It follows that the thing being used must be defined first, before it makes sense to talk about any usages of the thing (including defaulting). I think all talk of defaulting should be removed from the definition of the data model and moved to the definition of the JSON serialization (and any other usages, as appropriate). That line about a property having the same value as the default would no longer make sense with respect to the data model, so it would be removed (or moved to JSON serialization discussion). Property selection for deterministic IDs should be defined based on the data model, not any serialization details. With no defaulting in the data model, it is very simple: either a property is present or it isn't. So there is no longer any ambiguity there. I think things would clear up a lot. And I think that was the intent with deterministic IDs. Ignore the defaulting: either a property is logically present on an object, or it isn't. That's what you use.

maybe-sybr commented 3 years ago

Hmm, interesting take. As a note, I did just briefly play around with that on master with the following script:

import stix2.properties
import stix2.v21

@stix2.v21.CustomObservable(
    "some-custom-type", properties=(
        ("foo", stix2.properties.StringProperty(default=lambda: "bar")),
        ("qux", stix2.properties.StringProperty()),
    ), id_contrib_props=(
        "foo", "qux",
    ),
)
class SomeCustomType():
    pass

o = SomeCustomType()
print(f"One might expect this to be a UUIDv4 - {o.id=}")
o = SomeCustomType()
print(f"But it's not - {o.id=}")
print(f"Because {o.foo=}")
print(o.serialize())

o = SomeCustomType(foo="bar")
print(o.id)
print(o.foo)
print(o.serialize())

o = SomeCustomType(foo="baz")
print(o.id)
print(o.foo)
print(o.serialize())

->

One might expect this to be a UUIDv4 - o.id='some-custom-type--570bad34-79b4-5d40-ad26-030daea3e167'
But it's not - o.id='some-custom-type--570bad34-79b4-5d40-ad26-030daea3e167'
Because o.foo='bar'
{"type": "some-custom-type", "spec_version": "2.1", "id": "some-custom-type--570bad34-79b4-5d40-ad26-030daea3e167"}
some-custom-type--570bad34-79b4-5d40-ad26-030daea3e167
bar
{"type": "some-custom-type", "spec_version": "2.1", "id": "some-custom-type--570bad34-79b4-5d40-ad26-030daea3e167"}
some-custom-type--5b7d4e9b-3578-5754-986f-3988d9288ad5
baz
{"foo": "baz", "type": "some-custom-type", "spec_version": "2.1", "id": "some-custom-type--5b7d4e9b-3578-5754-986f-3988d9288ad5"}

which, as expected given the current handling of default values in object instantiation, uses the default value of foo for ID generation rather than choosing to retain the UUIDv4 which the base class set.

Hopefully the TC has a chance to chat about this in a meeting soon, prompted by the issue I just put up. I'd suspect any major change to remove all concept of defaults out of the spec might need to wait for a STIX 2.2 or 3.0 depending on the TC's appetite for it. A clarification could probably land sooner since it probably wouldn't land in material change territory.