Using UUIDv5 for SDOs and SROs

rpiazza commented 2 years ago

The specification was written to encourage use of UUIDv5 for SCOs to avoid duplication of objects that represent the same thing - e.g., an IP address. There is an algorithm in the spec that should be used to generate the UUIDv5 ids, based on specified properties for each SCO and an explicitly defined namespace. Other algorithms may be used, as described in this text from section 3.4:

The default algorithm that creates the SCO ID based on those named properties is a UUIDv5 as defined in Section 2.9, however, other algorithms for creating the SCO ID MAY be used.
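A minimal Python sketch of that default approach, for illustration only. It assumes the ID-contributing property of ipv4-addr is `value`, assumes the namespace named in the section 2.9 text is the one reserved for SCOs, and approximates the JSON canonicalization step with `json.dumps` (sorted keys, no whitespace), which is only adequate for simple string values. The helper name `sco_id` is hypothetical.

```python
import json
import uuid

# Namespace named in the section 2.9 text; assumed here to be the SCO namespace.
SCO_NAMESPACE = uuid.UUID("00abedb4-aa42-466c-9c01-fed23315a9b7")


def sco_id(sco_type: str, contributing: dict) -> str:
    """Build a deterministic SCO identifier from ID-contributing properties."""
    # Approximation of canonical JSON: sorted keys, no insignificant whitespace.
    name = json.dumps(
        contributing, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return f"{sco_type}--{uuid.uuid5(SCO_NAMESPACE, name)}"


first = sco_id("ipv4-addr", {"value": "198.51.100.3"})
second = sco_id("ipv4-addr", {"value": "198.51.100.3"})
print(first == second)  # two producers describing the same address get the same ID
```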

Using UUIDv5 ids for SDO/SROs is not explicitly discussed in the spec, but is not explicitly prohibited either. The following text from section 2.9 can be read as allowing UUIDv5 ids for them:

STIX Domain Objects, STIX Relationship Objects, STIX Meta Objects, and STIX Bundle Object SHOULD use UUIDv4 for the UUID portion of the identifier. Producers using something other than UUIDv4 need to be mindful of potential collisions and should use a namespace that guarantees uniqueness, however, they MUST NOT use a namespace of 00abedb4-aa42-466c-9c01-fed23315a9b7 if generating a UUIDv5.

There is at least one use case for using UUIDv5 ids for SDOs - representing CVEs using the Vulnerability SDO.

It was recognized that having many duplicate Vulnerability objects to represent a particular CVE is not ideal. For this reason, the common STIX object repository includes a "canonical" Vulnerability object for each CVE, and the repository is updated nightly to include the CVEs created that day.

However, because of the large number of CVEs (over 100,000), this does not seem to be an ideal solution. A simpler solution would be to use a UUIDv5 based on the CVE id. All producers could determine the appropriate vulnerability id without having to store the object or obtain it from the common object repository, and could just use the id for references to the CVE.

Based on the text from section 2.9, this is already possible, but the explicitly defined namespace CANNOT BE USED. This implies that each producer would pick its own namespace, which would most likely differ from those of other producers, defeating the whole purpose of using UUIDv5s. Of course, such a namespace could be published so it is known to the community - but that seems problematic.
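To make the namespace problem concrete, here is a small Python sketch. Both namespaces are hypothetical (nothing like them is defined by the spec), and the upper-casing of the CVE id is an assumed normalization step, not something the spec prescribes.

```python
import uuid

# Hypothetical namespaces chosen independently by two producers.
PRODUCER_A_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "producer-a.example")
PRODUCER_B_NS = uuid.uuid5(uuid.NAMESPACE_DNS, "producer-b.example")


def vulnerability_id(cve_id: str, namespace: uuid.UUID) -> str:
    """Derive a Vulnerability SDO id from a CVE id and a namespace."""
    # Assumed normalization so that case and stray whitespace do not change the id.
    return f"vulnerability--{uuid.uuid5(namespace, cve_id.strip().upper())}"


id_a = vulnerability_id("CVE-2021-44228", PRODUCER_A_NS)
id_b = vulnerability_id("CVE-2021-44228", PRODUCER_B_NS)

# Within one namespace the id is stable and can be computed without any lookup.
print(id_a == vulnerability_id("cve-2021-44228", PRODUCER_A_NS))
# Across namespaces the ids differ, so references from the two producers do not line up.
print(id_a == id_b)
```

With a single agreed namespace, every producer would compute the same id for a given CVE.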

The proposal suggested in this issue is to explicitly allow the use of UUIDv5 for certain SDO/SROs.

dc3-tsd commented 2 years ago

Relationships have a strong use case for UUIDv5s when recording DNS resolutions. While you can store resolution information for domain names in the resolves_to_refs property, this doesn't permit time-bounding the information. As such, any time data for it would need to come from an Observed Data object containing both.

Unfortunately, this does not effectively convey the information, since the first_observed and last_observed fields of observed-data only tell you that the resolution occurred X times within the time range, not that it held true for the entire duration of the range.

Using an external relationship instead makes this far easier. A relationship has an explicit start_time and stop_time, which make it very clear that this is the exact time period during which the DNS resolution held true.

Using a UUIDv5-based relationship with created set equal to start_time allows a very fast distributed mapping of this resolution that can stream updates without issue and without requiring systems to be re-architected to store STIX IDs internally. While the resolution is still valid, stop_time is omitted; the current description of stop_time describes our understanding of this perfectly:

“If stop_time is not specified, then the latest time at which the relationship between the objects exists is either not known, not disclosed, or has no defined stop time.”

This technique allows any vendor that can already map things like domain resolution or certificate hosting history over time to quickly provide a STIX output that is easy for existing systems to ingest. Our current proposal suggests using the following properties to generate a UUIDv5 for relationships: relationship_type, labels, source_ref, target_ref, created_by_ref, created, and start_time.

If the resolution lapses and DNS had a hole in it, a new ID would be generated for the new start time, with no relationships showing resolution in that time window.
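The proposal could be sketched in Python roughly as follows. The namespace, the helper name `relationship_id`, the `json.dumps` serialization of the contributing properties, and the placeholder references are all assumptions made for illustration; only the list of properties comes from the proposal.

```python
import json
import uuid

# Hypothetical namespace for illustration; the thread does not name one.
REL_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "relationships.example")

# Properties proposed above as inputs to the relationship ID.
ID_PROPERTIES = (
    "relationship_type", "labels", "source_ref", "target_ref",
    "created_by_ref", "created", "start_time",
)


def relationship_id(rel: dict) -> str:
    """Derive a deterministic relationship ID from the proposed properties."""
    contributing = {k: rel[k] for k in ID_PROPERTIES if k in rel}
    name = json.dumps(contributing, sort_keys=True, separators=(",", ":"))
    return f"relationship--{uuid.uuid5(REL_NAMESPACE, name)}"


# Placeholder references; real ones would be the IDs of the SCOs involved.
domain_ref = f"domain-name--{uuid.uuid5(REL_NAMESPACE, 'example.com')}"
ip_ref = f"ipv4-addr--{uuid.uuid5(REL_NAMESPACE, '198.51.100.3')}"

first_window = {
    "relationship_type": "resolves-to",
    "source_ref": domain_ref,
    "target_ref": ip_ref,
    "created": "2023-01-01T00:00:00.000Z",
    "start_time": "2023-01-01T00:00:00.000Z",
}
# Closing the window adds stop_time, which is not an ID input, so the ID is unchanged.
closed_window = {**first_window, "stop_time": "2023-02-01T00:00:00.000Z"}
# After a gap, the new start_time (and created) yields a different ID.
second_window = {
    **first_window,
    "created": "2023-03-01T00:00:00.000Z",
    "start_time": "2023-03-01T00:00:00.000Z",
}

print(relationship_id(first_window) == relationship_id(closed_window))
print(relationship_id(first_window) == relationship_id(second_window))
```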

pcoccoli commented 1 year ago

I find it odd that SCOs use UUIDv5 but the Observed Data SDO doesn't. Observed Data SDO is effectively a container for SCOs, logically equivalent to a log event. Having a unique ID here helps a ton in data deduplication. A use case is stix-shifter: if you run the same STIX pattern (bounded with START and STOP) against a data source twice, you expect to see the same observations.

rpiazza commented 1 year ago

@pcoccoli - I'm not sure I understand your question. Yes - the Observed Data object is essentially a log event. Wouldn't each log event have different first_observed/last_observed times? If you are seeing the same event over and over, you could create a new version of that Observed Data object with an updated last_observed time, and keep track of the number of times you saw the event in the number_observed property.
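A rough Python sketch of that versioning approach, using plain dictionaries rather than any particular STIX library. The helper name `record_repeat` and the sample object are hypothetical, and timestamps are assumed to be UTC strings in a single fixed format.

```python
import copy
import uuid


def record_repeat(observed: dict, seen_at: str, modified_at: str) -> dict:
    """Return a new version of an Observed Data object after one more occurrence."""
    new_version = copy.deepcopy(observed)
    # Timestamps are assumed to share one fixed UTC format, so string
    # comparison orders them chronologically.
    new_version["last_observed"] = max(observed["last_observed"], seen_at)
    new_version["number_observed"] = observed["number_observed"] + 1
    # Same id and created; only modified moves forward for the new version.
    new_version["modified"] = modified_at
    return new_version


original = {
    "type": "observed-data",
    "id": f"observed-data--{uuid.uuid4()}",
    "created": "2023-01-01T00:00:00.000Z",
    "modified": "2023-01-01T00:00:00.000Z",
    "first_observed": "2023-01-01T00:00:00.000Z",
    "last_observed": "2023-01-01T00:00:00.000Z",
    "number_observed": 1,
}
updated = record_repeat(
    original, "2023-01-02T00:00:00.000Z", "2023-01-02T00:05:00.000Z"
)
print(updated["id"] == original["id"], updated["number_observed"])
```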

SYNchroACK commented 5 months ago

> The proposal suggested in this issue is to explicitly allow the use of UUIDv5 for certain SDO/SROs.

IMHO, this sentence alone raises problems regarding the STIX versioning principles, but if your point is strictly the following, I would agree:

"A simpler solution would be to use a UUIDv5, based on the CVE id. All producers could determine what the appropriate vulnerability id is without having to store the object or obtain it from the common object repository, and just use the id for references to the CVE."

However, this also means that people can only use the OASIS STIX namespace to determine the STIX ID of the Vulnerability object, but NEVER to generate new Vulnerability objects with that ID.

I believe this points to a need I've been noticing: a library on top of stix2 that deals with these use cases.

SYNchroACK commented 5 months ago

@rpiazza what about chatting with the cve.org folks about also providing a STIX version in their repo?

Last year they launched the JSON 5 format; this year, with MITRE's help, they could launch a version in STIX format.

This approach ensures that the source of vulnerability management is also the producer of the STIX objects and keeps them updated following STIX principles.

SYNchroACK commented 5 months ago

> I find it odd that SCOs use UUIDv5 but the Observed Data SDO doesn't. Observed Data SDO is effectively a container for SCOs, logically equivalent to a log event. Having a unique ID here helps a ton in data deduplication. A use case is stix-shifter: if you run the same STIX pattern (bounded with START and STOP) against a data source twice, you expect to see the same observations.

@pcoccoli Yes, I totally understand your point. The only reason Observed Data, as it is today, cannot be an SCO with a UUIDv5 is its number_observed field. An SCO does not have versions and is therefore not expected to change over time. The number_observed field makes a case for updating the existing Observed Data object, in order to avoid creating a new Observed Data object every time you see the same "observation".

One possible approach to convert Observed Data to an object with no versions (SCO UUIDv5) would be something like:

removing the fields:
- first_observed
- last_observed
- number_observed
adding a field like:
- timestamp.

Again, this would force producers to generate a lot of observed-data objects.

So, I believe the TC chose the current approach in order to avoid a large number of objects, even though it does not allow the deduplication you would understandably expect.

jordan2175 commented 5 months ago

@SYNchroACK Observed Data is a deprecated object. It is an artifact from when we first started building STIX 2. It represents a Graph inside of a Graph. The reason we went that way is we did not want every IP address to have a unique ID. It was not until we better understood how to use UUIDv5 that we looked at making that change.

To address your other comment, SCOs are "facts" or empirical data that does not change and is not open to debate or confidence or other bits of data. You connect SCOs to intelligence and that intelligence can change and what not, or be added to. This is why there are UUIDv4 addresses for SDOs and UUIDv5 for SCOs.

SYNchroACK commented 5 months ago

> @SYNchroACK Observed Data is a deprecated object. It is an artifact from when we first started building STIX 2. It represents a Graph inside of a Graph. The reason we went that way is we did not want every IP address to have a unique ID. It was not until we better understood how to use UUIDv5 that we looked at making that change.
>
> To address your other comment, SCOs are "facts" or empirical data that does not change and is not open to debate or confidence or other bits of data. You connect SCOs to intelligence and that intelligence can change and what not, or be added to. This is why there are UUIDv4 addresses for SDOs and UUIDv5 for SCOs.

@jordan2175 You mean this Observed Data is deprecated? https://docs.oasis-open.org/cti/stix/v2.1/os/stix-v2.1-os.html#_p49j1fwoxldc

dc3-tsd commented 5 months ago

I think there might be some confusion here. The usage of the objects property within Observed Data is deprecated in favor of object_refs. Observed Data itself is still very much supported and required for a number of use cases including for Sightings.

In this context the usage of deterministic IDs for both Observed Data and Sightings (as a type of relationship) would likely be extremely useful to prevent data duplication.

SYNchroACK commented 5 months ago

> I think there might be some confusion here. The usage of the objects property within Observed Data is deprecated in favor of object_refs. Observed Data itself is still very much supported and required for a number of use cases including for Sightings.

Yup, exactly!

> In this context the usage of deterministic IDs for both Observed Data and Sightings (as a type of relationship) would likely be extremely useful to prevent data duplication.

Well, in fact, even the Relationship object should have a deterministic ID; however, with the current core structure of the objects, that cannot be achieved. In order to meet that goal (which I totally agree with), a core restructure is needed, splitting objects into the following types:

Particles

An object, with or without a deterministic ID, that represents a set of properties like the following and must always have an embedded reference to an Atom object:

- OS Timestamps (atime, ctime, mtime, operating_system)
- Hashes (md5, sha1, sha256)
- Assertion Timestamps (start_time, stop_time, created_by_ref)
- Sight Timestamps (first_seen, last_seen, count, created_by_ref, created, modified)
- Observed Timestamps (first_observed, last_observed, count, created_by_ref, created, modified)
- Descriptive Context (description, description_type, created_by_ref, created, modified)
- Marking Attachment (marking_ref, object_ref, selectors, created_by_ref, created, modified)
- Intrusion Set Context (description, resource_level, goals, ..., created_by_ref, created, modified)

Notes

A particle ID may be UUIDv4 or UUIDv5 depending on the scenario:

- Hashes: clearly a case for UUIDv5.
- Sight Timestamps: a potential candidate for UUIDv5, even though the count property may need to be revisited.
- Descriptive Context: clearly a case for UUIDv4.

In practice, a particle can have a deterministic ID if the producer will never have to update it; otherwise, the versioning mechanism needs to be in place (as in STIX 2.1), which then makes the case for using UUIDv4.

Atoms

An object with a deterministic ID that represents a base STIX element, like:

- IPv4 Address (only with the value property)
- Directory (only with the path property), with an OS Timestamps particle
- Intrusion-Set (only with a canonical name), with an Intrusion Set Context particle

Notes

On objects that represent threats like Threat Actor, Intrusion-set, Malware:

- first_seen and last_seen will be replaced by the use of the Sighting object.
- aliases will be replaced by Relationship objects with a new relationship_type, alias-of, in order to keep track of who made that link, when, and with what confidence level.

Molecules

An object with a deterministic ID that represents a set of Atom objects, like:

- Sighting (only with sighting_of_ref, observed_data_refs, where_sighted_refs); the rest of the properties should be particles (Sight Timestamps, Descriptive Context) pointing to the Sighting.
- Observed Data (only with object_refs); the rest of the properties should be particles (Observed Timestamps) pointing to the Observed Data.
- Relationship (only with source_ref, target_ref, relationship_type); the rest of the properties should be particles (Assertion Timestamps, Descriptive Context) pointing to the Relationship.

Compounds

An object without deterministic IDs which represents a special set of Atom objects like:

- Report ...
- Incident ...

I have a draft of a proposal for a possible stix 3.0, in case you find it interesting, ping me. ;)

oasis-tcs / cti-stix2