Closed jordan2175 closed 5 years ago
Date Added and IDs pose a particular issue for STIX Objects. There can be multiple versions of an object per ID/date added. There has to be a way to iterate over objects when the number of versions associated with a particular Object id exceeds the item limit of a client or server. For STIX objects, a concatenation of the object id + modification provides a deterministic way to iterate over all of the objects (and versions) in a Collection, no matter what their size.
The goal of this implementation is to:
The server can only indicate that more results may be available. The server cannot guarantee how many, if any, additional items will be available on the next request. In the case where:
The server will respond that more Collections may be available for the client. The client retrieves the first 10 Collections from the server. However, the 11th Collection could be deleted before the client makes an additional request. The response to the second request could return 1 object (no change), 10 more objects or none.
The maximum number of items returned in a response is the lesser of the client-specified and server-specified limits. If the client indicates it can receive up to 20 items and the server allows 15, a maximum of 15 items is returned.
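The limit rule above can be sketched in a few lines. This is only an illustration: the function name `effective_limit` is hypothetical, and treating a missing client limit as "use the server limit" is an assumption, not something stated above.

```python
from typing import Optional


def effective_limit(client_limit: Optional[int], server_limit: int) -> int:
    """Return the maximum number of items a single response may carry."""
    if client_limit is None:
        # Assumption: a client that states no limit gets the server's cap.
        return server_limit
    # Otherwise the smaller of the two limits wins.
    return min(client_limit, server_limit)


# The example from the text: client accepts 20, server allows 15.
print(effective_limit(20, 15))
```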
Manage requests for additional endpoint items using Hypermedia As The Engine Of Application State (HATEOAS). Rather than specify how the client requests the additional data, the server provides a hyperlink (URL) to retrieve the data. For example:
Client | GET [api-root]/collections/ |
---|---|
Server |
{ "collections": [ { "id": "91a7b528-80eb-42ed-a74d-c6fbd5a26116", "title": "High Value Indicator Collection", "description": "This data collection is for collecting high value IOCs", "can_read": true, "can_write": false, "media_types": ["application/vnd.oasis.stix+json; version=2.0"] }, ...], "_links": { "next": { "href": "https://taxii.foo.com/collections?after=52892447-4d7e-4f70-b94d-d7f22742ff63" } } } |
Client | GET [api-root]/collections?after=52892447-4d7e-4f70-b94d-d7f22742ff63 |
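On the client side, following the server-provided link might look like the sketch below. It assumes the response shape shown in the exchange above (`collections` plus an optional `_links.next.href`); the `fetch` callable and the canned `FAKE_PAGES` responses are hypothetical stand-ins for a real HTTP client.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_collections(first_url: str, fetch: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every collection by following the server's next link."""
    url: Optional[str] = first_url
    while url is not None:
        page = fetch(url)
        # A follow-up page may legitimately be empty.
        yield from page.get("collections", [])
        # The client never builds the next URL itself; it only follows the
        # link the server supplied and stops when no link is present.
        url = page.get("_links", {}).get("next", {}).get("href")


# Hypothetical canned responses standing in for a TAXII server.
START = "https://taxii.foo.com/collections/"
NEXT = "https://taxii.foo.com/collections?after=52892447-4d7e-4f70-b94d-d7f22742ff63"
FAKE_PAGES = {
    START: {
        "collections": [{"id": "52892447-4d7e-4f70-b94d-d7f22742ff63"}],
        "_links": {"next": {"href": NEXT}},
    },
    NEXT: {
        "collections": [{"id": "91a7b528-80eb-42ed-a74d-c6fbd5a26116"}],
    },
}

ids = [c["id"] for c in iter_collections(START, FAKE_PAGES.__getitem__)]
print(ids)
```

Because the URL is opaque to the client, the server is free to change what `after` means without breaking this loop.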
Collections, Manifests and Objects, the paginated endpoints, identify unique items differently. Collections are simple, where a unique item can be identified by the Collection ID. In TAXII servers that support object versioning, there is no single property that identifies a unique item on the Objects endpoint, because items are versions of objects. With versioning support, the Objects endpoint can uniquely identify items by concatenating the object id and modified properties. In TAXII servers without versioning support, the Object id by itself uniquely identifies an item on the Objects endpoint. Internally, TAXII servers may have an opaque unique identifier for each object version, depending on their database schema/strategy, such as an autogenerated primary key or a computed field.
By allowing the server to provide an opaque URL for (possible) additional items:
If we were to standardize an after token for versioned items, a value would likely look like:
indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b--2016-11-01T03:04:05.634Z
<object-id>--<modified>
A 73-character string is a verbose way to identify an item.
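A minimal sketch of building and splitting such a token, assuming the `<object-id>--<modified>` layout shown above. The helper names `make_after_token` and `parse_after_token` are hypothetical.

```python
from typing import Tuple

SEPARATOR = "--"


def make_after_token(object_id: str, modified: str) -> str:
    """Join an object id and its modified timestamp into one token."""
    return f"{object_id}{SEPARATOR}{modified}"


def parse_after_token(token: str) -> Tuple[str, str]:
    """Split a token back into (object_id, modified)."""
    # STIX ids already contain "--" between type and UUID, so split on the
    # last occurrence only; the timestamp itself never contains "--".
    object_id, modified = token.rsplit(SEPARATOR, 1)
    return object_id, modified


token = make_after_token(
    "indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b",
    "2016-11-01T03:04:05.634Z",
)
print(token)
print(parse_after_token(token))
```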
The <api-root>/objects/ endpoint returns a STIX bundle. If we add a HATEOAS property to bundle, we end up polluting STIX with TAXII implementation details. Another possibility is using Link Headers like section 3.5 of RFC-8288.
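To illustrate the Link header option, here is a rough sketch of emitting and reading a `rel="next"` link in the RFC 8288 header syntax. The helper names are hypothetical, and the parser is deliberately naive: it assumes `rel` is the first parameter and carries a single relation type, which a production client should not rely on.

```python
import re
from typing import Optional


def next_link_header(url: str) -> str:
    """Build a Link header value that points at the next page."""
    return f'<{url}>; rel="next"'


# Naive pattern: a <target> immediately followed by a rel=next parameter.
_NEXT = re.compile(r'<([^>]*)>\s*;\s*rel="?next"?')


def parse_next_link(header_value: str) -> Optional[str]:
    """Return the rel=next target from a Link header value, if present."""
    match = _NEXT.search(header_value)
    return match.group(1) if match else None


value = next_link_header(
    "https://taxii.foo.com/collections/52892447-4d7e-4f70-b94d-d7f22742ff63/objects?after=8675309"
)
print(value)
print(parse_next_link(value))
```

With this approach the bundle body stays pure STIX and the paging hint travels only in the HTTP response headers.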
For server implementors with SQL databases, the initial Collections query with a 10 item limit would look like:
SELECT * FROM collections ORDER BY id LIMIT 10
The follow-on query derived from:
GET [api-root]/collections?after=52892447-4d7e-4f70-b94d-d7f22742ff63
could be:
SELECT * FROM collections WHERE id > '52892447-4d7e-4f70-b94d-d7f22742ff63' ORDER BY id LIMIT 10
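The two Collections queries can be tried end to end with an in-memory SQLite database. This is a sketch under assumptions: the one-column `collections` table, the made-up ids, and the `fetch_collection_page` helper are all hypothetical, and the `after` value is bound as a parameter rather than pasted into the SQL string.

```python
import sqlite3
from typing import List, Optional


def fetch_collection_page(
    conn: sqlite3.Connection, after: Optional[str], limit: int = 10
) -> List[str]:
    """Return up to `limit` collection ids that sort after `after`."""
    if after is None:
        # Initial request: start from the lowest id.
        rows = conn.execute(
            "SELECT id FROM collections ORDER BY id LIMIT ?", (limit,)
        )
    else:
        # Follow-on request: resume strictly after the last id seen.
        rows = conn.execute(
            "SELECT id FROM collections WHERE id > ? ORDER BY id LIMIT ?",
            (after, limit),
        )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collections (id TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO collections VALUES (?)",
    [(f"id-{n:02d}",) for n in range(25)],  # made-up ids for the demo
)

first = fetch_collection_page(conn, None)
second = fetch_collection_page(conn, first[-1])
print(first[-1], second[0])
```

Because each page resumes from the last id seen rather than from a row offset, a Collection deleted between requests does not shift later pages.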
If you store versions inside an objects table as enumerations:
SELECT * FROM objects ORDER BY date_added, id LIMIT 10
SELECT * FROM objects WHERE CONCAT(CONCAT(id, '--'),modified) > 'indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b--2016-11-01T03:04:05.634Z' ORDER BY date_added, CONCAT(CONCAT(id, '--'),modified) LIMIT 10
The date added has to be in the query above because the 2.0 spec requires objects come back in "date added" order. This is kind of weird, because the versions themselves may have been added later.
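One possible alternative, which is not what the queries above do, is to carry all three sort columns in the cursor so that the filter and the ordering agree. The sketch below assumes an `objects` table with `date_added`, `id`, and `modified` columns and a hypothetical cursor of the last row's `(date_added, id, modified)`; the sample rows are made up.

```python
import sqlite3
from typing import List, Optional, Tuple

Cursor = Tuple[str, str, str]  # (date_added, id, modified) of the last row seen


def fetch_object_page(
    conn: sqlite3.Connection, after: Optional[Cursor], limit: int = 10
) -> List[Cursor]:
    """Return the next page of versions in (date_added, id, modified) order."""
    select = "SELECT date_added, id, modified FROM objects "
    order = "ORDER BY date_added, id, modified LIMIT ?"
    if after is None:
        return [tuple(r) for r in conn.execute(select + order, (limit,))]
    d, i, m = after
    # Lexicographic "greater than" over the three sort columns.
    where = (
        "WHERE date_added > ? "
        "OR (date_added = ? AND (id > ? OR (id = ? AND modified > ?))) "
    )
    return [tuple(r) for r in conn.execute(select + where + order, (d, d, i, i, m, limit))]


rows = [
    ("2016-11-01", "indicator--a", "2016-11-01"),
    ("2016-11-01", "indicator--a", "2016-11-02"),
    ("2016-11-01", "indicator--b", "2016-11-01"),
    ("2016-11-02", "indicator--a", "2016-11-03"),
    ("2016-11-03", "indicator--c", "2016-11-03"),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (date_added TEXT, id TEXT, modified TEXT)")
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)

seen: List[Cursor] = []
page = fetch_object_page(conn, None, limit=2)
while page:
    seen.extend(page)
    page = fetch_object_page(conn, page[-1], limit=2)
print(len(seen))
```

This still depends on the three columns together being unique per version; rows that tie on all three would be skipped.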
If you store object versions in their own table and they have their own database-maintained integer primary key (versions.id):
GET [api-root]/collections/52892447-4d7e-4f70-b94d-d7f22742ff63/objects?after=8675309
SELECT * FROM versions WHERE id > 8675309 ORDER BY id LIMIT 10
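A sketch of how a server might derive the next URL from the integer key. The `versions` table layout, the `body` column, and the `objects_page` helper are assumptions for illustration. A next URL is returned only when the page is full, which signals that more items may exist, not that they do.

```python
import sqlite3
from typing import List, Optional, Tuple


def objects_page(
    conn: sqlite3.Connection, base_url: str, after: int = 0, limit: int = 10
) -> Tuple[List[str], Optional[str]]:
    """Return one page of version bodies and an optional next URL."""
    rows = conn.execute(
        "SELECT id, body FROM versions WHERE id > ? ORDER BY id LIMIT ?",
        (after, limit),
    ).fetchall()
    bodies = [body for _, body in rows]
    next_url = None
    if rows and len(rows) == limit:
        # Full page: more versions may follow the last key we returned.
        next_url = f"{base_url}?after={rows[-1][0]}"
    return bodies, next_url


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE versions (id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT)"
)
conn.executemany(
    "INSERT INTO versions (body) VALUES (?)",
    [(f"version-{n}",) for n in range(25)],  # placeholder payloads
)

BASE = "https://taxii.foo.com/collections/52892447-4d7e-4f70-b94d-d7f22742ff63/objects"
bodies, next_url = objects_page(conn, BASE)
print(len(bodies), next_url)
```

How the next URL reaches the client (response body or a Link header) is left open here.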
If your TAXII server doesn't support versions, you can just use Object ids without having to accept a modified parameter that will never be used:
GET [api-root]/collections/52892447-4d7e-4f70-b94d-d7f22742ff63/objects?after=indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b
SELECT * FROM objects WHERE id > 'indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b' ORDER BY date_added, id LIMIT 10
For NoSQL databases, you may add a custom version_id field for use by a secondary index.
NineFX will implement this on the server side if folks agree it's worth testing.
We have added support for pagination in TAXII 2.1. This does not solve the problem of a system adding a million records in a single database transaction where that transaction uses the same date added value for each entry (and the TAXII server limits the number of results per page to something much less than 1 million records). But this may just be an implementation-specific problem that an individual vendor would need to solve.
We need to identify if pagination support is required for TAXII. We had item-based pagination support in TAXII 2.0, and took it out for TAXII 2.1. The problems we ran into were that data sets that change rapidly make item-based pagination impossible. Further, item-based pagination proved to be very computationally expensive for large data sets.
There is a use case where a system may add millions of records to the TAXII server in a single database transaction. Situations like this may also record the same "date added" for each record in that database transaction. This means that filtering / pagination based on date added would not be possible.
There needs to be some sort of solution that allows a client to tell the server, based on some monotonically increasing counter, where to start and whether to return records before or after that point.
Success criteria for this feature are the ability to handle rapidly changing datasets, the ability to handle really large datasets, and a performant solution.
The endpoints that need pagination are:
GET /collections/ - see section 5.1.
GET /collections//objects/ - see section 5.3.
GET <api-root>/collections//objects// - see section 5.5.
GET /collections//manifest/ - see section 5.6.
The Object by ID resource can contain a significant number of object versions, which become unwieldy to manage in a single request/response pair. Without a mechanism to manage highly-versioned objects, effective transport is significantly limited.