oasis-tcs / cti-taxii2

OASIS CTI TC: An official CTI TC repository for TAXII 2 work
https://github.com/oasis-tcs/cti-taxii2

Investigate pagination support #50

Closed jordan2175 closed 5 years ago

jordan2175 commented 6 years ago

We need to determine whether pagination support is required for TAXII. We had item-based pagination support in TAXII 2.0 and took it out for TAXII 2.1. The problem we ran into was that rapidly changing data sets make item-based pagination impossible. Further, item-based pagination proved to be very computationally expensive for large data sets.

There is a use case where a system may add millions of records to the TAXII server in a single database transaction. In situations like this, every record in the transaction may be recorded with the same "date added" value, which means that filtering or pagination based on date added would not be possible.
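A minimal sketch of this failure mode, using Python's built-in sqlite3 module. The `objects` table, its columns, the made-up ids, and the batch of five records are assumptions chosen for illustration, not part of any TAXII server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id TEXT PRIMARY KEY, date_added TEXT)")

# One bulk transaction stamps every record with the same date_added value.
batch_time = "2018-06-01T00:00:00.000Z"
with conn:
    conn.executemany(
        "INSERT INTO objects VALUES (?, ?)",
        [(f"indicator--{n:04d}", batch_time) for n in range(5)],
    )

PAGE_SIZE = 2

first_page = conn.execute(
    "SELECT id, date_added FROM objects ORDER BY date_added, id LIMIT ?",
    (PAGE_SIZE,),
).fetchall()
last_seen = first_page[-1][1]

# A strict "added after" filter skips the three records not yet delivered.
strict = conn.execute(
    "SELECT id FROM objects WHERE date_added > ? ORDER BY date_added, id LIMIT ?",
    (last_seen, PAGE_SIZE),
).fetchall()

# An inclusive filter returns the first page again, so the client never advances.
inclusive = conn.execute(
    "SELECT id FROM objects WHERE date_added >= ? ORDER BY date_added, id LIMIT ?",
    (last_seen, PAGE_SIZE),
).fetchall()
```

Neither filter lets the client reach the third record, which is why a tie-breaking value beyond the timestamp is needed.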

There needs to be some sort of solution that lets a client tell the server, based on some monotonically increasing counter, to start at a given point and return records either before or after it.

The success criteria for this feature are the ability to handle rapidly changing datasets and very large datasets, and a performant solution.

The endpoints that need pagination are:
GET <api-root>/collections/ - see section 5.1
GET <api-root>/collections/<id>/objects/ - see section 5.3
GET <api-root>/collections/<id>/objects/<object-id>/ - see section 5.5
GET <api-root>/collections/<id>/manifest/ - see section 5.6

The Object by ID resource can contain a significant number of object versions, which become unwieldy to manage in a single request/response pair. Without a mechanism to manage highly-versioned objects, effective transport is significantly limited.

varnerac commented 6 years ago

Date Added and IDs pose a particular issue for STIX Objects. There can be multiple versions of an object per ID/date added. There has to be a way to iterate over objects when the number of versions associated with a particular Object id exceeds the item limit of a client or server. For STIX objects, a concatenation of the object id + modified timestamp provides a deterministic way to iterate over all of the objects (and versions) in a Collection, no matter what its size.

varnerac commented 6 years ago

Iterating Across a Collection of Response Items

Scope

The goal of this implementation is to:

Limitations

The server can only indicate that more results may be available. The server cannot guarantee how many, if any, additional items will be available on the next request. In the case where:

The server will respond that more Collections may be available for the client. The client retrieves the first 10 Collections from the server. However, the 11th Collection could be deleted before the client makes an additional request. The response to the second request could return 1 Collection (no change), 10 more Collections, or none.

The maximum number of items returned in a response is the lesser of the client-specified and server-specified limits. If the client indicates it can receive up to 20 items and the server allows 15, a maximum of 15 items is returned.
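The limit rule above can be sketched in a few lines. The function name `effective_limit` is hypothetical, and the handling of a client that states no limit is an assumption the text does not cover.

```python
from typing import Optional


def effective_limit(client_limit: Optional[int], server_limit: int) -> int:
    """Return the page size to use: the lesser of the two limits."""
    if client_limit is None:
        # Assumption: a client that states no limit accepts the server's limit.
        return server_limit
    return min(client_limit, server_limit)
```

With the numbers from the text, `effective_limit(20, 15)` evaluates to 15.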

Recommendation

HATEOAS

Manage requests for additional endpoint items using Hypermedia As The Engine Of Application State (HATEOAS). Rather than specify how the client requests the additional data, the server provides a hyperlink (URL) to retrieve the data. For example:

Client
GET [api-root]/collections/
Server
{
  "collections": [
  {
    "id": "91a7b528-80eb-42ed-a74d-c6fbd5a26116",
    "title": "High Value Indicator Collection",
    "description": "This data collection is for collecting high value IOCs",
    "can_read": true,
    "can_write": false,
    "media_types": ["application/vnd.oasis.stix+json; version=2.0"]
  }, ...],
  "_links":
  {
    "next":
    {
      "href": "https://taxii.foo.com/collections?after=52892447-4d7e-4f70-b94d-d7f22742ff63"
    }
  }
}      
Client
GET [api-root]/collections?after=52892447-4d7e-4f70-b94d-d7f22742ff63
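On the client side, following the `next` link could look roughly like the sketch below. The `fetch` callable is a hypothetical stand-in for an HTTP GET that returns parsed JSON, and the canned `PAGES` responses with their short ids are invented for illustration. Only the `_links.next.href` shape comes from the example above.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_collections(fetch: Callable[[str], Dict], first_url: str) -> Iterator[Dict]:
    """Yield every collection, following server-supplied "next" links."""
    url: Optional[str] = first_url
    while url is not None:
        page = fetch(url)
        yield from page.get("collections", [])
        # The client never builds the next URL itself; it only follows the link.
        url = page.get("_links", {}).get("next", {}).get("href")


# Hypothetical canned responses standing in for a TAXII server.
PAGES = {
    "/collections/": {
        "collections": [{"id": "a"}, {"id": "b"}],
        "_links": {"next": {"href": "/collections?after=b"}},
    },
    "/collections?after=b": {"collections": [{"id": "c"}]},
}

collection_ids = [c["id"] for c in iter_collections(PAGES.__getitem__, "/collections/")]
```

Because the client treats the URL as opaque, the server is free to change how it encodes the position without breaking clients.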

Why not specify parameters?

The paginated endpoints (Collections, Manifests and Objects) identify unique items differently. Collections are simple: a unique item can be identified by the Collection ID. In TAXII servers that support object versioning, there is no single property that identifies a unique item on the Objects endpoint, because items are versions of objects. With versioning support, the Objects endpoint can uniquely identify items by concatenating the object id and modified properties. In TAXII servers without versioning support, the Object id by itself uniquely identifies an item on the Objects endpoint. Internally, TAXII servers may have an opaque unique identifier for each object version, depending on their database schema/strategy, such as an autogenerated primary key or a computed field.

By allowing the server to provide an opaque URL for (possible) additional items:

If we were to standardize an after token for versioned items, a value would likely look like:

indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b--2016-11-01T03:04:05.634Z
<object-id>--<modified>

A 73-character string is a verbose way to identify an item.
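For illustration, building and splitting such a token might look like the following sketch. The function names `make_after_token` and `split_after_token` are hypothetical; only the `<object-id>--<modified>` layout comes from the text above.

```python
from typing import Tuple


def make_after_token(object_id: str, modified: str) -> str:
    """Join an object id and its modified timestamp into one token."""
    return f"{object_id}--{modified}"


def split_after_token(token: str) -> Tuple[str, str]:
    """Recover (object_id, modified) from a token."""
    # The object id itself contains "--", so split on the last occurrence only.
    object_id, modified = token.rsplit("--", 1)
    return object_id, modified


token = make_after_token(
    "indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b",
    "2016-11-01T03:04:05.634Z",
)
```

Splitting from the right works here because the timestamp format contains no double hyphen.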

Problems with this Solution

The <api-root>/objects/ endpoint returns a STIX bundle. If we add a HATEOAS property to the bundle, we end up polluting STIX with TAXII implementation details. Another possibility is to use Link headers, as in the examples in section 3.5 of RFC 8288.
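A rough sketch of the Link header alternative, which would keep the next-page URL out of the STIX bundle. Both helper names are hypothetical, and the parser is deliberately simplified: it assumes each link carries a single quoted `rel` parameter and is not a complete RFC 8288 implementation.

```python
import re
from typing import Optional


def build_next_link_header(next_url: str) -> str:
    """Format a Link header value that points at the next page."""
    return f'<{next_url}>; rel="next"'


def parse_next_link(header_value: str) -> Optional[str]:
    """Return the target of the rel="next" link, or None if there is none."""
    for match in re.finditer(r'<([^>]*)>\s*;\s*rel="([^"]*)"', header_value):
        target, rel = match.groups()
        # A rel value may hold several space-separated relation types.
        if "next" in rel.split():
            return target
    return None


header = build_next_link_header(
    "https://taxii.foo.com/collections?after=52892447-4d7e-4f70-b94d-d7f22742ff63"
)
```

The server would send `header` as the value of a `Link` response header, and the client would request whatever `parse_next_link` returns.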

Implementation Examples

Collections

For server implementors with SQL databases, the initial Collections query with a 10 item limit would look like:

SELECT * FROM collections ORDER BY id LIMIT 10

The follow-on query derived from:

GET [api-root]/collections?after=52892447-4d7e-4f70-b94d-d7f22742ff63

could be:

SELECT * FROM collections WHERE id > '52892447-4d7e-4f70-b94d-d7f22742ff63' ORDER BY id LIMIT 10
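The two queries above can be exercised end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: the table contents and short ids are invented, `collections_page` is a hypothetical helper, and fetching one extra row to detect whether more results may exist is one possible design choice, not something the proposal specifies.

```python
import sqlite3
from typing import List, Optional, Tuple

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collections (id TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO collections VALUES (?, ?)",
    [(f"c{n:02d}", f"Collection {n}") for n in range(5)],
)


def collections_page(after: Optional[str], limit: int) -> Tuple[List[str], Optional[str]]:
    """Return one page of ids plus the token for the next page, if any."""
    if after is None:
        rows = conn.execute(
            "SELECT id FROM collections ORDER BY id LIMIT ?", (limit + 1,)
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT id FROM collections WHERE id > ? ORDER BY id LIMIT ?",
            (after, limit + 1),
        ).fetchall()
    ids = [row[0] for row in rows]
    # Fetching one extra row tells us whether more results may be available.
    has_more = len(ids) > limit
    ids = ids[:limit]
    return ids, (ids[-1] if has_more else None)
```

The second element of the result is what the server would embed in the `next` link; `None` means no link is emitted.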

Objects

If you store versions inside an objects table as enumerations:

SELECT * FROM objects ORDER BY date_added, id LIMIT 10
SELECT * FROM objects WHERE CONCAT(CONCAT(id, '--'), modified) > 'indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b--2016-11-01T03:04:05.634Z' ORDER BY date_added, CONCAT(CONCAT(id, '--'), modified) LIMIT 10

The date added has to be in the query above because the 2.0 spec requires objects come back in "date added" order. This is kind of weird, because the versions themselves may have been added later.
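Because the results are ordered by date added first, one way to keep paging consistent is to filter on the same columns the query sorts by. The sketch below is not the query proposed in this comment: it swaps the concatenated token for a row-value comparison on (date_added, id, modified), which SQLite supports from version 3.15. The schema, the shortened ids, and the `objects_page` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (id TEXT, modified TEXT, date_added TEXT, "
    "PRIMARY KEY (id, modified))"
)
rows = [
    # (id, modified, date_added): two versions of "a", added out of id order.
    ("indicator--b", "2016-01-01T00:00:00.000Z", "2018-01-01T00:00:00.000Z"),
    ("indicator--a", "2016-01-01T00:00:00.000Z", "2018-01-02T00:00:00.000Z"),
    ("indicator--a", "2017-01-01T00:00:00.000Z", "2018-01-02T00:00:00.000Z"),
    ("indicator--c", "2016-01-01T00:00:00.000Z", "2018-01-03T00:00:00.000Z"),
]
conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)


def objects_page(cursor, limit):
    """cursor is None or the (date_added, id, modified) of the last item seen."""
    if cursor is None:
        sql = ("SELECT date_added, id, modified FROM objects "
               "ORDER BY date_added, id, modified LIMIT ?")
        return conn.execute(sql, (limit,)).fetchall()
    sql = ("SELECT date_added, id, modified FROM objects "
           "WHERE (date_added, id, modified) > (?, ?, ?) "
           "ORDER BY date_added, id, modified LIMIT ?")
    return conn.execute(sql, (*cursor, limit)).fetchall()


# Walk the whole table two items at a time.
seen = []
cursor = None
while True:
    page = objects_page(cursor, 2)
    if not page:
        break
    seen.extend(page)
    cursor = page[-1]
```

Each object version is visited exactly once, in date-added order, even though the ids are not in that order.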

If you store object versions in their own table and they have their own database-maintained integer primary key (versions.id):

GET [api-root]/collections/52892447-4d7e-4f70-b94d-d7f22742ff63/objects?after=8675309

SELECT * FROM versions WHERE id > 8675309 ORDER BY id LIMIT 10
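A sketch of this integer-key variant, again with sqlite3. It assumes a `versions` table with an autoincrementing `id`, made-up object ids, and a batch of 25 rows inserted in one transaction with an identical date added, to show that the cursor still advances.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE versions ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, object_id TEXT, date_added TEXT)"
)
# A single transaction where every row shares the same date_added value.
with conn:
    conn.executemany(
        "INSERT INTO versions (object_id, date_added) VALUES (?, ?)",
        [(f"indicator--{n:04d}", "2018-06-01T00:00:00.000Z") for n in range(25)],
    )


def versions_page(after, limit):
    """Return up to `limit` rows whose integer key is greater than `after`."""
    return conn.execute(
        "SELECT id, object_id FROM versions WHERE id > ? ORDER BY id LIMIT ?",
        (after, limit),
    ).fetchall()


after = 0  # AUTOINCREMENT ids start at 1, so 0 means "from the beginning"
pages = []
while True:
    page = versions_page(after, 10)
    if not page:
        break
    pages.append(page)
    after = page[-1][0]
```

The database-maintained key acts as the monotonically increasing counter, so identical timestamps within a batch do not stall the iteration.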

If your TAXII server doesn't support versions, you can just use Object ids without having to accept a modified parameter that will never be used:

GET [api-root]/collections/52892447-4d7e-4f70-b94d-d7f22742ff63/objects?after=indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b

SELECT * FROM objects WHERE id > 'indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b' ORDER BY date_added, id LIMIT 10

For NoSQL databases, you may add a custom version_id field for use by a secondary index.

Sponsorship

NineFX will implement this on the server side if folks agree it's worth testing.

jordan2175 commented 5 years ago

We have added support for pagination in TAXII 2.1. This does not solve the problem of a system adding a million records in a single database transaction where that transaction uses the same date added value for each entry (and the TAXII server limits the number of results per page to something much less than 1 million records). But this may just be an implementation-specific problem that an individual vendor would need to solve.