Proposal:
- Change pagination to only use the "date added" concept
- Specify that if no date filters are provided, then the most current records are returned
- Specify that pagination can happen on the object-id URL
- Add X-Headers to give optional counts of collection size and number of records requested / returned:
  - X-TAXII-Collection-Count (Optional)
  - X-TAXII-Request-Limit (Optional)
  - X-TAXII-Response-Limit (Required)
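As a rough sketch of what a client might see under this proposal: the header names follow the list above, while the endpoint URL, collection id, and parameter values are placeholders for illustration only, not spec text.

```python
import requests

# Hypothetical objects request using the proposed date-based pagination.
# URL, collection id, and added_after value are made up for this sketch.
resp = requests.get(
    "https://taxii.example.com/api1/collections/91a7b528/objects/",
    params={"added_after": "2016-01-01T00:00:00Z"},
)

# Proposed X-Headers: total collection size and requested / returned record counts.
collection_count = resp.headers.get("X-TAXII-Collection-Count")  # optional
request_limit = resp.headers.get("X-TAXII-Request-Limit")        # optional
response_limit = resp.headers.get("X-TAXII-Response-Limit")      # required under the proposal

print(collection_count, request_limit, response_limit)
```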
We talked about this at the F2F, and the consensus in the room was to look for a solution that could be used to fix this. One proposal is to sort by UUID and add an id_after URL parameter to paginate through the results. One concern with date-based pagination is that when millions of records are added in a single transaction, they all have the exact same date_added value. As far as getting rid of items-based pagination, this is still open for discussion. It is not ideal to have items as well as this new concept. We did talk about deleting the items-based pagination and also just marking it "deprecated".
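For illustration only (the id_after parameter name comes from the proposal above; the base URL and UUID are invented), such a request might look like:

```python
import requests

# Resume paging from the last object id the client saw; sorting by UUID gives a
# stable cursor even when many records share the same date_added value.
resp = requests.get(
    "https://taxii.example.com/api1/collections/91a7b528/objects/",  # placeholder URL
    params={"id_after": "6ba7b810-9dad-11d1-80b4-00c04fd430c8"},     # last UUID seen
)
```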
When a resource changes faster than a client can page through it, pagination becomes unpredictable.
For instance, imagine a client that pages through a result set at a rate of one page per second. Each second, the server adds one page of content to the head of the resource and removes one page of content from the tail. The client will effectively receive half of the dataset when paging through, while thinking it received the whole resource.
| Time | Resource State | Client Requests Page | Client Receives |
|---|---|---|---|
| 0 | Page 0: AAAA, Page 1: BBBB, Page 2: CCCC | 0 | AAAA |
| 1 | Page 0: BBBB, Page 1: CCCC, Page 2: DDDD | 1 | CCCC |
| 2 | Page 0: CCCC, Page 1: DDDD, Page 2: EEEE | 2 | EEEE |
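A small self-contained sketch of the same race (page size, data, and timing invented for illustration) shows how page/offset-based paging silently skips records when the window slides:

```python
# Simulate page-based paging against a resource whose window slides each step.
pages = [
    ["AAAA", "BBBB", "CCCC"],  # t=0: pages 0..2
    ["BBBB", "CCCC", "DDDD"],  # t=1: head advanced, tail dropped
    ["CCCC", "DDDD", "EEEE"],  # t=2
]

received = []
for t, resource in enumerate(pages):
    received.append(resource[t])  # client asks for page t at time t

print(received)  # ['AAAA', 'CCCC', 'EEEE'] -- BBBB and DDDD are never seen
```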
Any solution that involves redesigning paging will effectively destroy any value that the TAXII 2.0 specification has. In other words, there won't be any reason for people to implement it. Also, these are things that were raised and known when TAXII 2.0 was published.
For the reasons above, I strongly prefer finding a backward-compatible solution, if anything is to be done at all.
We talked on Slack this afternoon about how best to deal with this change, as discussed at the F2F and over email. Mark Davidson and Andrew Varner suggested that it would be best to remove the pagination section from 2.1 CSD01 and add it back once we figure out how it should work and have proven that it does work. This eliminates the risk of trying to quickly rush another solution out the door. I will pull this section out of the document, after accepting all of the changes that have been proposed, and move it to the playground document. This way we can add it back once we figure out how it should work.
In addition to removing references to pagination throughout the document, I removed the following normative statement from section 3.1
TAXII responses from Endpoints that support pagination and include items as a requested range unit MUST comply with the normative requirements in section 3.4 and MUST respond with an appropriate 200, 206, or 416 response per the HTTP specification [RFC7233].
Removed the Accept-Ranges, Range, and Content-Range rows from the table in section 3.2
Removed the following sentence from section 3.3
Pagination requires a consistent sort order, and therefore multiple responses from the same endpoint need to be sorted in a consistent manner.
Removed this paragraph from section 5.3
To support requesting a large number of objects, the Endpoint supports pagination as defined in section 3.4. Clients can optionally provide their desired number of items per page and which page they want and servers will return that result set. Servers can also override client-provided pagination parameters, including requiring pagination when it isn't requested. As such, all clients should be aware that responses to this Endpoint may be paginated and be prepared to properly handle that.
I have some concerns with how we designed pagination in TAXII 2.0. The whole range aspect, which is basically a limit/offset design, only works well if your collections are small. Once your collections get to a certain size, regardless of the database you use (MySQL, Postgres, Oracle, or even a NoSQL store), the performance impact is huge and the wheels come off the bus. Apparently this is a very well-known issue with REST APIs, and large web 2.0 companies basically say never to use range (limit/offset) for REST APIs.
Now there are several ways you can try to get around this problem (like caching data, using pages, using cursors, using result tables and doing in-memory ranges), but none of them are very good and they represent a lot of unnecessary hackery. After doing a lot more reading and trying to implement this in various ways with various database backends, I am thinking that there is a better way. This solution would be pretty simple to implement and would greatly improve performance.
I propose that we drop our pagination design and just use added_after with some limit value. This would represent very little change to the overall architecture, other than dropping some sections and rewording a few normative statements. From a code standpoint, it would be MUCH easier to implement.
A client could then just say, "Server give me all records after 2016". The server could say "Hey client, you can have 100 records from 2016-01-01 and the last record I am sending you is 2016-03-01". The server could also optionally tell the client that there are 20,000,000 records in the collection.
If the client did not give an added_after filter, the server could just give the client the latest records that were written to the collection, up to the limit size that the client and server both support.
From a performance standpoint, this works a lot better and results in a lot less latency.
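A minimal server-side sketch of that exchange, assuming an added_after cursor and a server-side cap (the cap value, the toy collection, and the function name are invented for illustration):

```python
from datetime import datetime, timedelta, timezone

SERVER_MAX_LIMIT = 100  # assumed server-side cap; not a value from the spec

# Toy collection: one record per day during 2016, sorted by date_added.
start = datetime(2016, 1, 1, tzinfo=timezone.utc)
collection = [{"id": i, "date_added": start + timedelta(days=i)} for i in range(366)]

def page_after(added_after, client_limit=None):
    """Return up to `limit` records added after `added_after`, plus the
    date_added of the last record so the client knows where to resume."""
    # Whichever limit is smaller wins (see the use cases below).
    limit = min(client_limit, SERVER_MAX_LIMIT) if client_limit else SERVER_MAX_LIMIT
    matching = [r for r in collection if r["date_added"] > added_after]
    page = matching[:limit]
    last = page[-1]["date_added"] if page else added_after
    return page, last, len(collection)

# "Server, give me all records after 2016-01-01" -> 100 records, a resume
# timestamp of 2016-04-10, and the total collection size (366 here).
page, last, total = page_after(start)
print(len(page), last.date(), total)
```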
More Details
The concept of items (what we have today) is only valid within a given instance of a single request. There is no guarantee that the server will have the same data associated with the same item numbers for a subsequent query. The server could add or delete data between queries, and the client will either get redundant data or miss data.
About the best the server could do is hold a results table for a given TTL and then expire it. So if the client does not make requests within that TTL, all bets are off. This solution would increase complexity and burden for no apparent value. A client will never be able to know the item numbers associated with data without asking the server. Meaning, if the client asked for items 1,991,124,214 - 1,991,124,800, there is no predictability in what it will get. About all the client can know is when it last queried the collection and what data the server sent.
So let's look at the use cases of a client interacting with the server:
Step 1: Has the client previously talked to the collection?

- If yes: The client will know the timestamp of its last query and can use that information to gather what has changed. The client will make a request with an added_after URL parameter, and the server will respond with data up to the limit that the server supports. The client could provide a limit that it will accept; if it is greater than the server limit, the server limit wins, and if it is smaller than the server limit, the client limit wins. The server will also send the ending timestamp for the data it is delivering, and the client can then use this to get additional data.
- If no: The client will just pull the collection and get the most current data. The server will be able to tell the client the beginning and ending timestamps for the data it sent, and it should also be able to tell the client how many total records there are in the collection. The client can use this information to gather additional data in the future by using the added_after URL parameter.
The client can ask for additional historical data by using an added_before URL parameter.
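A minimal client-side sketch of the flow in Step 1, assuming the server honors added_after and reports the ending timestamp of the data it returned; the endpoint URL, client limit, response shape, and the X-TAXII-End-Timestamp header name are assumptions for illustration, not spec text.

```python
import requests

OBJECTS_URL = "https://taxii.example.com/api1/collections/91a7b528/objects/"  # placeholder
CLIENT_LIMIT = 500  # limit the client is willing to accept; the server may return fewer

def pull_changes(last_query_time=None):
    """Pull everything added after `last_query_time` (or the most current data
    if the client has never talked to this collection), resuming from the
    ending timestamp the server reports until no new data comes back."""
    cursor = last_query_time
    while True:
        params = {"limit": CLIENT_LIMIT}
        if cursor:
            params["added_after"] = cursor
        resp = requests.get(OBJECTS_URL, params=params)
        resp.raise_for_status()
        objects = resp.json().get("objects", [])
        if not objects:
            return cursor
        for obj in objects:
            print(obj["id"])  # hand off to real processing here
        end = resp.headers.get("X-TAXII-End-Timestamp")  # assumed header name
        if end is None or end == cursor:
            return end or cursor
        cursor = end
```

Walking backwards through history would look the same, just swapping added_after for added_before and moving the cursor toward older timestamps.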