interagent / http-api-design

HTTP API design guide extracted from work on the Heroku Platform API
https://geemus.gitbooks.io/http-api-design/content/en/
Other
13.69k stars 1.07k forks source link

Pagination using Range cannot be consistent #87

Closed KellerFuchs closed 8 years ago

KellerFuchs commented 8 years ago

Section 1.6 and the Heroku API reference both recommend using Range for pagination.

Using the Range/Next-Range mechanism from HTTP accidentally exposes in the API that the data is (likely) obtained by computing the requested data within a selected window; for instance, for a query that exposes data from a SQL database:

SELECT * FROM table WHERE condition ORDER BY field LIMIT (#page * size), size

Moreover, this implementation pattern cannot provide a consistent view of the data when concurrent actions can introduce new elements at arbitrary indexes. Consider the following sequence of events:

  1. client requests the data, gets a partial answer with the first 10 elements and Next-Range: 20 ..; let's call the elements e0 to e9;
  2. a concurrent query triggers the insertion of elements x and y at indexes 9 and 14;
  3. the client requests the next page, and receives e10 = e9, e11 ... e19: e9 has been seen twice by the client (unless you use id-based pagination), and it saw y but not x (despite them having been added simultaneously, regardless of whether you use id-based pagination).

The right solution is to provide the API client with a consistent view of the data; for APIs returning results of DB queries, this is easily done using either cursors or materialized views, the second having the additional advantage of supporting later refinement queries.

At the API level, this should be materialized using the Link header, with (at least) a reference tagged rel=next. Using Link allows the API implementer to store there whatever is required to designate the right query result (for instance, a DB cursor id).

KellerFuchs commented 8 years ago

PS: This is related, but not identical, to #81. Examples of using Link for pagination can be found in Github's API, although they do not provide consistent answers as far as I can tell.

brandur commented 8 years ago

Hey! This isn't too obvious from the source material, but the Range pagination is meant to use resource identifiers in exactly the same way as you're suggesting with the Link header, so the only difference compared to what you're suggesting is cosmetic (i.e. using Range instead of Link).

So for example, you'd get back a range that looks like:

Next-Range: id resource_key_123...

When you feed that into an API, it should translate to something that looks roughly like:

SELECT * FROM table WHERE condition AND id >= 'resource_key_123' ORDER BY field LIMIT size

Aside from being unstable as you suggest, you should also never use offset-based pagination because its performance degrades the further you page. This Postgres presentation is still the best description of that problem out there. At Stripe, we actually had to migrate from offset-based to identifier-based pagination, and trust me that it has been a long and painful process :)

I'm going to close this out for now, but let me know if you have any other questions/comments!

KellerFuchs commented 8 years ago

@brandur I'm aware of id-based pagination, but it still doesn't solve the issue I mentioned. Of course, you can serialize arbitrary things in the Content-Range header, but I'm unsure that serializing, for instance, a DB cursor there wouldn't violate HTTP/1.1 semantics.

PS: The example I initially gave works for both id-based pagination and index-based pagination. This was on purpose.

brandur commented 8 years ago

@KellerFuchs In your example above, you said:

the client requests the next page, and receives e10 = e9, e11 ... e19: e9 has been seen twice by the client (unless you use id-based pagination), and it saw y but not x (despite them having been added simultaneously, regardless of whether you use id-based pagination).

Paginated by IDs does solve the problem of a client seeing duplicates. The API has told the client that the next page starts at e10, and therefore e9 will never be seen again, even if there have been subsequent insertions.

There is definitely a more general problem in that it's very difficult for a client to paginate an entire list and be sure that it's received every item because of the possibility of subsequent insertions. You could try to use your database's isolation to keep a consistent snapshot as a client iterates, but even there you still have a few problems:

  1. Although you've guaranteed consistency during a single iteration, items could still have been inserted into the middle of the list, and the client will have to re-iterate it again in its entirely to see them.
  2. Snapshots can be quite expensive for a database to hold open. Especially when they can be expected to be very long-lived.

One fairly easy solution might be to use atomically incremented IDs to in effect make the list append-only.

KellerFuchs commented 8 years ago

Yes, as I mentioned id-based pagination and append-only is sufficient to get an atomic view of the result (which might be a more recent one than the one at the time you start iterating, though). Unfortunately, it's basically impossible to do this when returning sorted results.

Yes, I agree that breaking consistency (or isolation, in DB-speak) is a solution in some cases, but being utterly silent about this issue in a “best practices” guide seems shoddy.

geemus commented 8 years ago

Yeah, the state of this info in the guide remains unsatisfactory for me as well. Unfortunately there are quite a lot of different ways to do this and it depends a lot on the use cases and data stores involved. As such it's been rather tricky to pin down anything that looks particularly best-practice-esque.

One option to improve the existing usage might be to add data to the next-range which would allow the server to figure out if things have mutated. This would normally fall to If-Match with etags or If-Not-Modified with a timestamp (which is unsatisfactory due to lack of granularity which can end up in conflicts with fast operations). That said, the general expectation is that you would just take the Next-Range value and pass it along (not figure out what other value to take from there and put in a different header). So perhaps something like a etag: foo optional value in there could be used to enforce in terms of being able to bomb out if things changed (at which point you could start from the beginning of the list).

All that being said, for many use cases (ie displaying a UI), eventual consistency is good enough, with a slightly out of date list not really being too problematic.

KellerFuchs commented 8 years ago

it depends a lot on the use cases and data stores involved. As such it's been rather tricky to pin down anything that looks particularly best-practice-esque.

Hence why I suggested using Link: it's opaque, so the developer can do whatever is appropriate for their application without risking to break HTTP semantics (or client expectations).

geemus commented 8 years ago

Fair point. We have largely expected Next-Range to be opaque as well, which has perhaps mitigated us running into some of these issues in practice, though the definition certainly doesn't enforce this.