Closed KellerFuchs closed 8 years ago
PS: This is related, but not identical, to #81.
Examples of using Link
for pagination can be found in Github's API, although they do not provide consistent answers as far as I can tell.
Hey! This isn't too obvious from the source material, but the Range
pagination is meant to use resource identifiers in exactly the same way as you're suggesting with the Link
header, so the only difference compared to what you're suggesting is cosmetic (i.e. using Range
instead of Link
).
So for example, you'd get back a range that looks like:
Next-Range: id resource_key_123...
When you feed that into an API, it should translate to something that looks roughly like:
SELECT * FROM table WHERE condition AND id >= 'resource_key_123' ORDER BY field LIMIT size
Aside from being unstable as you suggest, you should also never use offset-based pagination because its performance degrades the further you page. This Postgres presentation is still the best description of that problem out there. At Stripe, we actually had to migrate from offset-based to identifier-based pagination, and trust me that it has been a long and painful process :)
I'm going to close this out for now, but let me know if you have any other questions/comments!
@brandur I'm aware of id-based pagination, but it still doesn't solve the issue I mentioned.
Of course, you can serialize arbitrary things in the Content-Range
header, but I'm unsure that serializing, for instance, a DB cursor there wouldn't violate HTTP/1.1 semantics.
PS: The example I initially gave works for both id-based pagination and index-based pagination. This was on purpose.
@KellerFuchs In your example above, you said:
the client requests the next page, and receives e10 = e9, e11 ... e19: e9 has been seen twice by the client (unless you use id-based pagination), and it saw y but not x (despite them having been added simultaneously, regardless of whether you use id-based pagination).
Paginated by IDs does solve the problem of a client seeing duplicates. The API has told the client that the next page starts at e10
, and therefore e9
will never be seen again, even if there have been subsequent insertions.
There is definitely a more general problem in that it's very difficult for a client to paginate an entire list and be sure that it's received every item because of the possibility of subsequent insertions. You could try to use your database's isolation to keep a consistent snapshot as a client iterates, but even there you still have a few problems:
One fairly easy solution might be to use atomically incremented IDs to in effect make the list append-only.
Yes, as I mentioned id-based pagination and append-only is sufficient to get an atomic view of the result (which might be a more recent one than the one at the time you start iterating, though). Unfortunately, it's basically impossible to do this when returning sorted results.
Yes, I agree that breaking consistency (or isolation, in DB-speak) is a solution in some cases, but being utterly silent about this issue in a “best practices” guide seems shoddy.
Yeah, the state of this info in the guide remains unsatisfactory for me as well. Unfortunately there are quite a lot of different ways to do this and it depends a lot on the use cases and data stores involved. As such it's been rather tricky to pin down anything that looks particularly best-practice-esque.
One option to improve the existing usage might be to add data to the next-range which would allow the server to figure out if things have mutated. This would normally fall to If-Match
with etags or If-Not-Modified
with a timestamp (which is unsatisfactory due to lack of granularity which can end up in conflicts with fast operations). That said, the general expectation is that you would just take the Next-Range
value and pass it along (not figure out what other value to take from there and put in a different header). So perhaps something like a etag: foo
optional value in there could be used to enforce in terms of being able to bomb out if things changed (at which point you could start from the beginning of the list).
All that being said, for many use cases (ie displaying a UI), eventual consistency is good enough, with a slightly out of date list not really being too problematic.
it depends a lot on the use cases and data stores involved. As such it's been rather tricky to pin down anything that looks particularly best-practice-esque.
Hence why I suggested using Link
: it's opaque, so the developer can do whatever is appropriate for their application without risking to break HTTP semantics (or client expectations).
Fair point. We have largely expected Next-Range
to be opaque as well, which has perhaps mitigated us running into some of these issues in practice, though the definition certainly doesn't enforce this.
Section 1.6 and the Heroku API reference both recommend using
Range
for pagination.Using the
Range
/Next-Range
mechanism from HTTP accidentally exposes in the API that the data is (likely) obtained by computing the requested data within a selected window; for instance, for a query that exposes data from a SQL database:Moreover, this implementation pattern cannot provide a consistent view of the data when concurrent actions can introduce new elements at arbitrary indexes. Consider the following sequence of events:
Next-Range: 20 ..
; let's call the elementse0
toe9
;x
andy
at indexes 9 and 14;e10 = e9
,e11
...e19
:e9
has been seen twice by the client (unless you use id-based pagination), and it sawy
but notx
(despite them having been added simultaneously, regardless of whether you use id-based pagination).The right solution is to provide the API client with a consistent view of the data; for APIs returning results of DB queries, this is easily done using either cursors or materialized views, the second having the additional advantage of supporting later refinement queries.
At the API level, this should be materialized using the
Link
header, with (at least) a reference taggedrel=next
. UsingLink
allows the API implementer to store there whatever is required to designate the right query result (for instance, a DB cursor id).