lidel opened 2 years ago
If not, we would break our users, and need to coordinate go-ipfs/ipfs-webui/ipfs-desktop and Pinata to make the change around the same time to limit the damage. TBD if this is acceptable, considering the performance benefits in the long run.
Can we not just add support for a v2 (and maybe a versions) endpoint and avoid breaking changes here? IIRC when it was previously discussed that we wouldn't use libp2p and leverage multistream for protocol negotiation, but instead use HTTP, the claim was that having version negotiation would be similarly easy.
Yes, if we ever do this, then we would introduce this breaking change as v2 + versioning scheme.
I've updated the title of this issue + created topic/v2 for changes like this one.
So far this proposal got no feedback from any pinning service – let's park this until there is interest and/or we gather more items under topic/v2
Besides being expensive, looking at the count while paginating has problematic edge cases, e.g. pins being added/deleted while paginating.
If we're going to make a breaking change, we should consider using an opaque pagination token instead of the `created` timestamp, which makes fewer assumptions about the capabilities of the underlying DB and can include arbitrary info to help the DB paginate better.
@lidel just now seeing this and definitely supporting this change. Count is super expensive for larger pinsets and can cause a lot of issues at scale.
I'm in favor of moving towards a created timestamp paradigm here. We had to switch our normal pinList endpoint to using this paradigm as the previous method which required counting was becoming extremely expensive for larger users.
@lidel I would also add here that getting rid of count would allow us to raise our rate limits, which is something I know the PL team / ipfs-desktop users have been asking for on this specific endpoint
I think we should remove `count` and use an opaque token like `nextToken`. Gus said it best:

> If we're going to make a breaking change, we should consider using an opaque pagination token instead of the `created` timestamp, which [the pagination token] makes fewer assumptions about the capabilities of the underlying DB and can include arbitrary info to help the DB paginate better.
We could pretty easily implement things right away if we did time based pagination, as we already switched the rest of our application over to doing this, but I'm guessing opaque token pagination might be a lot bigger of a lift.
@SgtPooki @guseggert could you both elaborate on what you have in mind here? Any example articles you could point to?
@obo20 absolutely. First here's a quick summary in my own words:
I think of pagination tokens like `nextToken` as a sort of ephemeral API key for a search/query. Basically, it's a pointer (see: cursor in cursor-based pagination) to a certain record. You could map this back to a time-based token if that's how your service is implemented. When you receive a request with a `nextToken`, you know exactly where to start searching, instead of searching/filtering through all records on every query.
Those records can be sorted/filtered however you wish, and you can encode any sorting/filtering/page/user state within that token. This flexibility is great, but it does come with drawbacks that I personally experienced (and complained about plenty) during my time at Amazon. For service owners, though, it gives you a lot more control over how your resources are consumed, no matter how many consumers or objects you have.
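To make the "opaque blob that maps back to a record" idea concrete, here is a minimal sketch. The field names (`created`, `id`) and function names are illustrative assumptions, not part of the pinning-service spec, and a real service would also sign or encrypt the token so clients can't reverse-engineer it:

```python
import base64
import json

def make_token(last_created: str, last_id: str) -> str:
    """Encode cursor state (e.g. a timestamp plus a record id) into an
    opaque blob the client just echoes back on the next request."""
    state = {"created": last_created, "id": last_id}
    return base64.urlsafe_b64encode(json.dumps(state).encode()).decode()

def read_token(token: str) -> dict:
    """Server side: recover the cursor state to know exactly where to
    resume the scan, without re-filtering all records."""
    return json.loads(base64.urlsafe_b64decode(token))
```

A service backed by time-based pagination could put only a timestamp in the state dict; the client can't tell the difference, which is the point.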
I believe @guseggert implemented more than a few APIs at Amazon where we were required to use this pagination strategy when DynamoDB was our backend (which was most of the time), so he can correct me or speak further to the details if you need more info from someone with more experience.
Below are some great articles I've found that should help the discussion here for anyone interested. The writeup by Slack is a fantastic read that goes into the pros and cons of many approaches.
Not as good as the others, but informative in a basic sense. Very specific to apollographql
> Since numeric offsets within paginated lists can be unreliable, a common improvement is to identify the beginning of a page using some unique identifier that belongs to each element of the list.
>
> If the list represents a set of elements without duplicates, this identifier could simply be the unique ID of each object, allowing additional pages to be requested using the ID of the last object in the list, together with some limit argument. With this approach, the requested cursor ID should not appear in the new page, since it identifies the item just before the beginning of the page.
--- https://www.apollographql.com/docs/react/pagination/cursor-based/
Great writeup covering unidirectional and bidirectional pagination from the view of an engineer.
> For unidirectional pagination, my preferred approach is to use a simple cursor. The important detail here is to make the cursor meaningless.
>
> As far as the client is concerned, it's just a blob the server returns in the response when there are more results to fetch. The client shouldn't be able to derive any implementation details from it, and the only thing it can afford to do with this cursor is to send it along in the next request.
--- https://hackernoon.com/guys-were-doing-pagination-wrong-f6c18a91b232
An extremely thorough writeup of Slack's considerations when upgrading their pagination API.
> Cursor-based pagination works by returning a pointer to a specific item in the dataset. On subsequent requests, the server returns results after the given pointer. This method addresses the drawbacks of using offset pagination, but does so by making certain trade offs:
>
> - The cursor must be based on a unique, sequential column (or columns) in the source table.
> - There is no concept of the total number of pages or results in the set.
> - The client can't jump to a specific page.
--- https://slack.engineering/evolving-api-pagination-at-slack/
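To make the cursor scheme from these articles concrete, here is a minimal sketch (names and data are made up for illustration) of paging through a list sorted by a unique id, where each page starts just after the cursor's record, so the cursor item itself never reappears:

```python
def page_after(records, cursor_id, limit):
    """Return up to `limit` records that come after the record whose unique
    "id" equals `cursor_id`; cursor_id=None means start from the beginning.
    `records` must be sorted by that unique "id" field."""
    if cursor_id is None:
        return records[:limit]
    # Locate the cursor's position, then slice just after it.
    idx = next(i for i, r in enumerate(records) if r["id"] == cursor_id)
    return records[idx + 1 : idx + 1 + limit]
```

In a real database this lookup would be a `WHERE id > cursor` query over an indexed column rather than a linear scan, which is exactly why Slack's writeup requires the cursor column to be unique and sequential.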
At a high level, you can encode whatever state you need to fetch the next page in the pagination token (you might want to encrypt it too, so clients can't reverse it). E.g. for a relational database, you can encode a JSON object like `{"limit":20,"offset": 100}`; for key-value stores that paginate with their own tokens, use that token. Some folks need to add extra data like a timestamp, so that the token will eventually expire.
Concretely, `GET /pins` would return something like:

```json
{
  "nextToken": "eyJsaW1pdCI6MjAsIm9mZnNldCI6IDEwMH0K",
  "results": [...]
}
```
and then the subsequent request would be `GET /pins?nextToken=eyJsaW1pdCI6MjAsIm9mZnNldCI6IDEwMH0K`; repeat until you decide to stop or `nextToken` is null.
The downside from a client's perspective is that you can't parallelize queries. That's an upside from a server's perspective, and generally a desirable thing for a multitenant API. But I'm not sure how this would impact existing impls.
Essentially though, I think we should decouple the query params / filters from the pagination mechanism, so that we don't need to make assumptions about the query capabilities of the DB in order to paginate, and the most general way that I know of to do that is with an opaque token.
@SgtPooki @lidel did we want to make official the decisions from ipfs-thing?
iirc we were discussing backward-compatible ways of removing the cost of returning an exact `count` value, to avoid breaking existing clients.
Removing or changing the type of `count` will break existing clients which use it for pagination (docs). However, if we keep it but relax the requirement around its value to something like "equal to `results`, or `results+1` (or MAX_INT?) to indicate there are more records", then you no longer need to pay the cost of calculating the exact size.
Next steps here (I have no bandwidth for this, but happy to review PRs):

- Relax the spec to allow returning an inexact `count` (e.g. `results+1`) when the cost of calculating the real one is too high.
- Document that the `count` value is no longer exact.
`PinResults.count` is used for two things:

1. Reading pin count per status (used in `ipfs pin remote service ls --stat`)
2. Pagination (where it acts as an equivalent of `more: true`)

**Problem**

Calculating `PinResults.count` is expensive, especially when running expensive filters like `cid`, `name` or `status`, e.g. `GET /pins?cid=Qm1.Qm2...Qm10`.

**Proposed change**

- Remove `PinResults.count`
- Add `PinResults.more` (true/false bool)
- Add `/service/pins/healthcheck` that allows an inexpensive health check
- Add `/service/pins/stats` that returns total pin counts for each status

This would be a breaking change that requires v2.x.x of this spec and a coordinated effort across the stack: go-ipfs, js-ipfs, ipfs-webui, ipfs-desktop and Pinata.
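Under this proposal, a `GET /pins` response might look like the following sketch (only `more` replacing `count` comes from the proposal; the rest mirrors the current response shape):

```json
{
  "more": true,
  "results": [...]
}
```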
Is it really breaking? Needs analysis.
I've been thinking a bit about a way to do this in a backward-compatible way: we could keep `PinResults.count` as an optional field and always return `1`, so that pagination and health checks in old clients still work.

:point_right: Before we consider doing this, the main question is how the old client in go-ipfs will react to an unexpected `PinResults.more` – if it is ignored and nothing crashes, then we have a better path forward. If not, we would break our users, and need to coordinate go-ipfs/ipfs-webui/ipfs-desktop and Pinata to make the change around the same time to limit the damage. TBD if this is acceptable, considering the performance benefits in the long run.
cc @obo20 @aschmahmann