lidel opened 2 years ago
Listing 50M pins is going to suck in every model, pagination or not. Cluster REST API switched to streaming pins on such requests, to avoid building up results on memory. If the pinning service API allows limit=50M it will likely use a lot of memory on the backend while building the json response (unless it encodes on the fly and crosses fingers for no errors to happen). If it allows limit=100, then dealing with such huge pinset will cost thousands of requests (but at least they can be rate-limited etc). Streaming 50M items is also not fun.
I'm not sure what the best approach is for the Pinning Service API spec. If the pinset is small enough, sure, we can implement pagination and everything. But it sucks that /pins stops working if the pinset gets to a certain size. If the pinset is very big, I still need to construct the answer in memory, which also sucks, with or without pagination.
In general, /pins, without an SQL-like backend that can offer good indexing and sorting, is going to suck as soon as it gets big. But that is probably not a problem of the pinning svc api spec, but of the implementation?
> What is the current behavior of ipfs-cluster around `GET /pins`, filtering and pagination?
To be concrete: filtering is done, limit is done, pagination is not done, and, in general, the /pins endpoint is not apt for big pinsets, as it will balloon the backend's memory usage.
I ran into this same issue when writing a pinning service backed by a key-value store instead of a relational database (DynamoDB, which does have some limited ability to sort and filter keys with secondary indexes).
The biggest problem I ran into is requiring "count" in responses, which is discussed here: https://github.com/ipfs/pinning-services-api-spec/issues/86.
The second problem I ran into is that the query parameters are really complex. To fully support all the variations of sorting and filters, while still returning dense results (which is a requirement for many different kinds of queries, like finding pins with certain statuses, finding pins by CID, pins by name, etc.), is hard to do in a highly-available way, even with a relational DB. If we were to overhaul the API, I'd advocate for removing many of the query params (consider removing filtering by metadata, pick a case sensitivity and be opinionated about it, only support "exact" name matches, only accept a single CID instead of a set, etc.).
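The pared-down parameter set suggested above might look something like this (a sketch only; all field names here are hypothetical, not the spec's):

```go
package main

import "fmt"

// ListPinsParams sketches a simplified set of query parameters:
// a single CID instead of a set, exact case-sensitive name match
// only, and no metadata filtering. This keeps every supported
// query answerable with a simple index lookup in a KV store.
type ListPinsParams struct {
	Cid    string // one CID, not a comma-separated set
	Name   string // exact match only, one fixed case sensitivity
	Status string // e.g. "pinned"
	Limit  int
	// intentionally absent: before/after, meta, match strategies
}

func main() {
	p := ListPinsParams{Cid: "QmExample", Status: "pinned", Limit: 10}
	fmt.Printf("%+v\n", p)
}
```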
I agree with Gus here. Including count becomes unreasonable when scaling, and many of the supported pagination and query parameters make fetching pins flexible for the consumers, but extremely difficult for providers.
A better model would be something similar to DynamoDB's limited response size, with on-the-spot pagination keys (nextToken etc.), then allowing consumers/mid-tier services to filter the received data.
You can read more about how DynamoDB works at https://www.dynamodbguide.com/the-dynamo-paper/
Extracted from https://github.com/ipfs-shipyard/pinning-service-compliance/issues/118#issuecomment-1160143731:
I'd like to at the very least update the Pagination and filtering section to loosen up requirements and provide some rules of thumb for service implementations backed by key-value stores.
@hsanjuan @SgtPooki What is the current behavior of ipfs-cluster around `GET /pins`, filtering and pagination? What would be the best compromise we should document?

Some ideas how to handle "sorting and filtering becomes too expensive" scenarios:

- `GET /pins` always returns `405 Method Not Allowed`
- `before` and `after` filters are not supported (they produce `405 Method Not Allowed`), and `GET /pins` returns pins in random order

Are there better ways?