ipfs / pinning-services-api-spec

Standalone, vendor-agnostic Pinning Service API for IPFS ecosystem
https://ipfs.github.io/pinning-services-api-spec/
Creative Commons Zero v1.0 Universal

Document caveats around key-value pinset stores #97

Open lidel opened 2 years ago

lidel commented 2 years ago

Extracted from https://github.com/ipfs-shipyard/pinning-service-compliance/issues/118#issuecomment-1160143731:

Also, is it impossible for IPFS Cluster to support pagination/creation-date sorting, or is it something that hasn't been implemented yet? Is there a tracking issue for this?

It is impractical. Cluster does not have a relational-database backend for storing the pins, just a key-value store. Keys don't have sorted IDs, and listing keys from this store can return them in arbitrary order. Thus, features like pagination cannot be implemented without reading everything into memory, sorting, etc., which is a footgun for big pinsets. I think it is OK if Cluster does not support pagination. It tries to do its best, and it's quite OK that it supports everything else.
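To make the footgun concrete, here is a minimal sketch (types and names are illustrative, not from the cluster codebase) of what date-sorted, offset-based pagination forces on a store that returns keys in arbitrary order: a full scan plus an in-memory sort before any page can be served.

```go
package main

import (
	"fmt"
	"sort"
)

// pin is a minimal stand-in for a pin record; field names are
// illustrative, not taken from any real implementation.
type pin struct {
	CID     string
	Created int64 // unix timestamp
}

// pageByCreated returns one page of pins sorted by creation date.
// Because the key-value store yields pins in arbitrary order, the
// only way to honor (sort, offset, limit) is to read the whole
// pinset into memory first -- O(N) memory per request.
func pageByCreated(kvScan []pin, offset, limit int) []pin {
	all := append([]pin(nil), kvScan...) // full scan of the store
	sort.Slice(all, func(i, j int) bool { return all[i].Created < all[j].Created })
	if offset >= len(all) {
		return nil
	}
	end := offset + limit
	if end > len(all) {
		end = len(all)
	}
	return all[offset:end]
}

func main() {
	// Arbitrary-order scan results, as a KV store might return them.
	scan := []pin{{"bafy-c", 300}, {"bafy-a", 100}, {"bafy-b", 200}}
	page := pageByCreated(scan, 1, 2)
	fmt.Println(page[0].CID, page[1].CID) // bafy-b bafy-c
}
```

With millions of pins, every paginated request repeats that full scan, which is exactly why a hard pagination requirement is hostile to key-value backends.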

I'd like to at the very least update the Pagination and filtering section to loosen up requirements and provide some rules of thumb for service implementations backed by key-value stores.

@hsanjuan @SgtPooki What is the current behavior of ipfs-cluster around GET /pins, filtering and pagination? What would be the best compromise we should document?

Some ideas on how to handle the "sorting and filtering becomes too expensive" scenario:

Are there better ways?

hsanjuan commented 2 years ago

Listing 50M pins is going to suck in every model, pagination or not. The Cluster REST API switched to streaming pins on such requests, to avoid building up results in memory. If the Pinning Service API allows limit=50M, it will likely use a lot of memory on the backend while building the JSON response (unless it encodes on the fly and crosses its fingers that no errors happen). If it allows limit=100, then dealing with such a huge pinset costs thousands of requests (but at least they can be rate-limited, etc.). Streaming 50M items is also not fun.

I'm not sure what the best approach is for the Pinning Service API spec. If the pinset is small enough, sure, we can implement pagination and everything. But it sucks that /pins stops working once the pinset reaches a certain size. If the pinset is very big, I still need to construct the answer in memory, which also sucks with or without pagination.

In general, /pins, without an SQL-like backend that can offer good indexing and sorting, is going to suck as soon as it gets big. But that is probably not a problem of the Pinning Service API spec, but of the implementation?

hsanjuan commented 2 years ago

What is the current behavior of ipfs-cluster around GET /pins, filtering and pagination?

To be concrete: filtering is done, limit is done, pagination is not done, and, in general, the /pins endpoint is not suitable for big pinsets, as it will balloon the backend's memory usage.

guseggert commented 2 years ago

I ran into this same issue when writing a pinning service backed by a key-value store instead of a relational database (DynamoDB, which does have some limited ability to sort and filter keys with secondary indexes).

The biggest problem I ran into is the requirement to return "count" in responses, which is discussed here: https://github.com/ipfs/pinning-services-api-spec/issues/86.

The second problem I ran into is that the query parameters are really complex. To fully support all the variations of sorting and filters, while still returning dense results (which is a requirement for many different kinds of queries, like finding pins with certain statuses, finding pins by CID, pins by name, etc.), is hard to do in a highly-available way, even with a relational DB. If we were to overhaul the API, I'd advocate for removing many of the query params (consider removing filtering by metadata, pick a case sensitivity and be opinionated about it, only support "exact" name matches, only accept a single CID instead of a set, etc.).

SgtPooki commented 2 years ago

I agree with Gus here. Including count becomes unreasonable at scale, and many of the supported pagination and query parameters make fetching pins flexible for consumers but extremely difficult for providers.

A better model would be something similar to DynamoDB's limited response size, with on-the-spot pagination keys (nextToken, etc.), then allowing consumers/mid-tier services to filter the received data.
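A cursor-based listing of that kind can be sketched as follows (a hypothetical `listPage` function, not part of the spec or of DynamoDB's API): the service returns at most `limit` items in the store's native key order, plus an opaque token marking where it stopped; the client passes the token back to continue. No offset arithmetic and no total count, both of which are expensive on a key-value store.

```go
package main

import (
	"fmt"
	"sort"
)

// listPage returns up to limit keys that sort strictly after
// afterToken, plus a nextToken the client can pass back to resume.
// The in-memory sort here merely stands in for a KV store's native
// key order, which a real backend would iterate directly.
func listPage(keys []string, afterToken string, limit int) (page []string, nextToken string) {
	sorted := append([]string(nil), keys...)
	sort.Strings(sorted)
	for _, k := range sorted {
		if afterToken != "" && k <= afterToken {
			continue // resume strictly after the cursor
		}
		page = append(page, k)
		if len(page) == limit {
			nextToken = k // more may follow; hand the client a cursor
			break
		}
	}
	return page, nextToken
}

func main() {
	keys := []string{"pin-c", "pin-a", "pin-d", "pin-b"}
	page1, tok := listPage(keys, "", 2)
	page2, _ := listPage(keys, tok, 2)
	fmt.Println(page1, tok, page2) // [pin-a pin-b] pin-b [pin-c pin-d]
}
```

The token is opaque to the client, so the server is free to change how it resumes (last key seen, an encoded index entry, etc.) without breaking consumers; a final page may come back with an empty next token or simply yield an empty follow-up page.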

You can read more about how DynamoDB works at https://www.dynamodbguide.com/the-dynamo-paper/