ipfs / pinning-services-api-spec

Standalone, vendor-agnostic Pinning Service API for IPFS ecosystem
https://ipfs.github.io/pinning-services-api-spec/
Creative Commons Zero v1.0 Universal

Add PinStatus.created and time-based pagination #39

Closed lidel closed 4 years ago

lidel commented 4 years ago

This PR aims to address feedback about nondeterministic pagination and the inability to list pins created in a specific time range:

DOCS PREVIEW: https://ipfs.github.io/pinning-services-api-spec/#specUrl=https://raw.githubusercontent.com/ipfs/pinning-services-api-spec/feat/created-timestamp/ipfs-pinning-service.yaml

Closes #38 cc #12 @obo20 @GregTheGreek @priom @jsign @sanderpick @andrewxhill

obo20 commented 4 years ago

A few things on my mind here.

lidel commented 4 years ago

@obo20

obo20 commented 4 years ago
  • I'm ok with making the default lower, unsure if 100 or 10. A lower default of 10 sounds good for dev/terminal experimentation and encourages clients to use an explicit limit that suits their needs. Thoughts?

10 sounds perfect to me.

  • The idea is the client would take the created timestamp from the oldest result in the batch and ask for a batch older than that (passing the timestamp in the before filter).
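The scheme quoted above could look roughly like this on the client side (a sketch; `fetch_pins` is a hypothetical stand-in for `GET /pins?before=<ts>&limit=<n>`, and the fake backend data is illustrative):

```python
def paginate_all(fetch_pins, limit=10):
    """Drain a pin listing with value-based pagination.

    `fetch_pins(before, limit)` stands in for GET /pins?before=<ts>&limit=<n>,
    returning PinStatus-like dicts sorted newest-first, each carrying an
    RFC 3339 `created` string (which sorts lexicographically).
    """
    results, before = [], None
    while True:
        batch = fetch_pins(before=before, limit=limit)
        if not batch:
            break
        results.extend(batch)
        # Take `created` from the oldest (last) result in the batch
        # and ask for the next batch of pins older than that.
        before = batch[-1]["created"]
    return results

# Fake backend: 25 pins with unique, sortable timestamps (hypothetical data).
PINS = sorted(
    [{"cid": f"cid-{i}", "created": f"2020-07-27T10:00:{i:02d}Z"} for i in range(25)],
    key=lambda p: p["created"],
    reverse=True,
)

def fake_fetch(before, limit):
    older = [p for p in PINS if before is None or p["created"] < before]
    return older[:limit]

everything = paginate_all(fake_fetch, limit=10)
```

With a default limit of 10, this drains the 25-pin backend in three requests plus one empty terminating response.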

The biggest problem I see here is the new behavior of making the queued timestamp the same as the created timestamp. If we allow people to pass in up to 1000 items as part of the `POST /pins` endpoint (an array of CIDs), then we can potentially have up to 1000 items with the same created timestamp, which will make things problematic from a filtering perspective.

lidel commented 4 years ago

10 sounds perfect to me.

sgtm, set in 273d474

If we allow people to pass in up to 1000 items as part of the `POST /pins` endpoint (an array of CIDs), then we can potentially have up to 1000 items with the same created timestamp, which will make things problematic from a filtering perspective.

Good catch. The value-based pagination proposed in this PR assumes `PinStatus.created` is unique, immutable, and sortable.
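To make the quoted concern concrete, a small sketch (hypothetical pin data, `created` assumed to have one-second granularity) showing how a strictly-older before filter silently drops records that share a timestamp:

```python
# Five pins created by one bulk POST share a single `created` timestamp;
# one older pin exists from an earlier request.
pins = [{"cid": f"bulk-{i}", "created": "2020-07-27T10:00:00Z"} for i in range(5)]
pins.append({"cid": "older", "created": "2020-07-26T09:00:00Z"})
pins.sort(key=lambda p: p["created"], reverse=True)  # newest first, stable

page1 = pins[:3]               # bulk-0, bulk-1, bulk-2
cursor = page1[-1]["created"]  # "2020-07-27T10:00:00Z"

# Strictly-older filter: the two remaining duplicates are never returned.
page2 = [p for p in pins if p["created"] < cursor]
lost = {p["cid"] for p in pins} - {p["cid"] for p in page1 + page2}
```

(Using `<=` instead would re-deliver the whole duplicate group, so neither comparison works when the cursor value is not unique.)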

We could either:

After thinking about it for a moment, I think (B) would be the cleanest and least error-prone way to tackle this. I don't believe any existing vendor supports pinning multiple CIDs in a single call, and I'm unsure how valuable it would be to users anyway: if there is a need for that, one could simply create a DAG that has all of them and pin the root.

obo20 commented 4 years ago

A and B both seem like feasible options. B certainly seems the easiest to implement, and it also follows the pattern, already seen in the modify/delete endpoints, of only one CID operation happening at a time.

I think the spec could certainly be expanded to handle multiple requests in the future if we find that it's needed.

lidel commented 4 years ago

Ok, the rule of thumb for this API is to simplify and remove complexity where possible. I went with option (B): switched `POST /pins` to accept a single Pin object.
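For reference, a request body under the new single-object shape might look like this (a sketch; the CID is a placeholder, and optional field names other than `cid` follow my reading of the draft schema, so check them against the spec):

```python
import json

# A single Pin object for POST /pins (no surrounding array anymore).
pin = {
    "cid": "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi",
    "name": "my-dataset",                      # optional human-readable label
    "origins": ["/ip4/203.0.113.1/tcp/4001"],  # optional hint multiaddrs
    "meta": {"app": "example"},                # optional arbitrary metadata
}
body = json.dumps(pin)
```

The service would answer with a PinStatus object carrying the `created` timestamp used for pagination.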

@lanzafame I tried to find some rationale for accepting multiple objects, but found none on GitHub. I believe you initially added it in v0.0.1 – was that just future-proofing, or is there an actual need for this?

lanzafame commented 4 years ago

It was simply a "fewer requests, the better" thing, but the v0.0.1 version didn't have the level of metadata associated with pin creation that now clearly complicates how accepting an array actually plays out.

jacobheun commented 4 years ago

Perhaps we should just keep skip. Adding the created date is helpful, but skip sidesteps the issues that come with effectively trying to skip by created date. There is value in being able to fetch results before/after certain dates, but keeping the ability to skip as well would simplify these problems.

I'm not opposed to limiting 1 pin per post, but that feels like the wrong solution for this problem.

Is there a good reason to get rid of skip?

lidel commented 4 years ago

Is there a good reason to get rid of skip?

The only reason is to simplify the API (replace offset-based pagination with value-based pagination). If we keep skip, then we ask pinning services to support both pagination types. In practice, supporting offsets maps nicely to DB table rows and the overhead of supporting both is minimal, so the spec should be ok with keeping skip.

@obo20 any concerns with keeping skip?

jacobheun commented 4 years ago

Only reason is to simplify API (replace offset-based pagination with value-based one). If we keep skip then we ask pinning services to support both pagination types.

If we want to keep the API surface small, we could just drop the before and after filters for now. The main thing we lose is the ability to filter results by time, which I think has the most value in client-side caching ("I fetched pins yesterday, so only give me ones created after that time"), but I think it's a low priority.

obo20 commented 4 years ago

@lidel the main concerns with skip go back to the initial issue we were having with our pinned records being separate from our "ongoing operations" records (queued, searching, failed). The biggest reason for this is that the majority of our users will directly pin content via an http upload (and not from the network). As such it didn't make sense to keep "successful" pins in the same location as pins that had no guarantee of succeeding. I obviously can't speak for other service providers but I wouldn't be surprised if others had a similar separation of their database records.

We can certainly query both of these tables behind the scenes and then combine / parse the results before returning them to the user. However, there are a few problems with this:

Skip would be possible if we were able to logically separate "pinned" objects from "ongoing" operations (queued, searching, failed).

For reference, here are some of the options we discussed throughout various threads that would allow for offsets:

A) Only allow querying for 1 status of pins at a time. (Cleanest, but could introduce extra calls to the API).

B) Only allow for querying either "pinned" or "ongoing" statuses, but not both. (Fewer API calls, but doesn't feel very clean to me)

C) Have providers return a server error if they can't query both pinned / ongoing statuses at once (I would very much like to avoid this)

D) Return an object with an array for each status queried. Like this:

{ pinned: [], queued: [], searching: [] }

There's a few different ways pagination would work for option D:

I'm personally leaning towards Option A just because of how clean it is. The extra API calls don't really bother me as a service provider, as any latency should be fairly minimal and requests could easily be parallelized.

lidel commented 4 years ago

Ok, did some thinking and realized we should be cognizant of developer experience here:

AFAIK we solved those footguns by removing the skip filter and by making POST /pins accept only a single Pin.
@jacobheun I don't think we need skip; all pagination needs can be met in a safer manner with before alone.

:point_right: In other words, I propose we merge this PR as-is.

Below is my rationale.


Why pagination with before is safer and more efficient than skip

The main goal of this PR is to switch from offset-based pagination (skip) to value-based pagination (before).

:broken_heart: When skip is used and implemented naively, the backend needs to read the entire dataset and then drop some results. It is non-trivial to implement this safely without a deeper understanding of RDB mechanics, and it is easy to compromise performance.

:green_heart: When before is used, the backend reads only from the specified point, using a timestamp column that is unique, sorted, and immutable, which makes range requests efficient. This maps well to RDBs: one can leverage indexes, etc.

I know the problems with offset pagination can be solved with some additional work on the backend, but we have a value-based option that works better without doing anything fancy.
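As a rough sketch of the two query shapes, assuming a hypothetical `pins` table with an indexed `created` column (SQLite here stands in for whatever RDB a service actually runs):

```python
import sqlite3

# Hypothetical pins table with an index on the pagination column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pins (cid TEXT, created TEXT)")
conn.execute("CREATE INDEX idx_created ON pins (created)")
conn.executemany(
    "INSERT INTO pins VALUES (?, ?)",
    [(f"cid-{i}", f"2020-07-27T10:00:{i:02d}Z") for i in range(30)],
)

# Offset-based (skip): the engine must walk past all skipped rows first.
offset_page = conn.execute(
    "SELECT cid FROM pins ORDER BY created DESC LIMIT 10 OFFSET 20"
).fetchall()

# Value-based (before): an index seek straight to the start of the range.
cursor = "2020-07-27T10:00:10Z"
value_page = conn.execute(
    "SELECT cid FROM pins WHERE created < ? ORDER BY created DESC LIMIT 10",
    (cursor,),
).fetchall()
```

Both queries return the same page here, but the `WHERE created < ?` form can use the index to jump directly to the range start instead of scanning and discarding 20 rows.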

To illustrate, I borrowed some visuals from this blog post:

Offset-based pagination (:broken_heart: skip)

Value-based pagination (:green_heart: before)

Footgun 1: the more records, the longer it takes to skip

Blue is the time it takes to produce an offset-based response; it grows with the number of records. Red is value-based pagination; it grows much more slowly.

Footgun 2: if a pin from the previous batch is removed, we may skip too much

If we paginate via before, we don't have this problem, because each batch always starts at a deterministic record.
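Footgun 2 can be simulated in a few lines (hypothetical data; page size of 3):

```python
# A listing of six pins, newest first, with unique created timestamps.
pins = [{"cid": f"cid-{i}", "created": f"2020-07-27T10:00:{i:02d}Z"}
        for i in range(6)]
listing = sorted(pins, key=lambda p: p["created"], reverse=True)

page_size = 3
page1 = listing[:page_size]    # cid-5, cid-4, cid-3
cursor = page1[-1]["created"]  # remember the oldest `created` seen so far

# A pin from page 1 is removed before the client fetches page 2.
listing = [p for p in listing if p["cid"] != "cid-4"]

# skip-based page 2: all rows shifted up by one, so cid-2 is silently skipped.
skip_page2 = [p["cid"] for p in listing[page_size:]]

# before-based page 2: deterministically resumes right after cid-3.
before_page2 = [p["cid"] for p in listing if p["created"] < cursor][:page_size]
```

The skip-based page loses `cid-2` entirely, while the before-based page starts exactly where the previous batch ended.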