ipfs / pinning-services-api-spec

Standalone, vendor-agnostic Pinning Service API for IPFS ecosystem
https://ipfs.github.io/pinning-services-api-spec/
Creative Commons Zero v1.0 Universal

Add PinStatus.created and time-based pagination #39

Closed lidel closed 4 years ago

lidel commented 4 years ago

This PR aims to address feedback about nondeterministic pagination and the inability to list pins created in a specific time range:

DOCS PREVIEW: https://ipfs.github.io/pinning-services-api-spec/#specUrl=https://raw.githubusercontent.com/ipfs/pinning-services-api-spec/feat/created-timestamp/ipfs-pinning-service.yaml

Closes #38 cc #12 @obo20 @GregTheGreek @priom @jsign @sanderpick @andrewxhill

obo20 commented 4 years ago

A few things on my mind here.

lidel commented 4 years ago

@obo20

obo20 commented 4 years ago
  • I'm ok with making the default lower, unsure if 100 or 10. A lower default of 10 sounds good for dev/terminal experimentation and encourages clients to use an explicit limit that suits their needs. Thoughts?

10 sounds perfect to me.

  • The idea is the client would take the created timestamp from the oldest result in the batch and ask for a batch older than that (passing the timestamp in the before filter).
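The scheme quoted above could look roughly like this on the client side (a sketch; `fetch_pins` is a hypothetical stand-in for `GET /pins?before=<ts>&limit=<n>`, and the fake backend data is illustrative):

```python
def paginate_all(fetch_pins, limit=10):
    """Drain a pin listing with value-based pagination.

    `fetch_pins(before, limit)` stands in for GET /pins?before=<ts>&limit=<n>,
    returning PinStatus-like dicts sorted newest-first, each carrying an
    RFC 3339 `created` string (which sorts lexicographically).
    """
    results, before = [], None
    while True:
        batch = fetch_pins(before=before, limit=limit)
        if not batch:
            break
        results.extend(batch)
        # Take `created` from the oldest (last) result in the batch
        # and ask for the next batch of pins older than that.
        before = batch[-1]["created"]
    return results

# Fake backend: 25 pins with unique, sortable timestamps (hypothetical data).
PINS = sorted(
    [{"cid": f"cid-{i}", "created": f"2020-07-27T10:00:{i:02d}Z"} for i in range(25)],
    key=lambda p: p["created"],
    reverse=True,
)

def fake_fetch(before, limit):
    older = [p for p in PINS if before is None or p["created"] < before]
    return older[:limit]

everything = paginate_all(fake_fetch, limit=10)
```

With a default limit of 10, this drains the 25-pin backend in three requests plus one empty terminating response.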

The biggest problem I see here is the new behavior of making the queued timestamp the same as the created timestamp. If we allow people to pass in up to 1000 items as part of the `POST /pins` endpoint (an array of CIDs), then we can potentially have up to 1000 items with the same created timestamp, which will make things problematic from a filtering perspective.

lidel commented 4 years ago

10 sounds perfect to me.

sgtm, set in 273d474

If we allow people to pass in up to 1000 items as part of the `POST /pins` endpoint (an array of CIDs), then we can potentially have up to 1000 items with the same created timestamp, which will make things problematic from a filtering perspective.

Good catch. The value-based pagination proposed in this PR assumes `PinStatus.created` is unique, immutable, and sortable.
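To make the quoted concern concrete, a small sketch (hypothetical pin data, `created` assumed to have one-second granularity) showing how a strictly-older before filter silently drops records that share a timestamp:

```python
# Five pins created by one bulk POST share a single `created` timestamp;
# one older pin exists from an earlier request.
pins = [{"cid": f"bulk-{i}", "created": "2020-07-27T10:00:00Z"} for i in range(5)]
pins.append({"cid": "older", "created": "2020-07-26T09:00:00Z"})
pins.sort(key=lambda p: p["created"], reverse=True)  # newest first, stable

page1 = pins[:3]               # bulk-0, bulk-1, bulk-2
cursor = page1[-1]["created"]  # "2020-07-27T10:00:00Z"

# Strictly-older filter: the two remaining duplicates are never returned.
page2 = [p for p in pins if p["created"] < cursor]
lost = {p["cid"] for p in pins} - {p["cid"] for p in page1 + page2}
```

(Using `<=` instead would re-deliver the whole duplicate group, so neither comparison works when the cursor value is not unique.)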

We could either:

After thinking about it for a moment, I think (B) would be the cleanest and least error-prone way to tackle this. I don't believe any existing vendor supports pinning multiple CIDs in a single call, and I'm unsure how valuable it would be to users anyway: if there is a need for that, one could simply create a DAG that has all of them and pin the root.

obo20 commented 4 years ago

A and B both seem like feasible options. B certainly seems the easiest to implement, and it also follows the pattern, already seen in the modify/delete endpoints, of only one CID operation happening at a time.

I think the spec could certainly be expanded to handle multiple requests in the future if we find that it's needed.

lidel commented 4 years ago

Ok, the rule of thumb for this API is to simplify and remove complexity where possible. I went with option (B): switched `POST /pins` to accept a single Pin object.
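For reference, a request body under the new single-object shape might look like this (a sketch; the CID is a placeholder, and optional field names other than `cid` follow my reading of the draft schema, so check them against the spec):

```python
import json

# A single Pin object for POST /pins (no surrounding array anymore).
pin = {
    "cid": "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi",
    "name": "my-dataset",                      # optional human-readable label
    "origins": ["/ip4/203.0.113.1/tcp/4001"],  # optional hint multiaddrs
    "meta": {"app": "example"},                # optional arbitrary metadata
}
body = json.dumps(pin)
```

The service would answer with a PinStatus object carrying the `created` timestamp used for pagination.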

@lanzafame I tried to find some rationale for accepting multiple objects, but found none on GitHub. I believe you initially added it in v0.0.1 – was that just future-proofing, or is there an actual need for this?

lanzafame commented 4 years ago

It was simply a "fewer requests, the better" thing, but the v0.0.1 version didn't have the level of metadata associated with pin creation that now clearly complicates how accepting an array actually plays out.

jacobheun commented 4 years ago

Perhaps we should just keep skip. Adding the created date is helpful, but skip sidesteps the issues that come with effectively trying to skip by created date. There is value in being able to fetch results before/after certain dates, but keeping the ability to skip as well would simplify these problems.

I'm not opposed to limiting 1 pin per post, but that feels like the wrong solution for this problem.

Is there a good reason to get rid of skip?

lidel commented 4 years ago

Is there a good reason to get rid of skip?

The only reason is to simplify the API (replace offset-based pagination with value-based pagination). If we keep skip, then we ask pinning services to support both pagination types. In practice, supporting offsets maps nicely to DB table rows and the overhead of supporting both is minimal, so the spec should be ok with keeping skip.

@obo20 any concerns with keeping skip?

jacobheun commented 4 years ago

Only reason is to simplify API (replace offset-based pagination with value-based one). If we keep skip then we ask pinning services to support both pagination types.

If we want to keep the API surface small, we could just drop the before and after filters for now. The main thing we lose is the ability to filter results by time, which I think has the most value in client-side caching ("I fetched pins yesterday, so only give me ones created after that time"), but I think it's a low priority.

obo20 commented 4 years ago

@lidel the main concerns with skip go back to the initial issue we were having with our pinned records being separate from our "ongoing operations" records (queued, searching, failed). The biggest reason for this is that the majority of our users will directly pin content via an http upload (and not from the network). As such it didn't make sense to keep "successful" pins in the same location as pins that had no guarantee of succeeding. I obviously can't speak for other service providers but I wouldn't be surprised if others had a similar separation of their database records.

We can certainly query both of these tables behind the scenes and then combine / parse the results before returning them to the user. However, there are a few problems with this:

Skip would be possible if we were able to logically separate "pinned" objects from "ongoing" operations (queued, searching, failed).

For reference, here are some of the options we discussed throughout various threads that would allow for offsets:

A) Only allow querying for 1 status of pins at a time. (Cleanest, but could introduce extra calls to the API).

B) Only allow for querying either "pinned" or "ongoing" statuses, but not both. (Fewer API calls, but doesn't feel very clean to me)

C) Have providers return a server error if they can't query both pinned / ongoing statuses at once (I would very much like to avoid this)

D) Return an object with an array for each status queried. Like this:

{ pinned: [], queued: [], searching: [] }

There's a few different ways pagination would work for option D:

I'm personally leaning towards Option A just because of how clean it is. The extra API calls don't really bother me as a service provider, as any latency should be fairly minimal and requests could easily be parallelized.

lidel commented 4 years ago

Ok, did some thinking and realized we should be cognizant of developer experience here:

AFAIK we solved those footguns by removing the skip filter and by making POST /pins accept only a single Pin.
@jacobheun I don't think we need skip; all pagination needs can be met in a safer manner with before alone.

:point_right: In other words, I propose we merge this PR as-is.

Below is my rationale.


Why pagination with before is safer and more efficient than skip

The main goal of this PR is to switch from offset-based pagination (skip) to value-based pagination (before).

:broken_heart: When skip is used and implemented naively, the backend needs to read the entire dataset and then drop some results. It is non-trivial to implement this safely without a deeper understanding of RDB mechanics, and it is easy to compromise performance.

:green_heart: When before is used, the backend reads only from the specified point, using a timestamp column that is unique, sorted, and immutable, which makes range requests efficient. This maps well to RDBs: one can leverage indexes, etc.

I know the problems with offset pagination can be solved with some additional work on the backend, but we have a value-based option that works better without doing anything fancy.
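As a rough sketch of the two query shapes, assuming a hypothetical `pins` table with an indexed `created` column (SQLite here stands in for whatever RDB a service actually runs):

```python
import sqlite3

# Hypothetical pins table with an index on the pagination column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pins (cid TEXT, created TEXT)")
conn.execute("CREATE INDEX idx_created ON pins (created)")
conn.executemany(
    "INSERT INTO pins VALUES (?, ?)",
    [(f"cid-{i}", f"2020-07-27T10:00:{i:02d}Z") for i in range(30)],
)

# Offset-based (skip): the engine must walk past all skipped rows first.
offset_page = conn.execute(
    "SELECT cid FROM pins ORDER BY created DESC LIMIT 10 OFFSET 20"
).fetchall()

# Value-based (before): an index seek straight to the start of the range.
cursor = "2020-07-27T10:00:10Z"
value_page = conn.execute(
    "SELECT cid FROM pins WHERE created < ? ORDER BY created DESC LIMIT 10",
    (cursor,),
).fetchall()
```

Both queries return the same page here, but the `WHERE created < ?` form can use the index to jump directly to the range start instead of scanning and discarding 20 rows.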

To illustrate, I borrowed some visuals from this blog post:

Offset-based pagination (:broken_heart: skip)

Value-based pagination (:green_heart: before)

Footgun 1: the more records, the longer it takes to skip

Blue is the time it takes to produce an offset-based response; it grows with the number of records. Red is value-based pagination; it grows much more slowly.

Footgun 2: if a pin from the previous batch is removed, we may skip too much

If we paginate via before, we don't have this problem, because each batch always starts at a deterministic record.
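Footgun 2 can be simulated in a few lines (hypothetical data; page size of 3):

```python
# A listing of six pins, newest first, with unique created timestamps.
pins = [{"cid": f"cid-{i}", "created": f"2020-07-27T10:00:{i:02d}Z"}
        for i in range(6)]
listing = sorted(pins, key=lambda p: p["created"], reverse=True)

page_size = 3
page1 = listing[:page_size]    # cid-5, cid-4, cid-3
cursor = page1[-1]["created"]  # remember the oldest `created` seen so far

# A pin from page 1 is removed before the client fetches page 2.
listing = [p for p in listing if p["cid"] != "cid-4"]

# skip-based page 2: all rows shifted up by one, so cid-2 is silently skipped.
skip_page2 = [p["cid"] for p in listing[page_size:]]

# before-based page 2: deterministically resumes right after cid-3.
before_page2 = [p["cid"] for p in listing if p["created"] < cursor][:page_size]
```

The skip-based page loses `cid-2` entirely, while the before-based page starts exactly where the previous batch ended.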