Closed: lidel closed this 4 years ago
A few things on my mind here.
Do we need to default the limit to 1000? Keeping 1000 as a maximum seems fine to me, but I'm struggling to see the use cases for needing 1000 pins at a time by default. I would rather see something like 10-100 as the default and then applications needing the higher return limit can explicitly pass in the higher limit. That way we're not being wasteful by default if the user doesn't actually need to have 1000 items returned every call.
With this new filtering by date paradigm, how do we envision applications handling instances where there are more records than the return limit? If there are more instances than the limit returned, do they just make another request where the date is a millisecond after the oldest pin in their most recently returned set of items?
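The pattern described above (request the next batch using the oldest timestamp from the previous one) can be sketched as a client loop against an in-memory stand-in for a pinning service. Everything here is illustrative except the `limit`/`before`/`created` names from the spec:

```python
from datetime import datetime, timedelta

# Illustrative in-memory stand-in for a pinning service's dataset;
# each pin has a unique, immutable `created` timestamp.
PINS = [{"cid": f"bafy{i:03d}", "created": datetime(2020, 1, 1) + timedelta(seconds=i)}
        for i in range(25)]

def list_pins(limit=10, before=None):
    """Simulate GET /pins?limit=...&before=... (newest first)."""
    rows = [p for p in PINS if before is None or p["created"] < before]
    rows.sort(key=lambda p: p["created"], reverse=True)
    return rows[:limit]

# Client loop: take the `created` timestamp of the oldest pin in each
# batch and pass it as `before` for the next request.
fetched, before = [], None
while True:
    batch = list_pins(limit=10, before=before)
    if not batch:
        break
    fetched.extend(batch)
    before = batch[-1]["created"]  # oldest item in this batch
```

With 25 records and a limit of 10, the loop makes three requests (10 + 10 + 5) and terminates when an empty batch comes back.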
@obo20
- I'm ok with making the default lower. Unsure if 100 or 10. Lower 10 sounds like a good default for dev/terminal experimentation and encourages clients to use an explicit `limit` that suits their needs. Thoughts?
- The idea is that the client would take the `created` timestamp from the oldest result in the batch and ask for a batch older than that (passing the timestamp in the `before` filter).

10 sounds perfect to me.
The biggest problem I see here is the new behavior of making the `queued` timestamp the same as the `created` timestamp. If we allow people to pass in up to 1000 items as part of the `POST /pins` (array of CIDs) endpoint, then we can potentially have up to 1000 items with the same `created` timestamp, which will make things problematic from a filtering perspective.
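A quick sketch of why identical `created` values break naive `before` filtering (illustrative data; a real service would hit this after a batch POST):

```python
# Five pins from a single batch POST share the same `created` value,
# plus one older pin. All names and values here are illustrative.
pins = [{"cid": f"bafybatch{i}", "created": 100} for i in range(5)]
pins.append({"cid": "bafyold", "created": 99})

def list_pins(before=None, limit=3):
    """Naive value-based pagination: strictly older than `before`."""
    rows = sorted((p for p in pins if before is None or p["created"] < before),
                  key=lambda p: p["created"], reverse=True)
    return rows[:limit]

first = list_pins(limit=3)                       # three of the created=100 pins
second = list_pins(before=first[-1]["created"])
# `before=100` jumps past ALL five created=100 pins at once, so the two
# that were not in the first batch are never returned to the client.
```

Here `second` contains only the older pin; the two remaining created=100 pins are silently lost, which is the filtering problem described above.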
10 sounds perfect to me.
sgtm, set in 273d474
If we allow people to pass in up to 1000 items as part of the `POST /pins` (array of CIDs) endpoint then we can potentially have up to 1000 items with the same `created` timestamp which will make things problematic from a filtering perspective.
Good catch.
Value-based pagination proposed in this PR assumes `PinStatus.created` is unique+immutable+sortable.
We could either:
(A) introduce a secondary cursor (`&cursor={cid-of-pin-object}`) and sort results lexicographically by CID (not the best for performance and UX/DX), or
(B) change `POST /pins` to accept only one pin at a time (removes the surface for introducing two pins with the same creation date).
After thinking about it for a moment, I think (B) would be the cleanest and least error-prone way to tackle this. I don't believe any existing vendor supports pinning multiple CIDs in a single call, and I'm unsure how valuable it would be to users anyway: if there is a need for that, one could simply create a DAG that has all of them and pin the root.
A and B both seem like feasible options. B certainly seems to be the easiest to implement, and it also follows the pattern of only one CID operation happening at a time that is seen in the modify / delete endpoints.
I think the spec could certainly be expanded to handle multiple requests in the future if we find that it's needed.
Ok, the rule of thumb for this API is to simplify and remove complexity where possible.
Went with option (B): switched `POST /pins` to accept a single `Pin` object.
@lanzafame I tried to find some rationale for accepting multiple objects, but found none on github. I believe you initially added it in v0.0.1 – was that just future-proofing, or is there an actual need for this?
It was simply a "the fewer requests the better" thing, but the v0.0.1 version didn't have the level of metadata creation that now clearly complicates how accepting an array actually plays out.
Perhaps we should just keep `skip`. Adding the created date is helpful, but `skip` mitigates the issues with effectively trying to skip by created date. There is value in being able to fetch results before/after certain dates, but also being able to skip would simplify these problems.
I'm not opposed to limiting 1 pin per post, but that feels like the wrong solution for this problem.
Is there a good reason to get rid of `skip`?
Is there a good reason to get rid of `skip`?

The only reason is to simplify the API (replace offset-based pagination with a value-based one). If we keep `skip` then we ask pinning services to support both pagination types.
In practice, supporting offsets maps nicely to DB table rows and the overhead to support both is minimal, so the spec should be ok with keeping `skip`.
@obo20 any concerns with keeping `skip`?
Only reason is to simplify API (replace offset-based pagination with value-based one). If we keep skip then we ask pinning services to support both pagination types.
If we want to keep the API surface low we could just drop the `before` and `after` filters for now. The main thing we lose here is the ability to filter results by time, which I think has the most value in client-side caching (I fetched pins yesterday, so only give me ones after that time), but I think it's a low priority.
@lidel the main concerns with skip go back to the initial issue we were having with our pinned records being separate from our "ongoing operations" records (queued, searching, failed). The biggest reason for this is that the majority of our users will directly pin content via an http upload (and not from the network). As such it didn't make sense to keep "successful" pins in the same location as pins that had no guarantee of succeeding. I obviously can't speak for other service providers but I wouldn't be surprised if others had a similar separation of their database records.
We can certainly query both of these tables behind the scenes and then combine / parse the results before returning them to the user. However, there are a few problems with this:
`skip` would be possible if we were able to logically separate "pinned" objects from "ongoing" operations (queued, searching, failed).
For reference, here are some of the options we discussed throughout various threads that would allow for offsets:
A) Only allow querying for 1 status of pins at a time. (Cleanest, but could introduce extra calls to the API).
B) Only allow for querying either "pinned" or "ongoing" statuses, but not both. (Less API calls but doesn't feel very clean to me)
C) Have providers return a server error if they can't query both pinned / ongoing statuses at once (I would very much like to avoid this)
D) Return an object with an array for each status queried. Like this:
{ pinned: [], queued: [], searching: [] }
There's a few different ways pagination would work for option D:
I'm personally leaning towards Option A just because of how clean it is. The extra API calls don't really bother me as a service provider, as any latency should be fairly minimal and requests could easily be parallelized.
Ok, did some thinking and realized we should be cognizant of developer experience here:
AFAIK we solved those footguns by removing the `skip` filter and by making `POST /pins` accept only a single `Pin`.
@jacobheun I don't think we need `skip`; all pagination needs can be met in a safer manner with `before` alone.
:point_right: In other words, I propose we merge this PR as-is.
Below is my rationale.
`before` is safer and more efficient than `skip`

The main goal of this PR is to switch from offset-based pagination (`skip`) to value-based pagination (`before`).
:broken_heart: When `skip` is used and implemented naively, the backend needs to read the entire dataset and then drop some results. It is non-trivial to implement this safely without a deeper understanding of RDB mechanics and without compromising performance.
:green_heart: When `before` is used, the backend only reads from the specified point, using a timestamp column that is unique+sorted+immutable, which makes range requests efficient. This maps well to RDBs; one can leverage indexes, etc.
I know the problems with offset pagination can be solved with some additional work on the backend, but we have a value-based option that works better without doing anything fancy.
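The two access patterns map to SQL roughly like this (an sqlite sketch; the table and column names are assumptions, not part of the spec):

```python
import sqlite3

# Illustrative schema; in a real service `created` would be a proper
# timestamp column backed by an index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pins (cid TEXT, created TEXT)")
conn.executemany("INSERT INTO pins VALUES (?, ?)",
                 [(f"bafy{i}", f"2020-01-01T00:00:{i:02d}Z") for i in range(30)])
conn.execute("CREATE INDEX idx_created ON pins (created)")

# Offset-based (`skip`): the engine still has to walk past the skipped rows.
offset_page = conn.execute(
    "SELECT cid FROM pins ORDER BY created DESC LIMIT 10 OFFSET 10").fetchall()

# Value-based (`before`): the index on `created` lets the engine seek
# straight to the starting point instead of scanning rows it will drop.
value_page = conn.execute(
    "SELECT cid FROM pins WHERE created < ? ORDER BY created DESC LIMIT 10",
    ("2020-01-01T00:00:20Z",)).fetchall()
```

Both queries return the same page here; the difference is the access pattern, and only the value-based one stays cheap as the table grows.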
To illustrate, borrowed some visuals from this blogpost:
Offset-based pagination (:broken_heart: `skip`) vs. value-based pagination (:green_heart: `before`):
- Footgun 1: the more records, the longer it takes to `skip`. Blue is the time it takes to produce an offset-based response, and it grows with the number of records; red is value-based pagination, which grows much slower.
- Footgun 2: if a pin from a previous batch is removed, we may `skip` too much. If we paginate via `before`, we don't have this problem because a batch always starts at a deterministic record.
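The deletion footgun can be reproduced in a few lines (illustrative in-memory data, newest-first listing):

```python
# Ten pins with unique integer `created` values; all data is illustrative.
pins = [{"cid": f"bafy{i}", "created": i} for i in range(10)]

def page_by_offset(data, offset, limit=3):
    """skip-style pagination: drop `offset` rows, return the next `limit`."""
    rows = sorted(data, key=lambda p: p["created"], reverse=True)
    return rows[offset:offset + limit]

def page_by_before(data, before, limit=3):
    """before-style pagination: start at a deterministic record."""
    rows = sorted((p for p in data if p["created"] < before),
                  key=lambda p: p["created"], reverse=True)
    return rows[:limit]

first = page_by_offset(pins, 0)                # created 9, 8, 7
pins = [p for p in pins if p["created"] != 9]  # a pin from the first batch is removed

offset_page2 = page_by_offset(pins, 3)         # created 5, 4, 3: pin 6 was silently skipped!
before_page2 = page_by_before(pins, before=first[-1]["created"])  # created 6, 5, 4
```

After the deletion, the offset-based second page jumps over the record with `created=6`, while the `before`-based page still starts exactly where the first batch ended.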
This PR aims to address feedback about nondeterministic pagination and the inability to list pins created in a specific time range:
- adds a `PinStatus.created` timestamp, which indicates when a pin request was registered at the pinning service (queued)
- removes the `skip` parameter from `GET /pins`
- adds `before` and `after` filters to enable time-based pagination over items in a deterministic way

DOCS PREVIEW: https://ipfs.github.io/pinning-services-api-spec/#specUrl=https://raw.githubusercontent.com/ipfs/pinning-services-api-spec/feat/created-timestamp/ipfs-pinning-service.yaml
Closes #38 cc #12 @obo20 @GregTheGreek @priom @jsign @sanderpick @andrewxhill