oxidecomputer / omicron

Omicron: Oxide control plane

Tracking: Pagination #2449

Open · david-crespo opened this issue 1 year ago

Going through pages of results in the external API is not very good right now. The biggest problem is that you can only go forward. If you have a cursor pointing to a page, there is simply no way to access the previous page without starting over at the beginning of the collection and paging through. Details are in the Dropshot issue, but most likely we need to be able to ask for the page before a given cursor, which would require splitting the existing page_token param into two: starting_from (same behavior as current page_token) and ending_before.
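
To sketch what that split might look like on the wire (these names are just the proposal above, not an existing Dropshot API):

```rust
use schemars::JsonSchema;
use serde::Deserialize;

/// Hypothetical replacement for the single `page_token` query parameter.
/// A request would supply at most one of the two cursors.
#[derive(Debug, Deserialize, JsonSchema)]
struct PageCursors {
    /// Return the page starting at this cursor (same behavior as the
    /// current `page_token`).
    starting_from: Option<String>,
    /// Return the page of results immediately preceding this cursor.
    ending_before: Option<String>,
}
```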

Another problem is that we will always give a next_page cursor regardless of whether there are actually any more items. As discussed in the Dropshot issue, we need to give the consumer (Nexus) a way of telling Dropshot whether there should be a next_page param, and it would be up to Nexus to determine the criterion. A very simple criterion that would require changing very little of our code would be to include next_page only if the number of items on the current page equals the page size limit. This would have a 1/limit chance of being wrong, which is not that bad. A more robust but more complex approach would involve fetching limit + 1 results for a given page so we would actually know whether there are more items. The extra precision here is somewhat undermined by the fact that the actual number of items available could change between requests regardless.
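
As a sketch of the limit + 1 idea (a hypothetical helper, not current Nexus code):

```rust
/// Hypothetical helper: the caller fetches `limit + 1` rows from the
/// database, and the presence of the extra row tells us definitively
/// whether a `next_page` token should be included in the response.
fn trim_page<T>(mut rows: Vec<T>, limit: usize) -> (Vec<T>, bool) {
    let has_next_page = rows.len() > limit;
    rows.truncate(limit);
    (rows, has_next_page)
}
```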

This mostly depends on work that would take place in Dropshot, so fittingly both of these issues are captured there.

- [ ] https://github.com/oxidecomputer/dropshot/issues/436
- [ ] https://github.com/oxidecomputer/dropshot/issues/20
- [ ] https://github.com/oxidecomputer/omicron/issues/2343
wez commented 3 months ago

Hi! I've been playing with dropshot since @sunshowers mentioned it to me. I'm generally positive about it, but the pagination interface feels a bit awkward to me. I suspect that, when this was designed, you had a different higher-level approach in mind than what I want to do, and that I may be running into that impedance mismatch! I wanted to share my experience here, where it appears that you're making design considerations. If this is not the best place to share this, I apologize in advance!

Note that I'm not currently using an openapi-generated client to access this endpoint, which may be a factor in my impressions around this interface.

The context for my application is that I have an endpoint that performs a sqlite full text query. I have a simple q GET parameter that contains the full-text query string. The goal of the endpoint is to pass q to sqlite and return the results. For pagination, what I want to do is pass in LIMIT and OFFSET to the underlying query.
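
Concretely, something like this (a sketch assuming an FTS5 table called `songs_fts` and the rusqlite crate; all names are illustrative):

```rust
use rusqlite::{params, Connection, Result};

// Pass the full-text query straight through to SQLite and let
// LIMIT/OFFSET do the paging.
fn search_songs(conn: &Connection, q: &str, limit: i64, offset: i64) -> Result<Vec<String>> {
    let mut stmt = conn.prepare(
        "SELECT title FROM songs_fts WHERE songs_fts MATCH ?1
         ORDER BY rank LIMIT ?2 OFFSET ?3",
    )?;
    let rows = stmt.query_map(params![q, limit, offset], |row| row.get(0))?;
    rows.collect()
}
```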

In addition to the list of things you've collected above, a couple of things that feel surprising/awkward to me are:

$ curl 'http://192.168.1.92:4242/search/songs?page=eyJ2IjoidjEiLCJwYWdlX3N0YXJ0Ijp7InEiOiJsb3ZlIiwib2Zmc2V0IjozfX0=' -s | jq .
{
  "request_id": "ea77c161-5b23-49ed-9716-79ff2ccf6609",
  "message": "unable to parse query string: missing field `q`"
}
For context/clarity, here are my type definitions:

```rust
#[derive(Debug, Deserialize, JsonSchema)]
struct SongSearchParams {
    q: String,
}

#[derive(Debug, Deserialize, Serialize)]
struct SongSearchPage {
    q: String,
    offset: usize,
}

#[endpoint {
    method = GET,
    path = "/search/songs"
}]
async fn search_songs(
    request_context: RequestContext<ServerState>,
    query: Query<PaginationParams<SongSearchParams, SongSearchPage>>,
) -> Result<HttpResponseOk<ResultsPage<SongMetadata>>, HttpError>;
```

I'm not sure that there is benefit to me in trying to conform to the pagination interface as it stands now; the bifurcation of the query logic around the two WhichPage enum variants adds complexity to the underlying query construction, and it feels overall much simpler and easier to follow if the pagination parameters are explicitly part of the interface, for both the client and the server:

```rust
#[derive(Debug, Deserialize, JsonSchema)]
struct SongSearchParams {
    q: String,
    limit: Option<u64>,
    offset: Option<u64>,
}

#[endpoint {
    method = GET,
    path = "/search/songs"
}]
async fn search_songs(
    request_context: RequestContext<ServerState>,
    query: Query<SongSearchParams>,
) -> Result<HttpResponseOk<Vec<SongMetadata>>, HttpError>;
```

Am I missing something important that led to the current shape of the pagination interface?

jgallagher commented 3 months ago

I wonder if there should be a flavor of PaginationParams that is more directly geared to the SQL LIMIT,OFFSET pattern?

I'd defer to the folks who came up with that interface, but I think it's intentionally designed to encourage cursor-based pagination instead of LIMIT,OFFSET-based pagination. LIMIT M OFFSET N still requires the database to fetch those M + N rows, so the later you get in the page count, the more rows from previous pages it has to read and discard on each query. (I think it's also more fragile in the face of the underlying data changing, although pagination over a changing dataset is always a little tricky.)

davepacheco commented 3 months ago

Yeah, so the consumer model we were going for is described in this part of the docs. The rationale is described in this comment. Certainly, my experience has been that limit/offset pagination is strongly discouraged for the reasons @jgallagher mentioned, which are described in more detail in the linked blog post. To summarize: limit/offset leads to invisible inconsistencies (you can miss items during a scan and have no way to know that that's happened) and the implementation usually doesn't scale well (OFFSET N LIMIT M usually requires reading N + M rows, so that scanning a whole collection this way is usually O(n^2)). The sqlite docs say this, too (under "What Not To Do" at the bottom). There are surely use cases where these are not problems, and so you can do this in Dropshot if you want, but we definitely wanted to steer our own consumers away from limit/offset pagination.
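
For contrast, the keyset ("seek") approach the docs steer you toward looks roughly like this against a SQL database (an illustrative sketch, again using rusqlite and a hypothetical `songs` table with a unique, indexed `name` column):

```rust
use rusqlite::{params, Connection, Result};

// Resume from the last key seen instead of an offset, so the database
// can seek directly to the start of the page via the index on `name`.
fn songs_after(conn: &Connection, last_name: Option<&str>, limit: i64) -> Result<Vec<String>> {
    let mut stmt = conn.prepare(
        "SELECT name FROM songs WHERE name > ?1 ORDER BY name LIMIT ?2",
    )?;
    // For the first page, seek from the empty string, which sorts
    // before any non-empty name.
    let rows =
        stmt.query_map(params![last_name.unwrap_or(""), limit], |row| row.get(0))?;
    rows.collect()
}
```

Each page costs O(limit) rather than O(offset + limit), and a deletion earlier in the collection can't silently shift the page boundaries under the client.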

The interface assumes that there is a way to seek data based on a row-level ID, but this is an uncommon capability in RDBMS where the SQL LIMIT, OFFSET is the usual approach. I can bypass this concept by simply ignoring the item parameter in the get_page_selector callback and instead compute my own offset parameter, but I can't help but feel like I'm immediately using this incorrectly.

I think this the right outcome: we want to make it hard to accidentally do this, but if someone understands the tradeoffs and wants to do it, it works.

There is redundancy in Query parameter passing: I have to pass in my q parameter for subsequent pages, even though the interface requires that I redundantly encode ScanParams (which contains my q) into PageSelector for the WhichPage::Next variant.

You shouldn't have to pass q into requests for pages after the first one. You gave this example:

$ curl 'http://192.168.1.92:4242/search/songs?page=eyJ2IjoidjEiLCJwYWdlX3N0YXJ0Ijp7InEiOiJsb3ZlIiwib2Zmc2V0IjozfX0=' -s | jq .
{
  "request_id": "ea77c161-5b23-49ed-9716-79ff2ccf6609",
  "message": "unable to parse query string: missing field `q`"
}

You used the query parameter page, but it's called page_token. I took your types, put them into a basic dropshot server (based on dropshot/examples/basic.rs), passed in a valid page_token, and I do not get any error. Relatedly: if you had passed q in the query string to silence that error, the request would not have been doing what you think: it would have ignored page altogether, and with it offset, so you'd have gotten the first page of results again.


@jgallagher wrote:

although pagination over a changing dataset is always a little tricky

This is true! To get specific: using what that blog post calls keyset pagination, if a client scans a collection in the usual way, it can be sure that it saw any item that existed for the duration of the scan and whose position in the sorted collection (relative to others) didn't change during the scan (e.g., if you're sorting by name, you can think of this as "it wasn't renamed"). There are still plenty of caveats: if an item is added, removed, or reordered (renamed) during the scan, the client may or may not see it. That's usually manageable, though. Limit/offset is much worse. Say a client has retrieved 3 of 10 pages (pages 1, 2, and 3) and then an item is removed from page 1. The client will proceed to request page 4 and miss the item that was at the top of page 4, which is now at the bottom of page 3. The element that the client misses was not changed by the mutation at all -- it was not added, removed, nor reordered/renamed during the scan, yet the client missed it. That's pretty hard to deal with.

wez commented 3 months ago

You used the query parameter page, but it's called page_token

Ah, I see my mistake. I think this is because the example code for the server doesn't really show those parameters, and the docs around the pagination types, beyond the brief mention in the "Support for paginated resources" section, don't really mention page_token. That makes it easy to overlook when flipping back and forth between the examples and sections of the docs; there are quite a few distinct pages to flip between.

[other commentary on pagination]

I understand the rationale, thanks for clarifying!

My perception is that the design is (justifiably) opinionated about the approach, while also being very hands-off about pointing people to the intended path in the docs, which leaves a bit of a wide gulf for the casual/external consumer to cross. If you don't mind me (literally) typing this comment from my armchair(!), my suggestion would be to briefly explain in the docs that the intent is to use something like a cursor-based approach because of the performance characteristics of a simplistic LIMIT, OFFSET, so that at least part of that wonderful long private-module comment is more discoverable.

I think it would also be very helpful to include an appropriate RDBMS example in the repo, because the existing pagination examples are all a bit far removed from how most "real" data sources work, which I think adds to the difficulty in more smoothly adopting this interface.

Stating things a bit more succinctly: I think the current docs and examples don't really support/serve someone coming in cold who wants to hook up what they might perceive to be a "simple" RDBMS-backed query, so it would be great to give a bit more of a directed example with just a little bit of exposition about why, so that it is easier for them to fall into the pit of success.

davepacheco commented 3 months ago

You used the query parameter page, but it's called page_token

Ah, I see my mistake. I think this is because the example code for the server doesn't really show those parameters, and the docs around the pagination types, beyond the brief mention in the "Support for paginated resources" section, don't really mention page_token. That makes it easy to overlook when flipping back and forth between the examples and sections of the docs; there are quite a few distinct pages to flip between.

I hear that. Any ideas to improve it? When writing this, I felt like it was hard to describe more concisely than a complete example, but that's complicated enough to be a whole program which means it's sort of outside the main docs. (Is there a spot where the docs refer to it as page?)

I think it would also be very helpful to include an appropriate RDBMS example in the repo, because the existing pagination examples are all a bit far removed from how most "real" data sources work, which I think adds to the difficulty in more smoothly adopting this interface.

Hmm. The intent of the pagination-basic example is exactly that, using a BTreeMap as a stand-in for the RDBMS. The idea was that the only things you'd change for a "real" RDBMS would be https://github.com/oxidecomputer/dropshot/blob/0560bedfb265e751c9f4bf4ea15a4bac85678f6b/dropshot/examples/pagination-basic.rs#L79-L83 and https://github.com/oxidecomputer/dropshot/blob/0560bedfb265e751c9f4bf4ea15a4bac85678f6b/dropshot/examples/pagination-basic.rs#L86-L90. (I should check if that's actually true, based on what we did in Omicron.) Maybe it would help if we created some named types to make that clearer? e.g., a struct Database that contains the BTreeMap, with methods that look like RDBMS methods? The flip side is that it adds complexity to the example.
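
Something like this, maybe (an illustrative sketch, not code that exists in the example today):

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

#[derive(Clone)]
struct Project {
    name: String,
}

/// Wraps the example's BTreeMap so that fetching a page reads like the
/// query you'd issue against a real database.
struct Database {
    projects: BTreeMap<String, Project>,
}

impl Database {
    /// Analogous to `SELECT * FROM projects WHERE name > ? ORDER BY name LIMIT ?`.
    fn projects_after(&self, name: &str, limit: usize) -> Vec<Project> {
        self.projects
            .range((Bound::Excluded(name.to_string()), Bound::Unbounded))
            .take(limit)
            .map(|(_, project)| project.clone())
            .collect()
    }
}
```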

We could have an example that accesses a real RDBMS? That'd presumably add more dependencies, and a lot of the code would not be relevant for other RDBMSes, but maybe that's worth it if it makes the intended usage clearer.

davepacheco commented 3 months ago

I hear that. Any ideas to improve it? When writing this, I felt like it was hard to describe more concisely than a complete example, but that's complicated enough to be a whole program which means it's sort of outside the main docs. (Is there a spot where the docs refer to it as page?)

Replying to myself: the more complicated pagination example includes in a block comment an example of running through it with curl: https://github.com/oxidecomputer/dropshot/blob/main/dropshot/examples/pagination-multiple-sorts.rs#L15-L63

The simple pagination example just says "Try passing different values of the limit query parameter. Try passing the next page token from the response as a query parameter, too." Maybe being more explicit in the simple example would be useful.
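
e.g., spelling out the whole flow (illustrative; the port comes from the example's startup output and the token from the previous response's next_page field):

$ curl "http://127.0.0.1:$PORT/projects?limit=3"
...first three projects, plus a "next_page" token...
$ curl "http://127.0.0.1:$PORT/projects?limit=3&page_token=$NEXT_PAGE"
...the next three projects...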