r-lib / httr2

Make HTTP requests and process their responses. A modern reimagining of httr.
https://httr2.r-lib.org

`iterate_next_request()` -> `resp_next_request()` #341

Closed hadley closed 10 months ago

hadley commented 11 months ago

This would mean changing the arguments so that it takes a request as the first argument, and maybe automatically attaching the parsed data and next_request call back to that request.

That possibly implies that registering a parser is a more general feature, which we could possibly implement lazily. i.e. you'd register a parser in the request (with req_parser()?), and then call resp_parsed() (or maybe resp_body_parsed()) to access the parsed body. We'd need to add an environment to the response so that we could do the parsing once and cache it. And then maybe have some helper that does resp_parse(req_perform(req))? And something similar for req_perform_parallel()? That would make req_perform_iterate() fit in better.
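
Very roughly, that might look something like this (req_parser() and resp_parsed() are placeholder names, and the cache environment and the response keeping a reference to its request are assumptions about internals, not how httr2 works today):

req_parser <- function(req, parser) {
  # remember the parser on the request so it can travel with the response
  req$policies$parser <- parser
  req
}

resp_parsed <- function(resp) {
  # assumes the response carries an environment (resp$cache) and a reference
  # to its request (resp$request), so the body is parsed at most once
  cache <- resp$cache
  if (!exists("parsed", envir = cache, inherits = FALSE)) {
    cache$parsed <- resp$request$policies$parser(resp)
  }
  cache$parsed
}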

@mgirlich these are very quick and rough notes, but does this make sense to you?

hadley commented 11 months ago

Actually, since we're introducing req_perform_parallel() we could make it automatically parse, just like req_perform_iterate().

mgirlich commented 11 months ago

I agree that it might make sense to add a separate req_parser(). The only thing I'm not quite sure about is documentation and return type. For req_paginate() the output of parse_resp() should be a list with a field data and any other fields that are needed for the pagination to work. For req_batch() and req_perform_parallel() returning a list with a field data would feel artificial.

And we definitely need a helper like resp_body_parsed(). When I wanted to document how to manually iterate through pages I was missing this.

Automatically parsing and vec_c()ing the results in req_perform_parallel() and req_perform_iteratively() is awesome, so we should definitely do that.

This would mean changing the arguments so that it takes a request as the first argument, and maybe automatically attaching the parsed data and next_request call back to that request.

It currently takes a request as the first argument. Did you mean response?

hadley commented 11 months ago

Yeah, sorry, a response as a first argument

hadley commented 11 months ago

Maybe we need req_parse_data() and req_parse_metadata()? Or two arguments to req_parse()? But that's not quite right for the case (like GitHub) where a paginated request returns both data and metadata in the JSON. So maybe the req_parse() function should return a named list containing data and metadata? But then that feels weird for the straightforward req_perform_parallel() case. Hmmmm.

(Also I think it makes sense for req_perform() to automatically parse the result if you've used req_parse(). That feels a little type-unstable, but it's very convenient for the other perform functions and will be backward compatible).
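
For illustration, from the user's side that might look like this (req_parse() is still hypothetical here):

req <- request("https://example.com/data") %>%
  req_parse(resp_body_json)

# with a parser registered, req_perform() would return the parsed body
# directly instead of a response object
data <- req_perform(req)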

hadley commented 11 months ago

Or maybe we could do iteration by making req_parse() return a special object, e.g.

# cursor pagination; metadata in body
request("https://example.com/cursor") %>%
  req_url_query(per_page = 100) %>%
  req_parse(function(resp, req) {
    json <- resp_body_json(resp)
    iterative_response(
      data = json$data,
      next_req = req %>% req_url_param(resp_link, cursor = json$next_cursor),
      total = json$count
    )
  })

# keyset/seek pagination; metadata in headers
request("https://example.com/keyset") %>%
  req_parse(function(resp, req) {
    data <- resp_body_json(resp)
    iterative_response(
      data = data,
      next_req = req %>% req_url_param(after_id = max(data$id)),
      total = resp %>% resp_header()
    )
  })

# HATEOAS pagination (e.g. GitHub); metadata in headers
request("https://api.github.com/repositories") %>%
  req_url_query(per_page = 100) %>%
  req_parse(function(resp, req) {
    iterative_response(
      data = resp_body_json(resp),
      next_req = req %>% req_url(resp_link_url(resp, "next")),
      total = resp %>% resp_link_url("last") %>% url_parse() %>% .$query$page
    )
  })

Offset pagination doesn't work well with this framing, but that might make sense because offset pagination is random access, so it's really a better fit for req_perform_parallel().
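
For comparison, offset pagination treated as random access would just be a list of requests built up front; a minimal sketch, assuming you already know how many pages there are:

page_size <- 100
n_pages <- 10  # in practice this would come from a first request

reqs <- lapply(seq_len(n_pages), function(i) {
  request("https://example.com/offset") %>%
    req_url_query(per_page = page_size, offset = (i - 1) * page_size)
})
resps <- req_perform_parallel(reqs)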

hadley commented 11 months ago

Best article I found on pagination was https://ignaciochiazzo.medium.com/paginating-requests-in-apis-d4883d4c1c4c, as it includes a bunch of links to various styles in popular APIs. From the Hacker News comments:

So cursor pagination is just keyset pagination, except instead of using a column value directly you use an opaque string that gets translated back to a column value. So you can modify how the cursor gets translated back to a column value anytime without breaking API compatibility.

Good write up of why offset pagination is suboptimal on the backend: https://use-the-index-luke.com/no-offset

mgirlich commented 11 months ago

Having iterative_response() is elegant, but as you said it doesn't work well with offset pagination (or page-number pagination, which is basically offset pagination too). My bigger concern is how a user would discover/remember this compared to req_paginate().

hadley commented 11 months ago

I agree that it’s a bit less discoverable, but it’s simpler and more general, and I think we could make up the difference with docs.

mgirlich commented 11 months ago

We could also add iterative_response() and still keep req_paginate() which would basically be a wrapper around parse_resp() + iterative_response().

And do you have an idea of how we want to handle page/offset pagination? It isn't really iterative, but you need to perform the first request to find out how many pages there are, so creating a list of requests up front doesn't work nicely either.

hadley commented 11 months ago

Let me noodle on this a bit more. I had forgotten about the wrinkle of requiring one request to figure out the total number of pages: it's like an iterative request that then spawns parallel requests.

It also feels like there's maybe some connection to https://github.com/tidyverse/rvest/issues/193, where you want to crawl/spider a bunch of pages. There it makes more sense to think about a queue where each request could add one or more pages to the queue, which are then potentially processed in parallel. That is a less common pattern for APIs, but a queue is a nicely general structure.
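
A toy sketch of the queue idea (serial here for simplicity; next_reqs() is a hypothetical callback that returns zero or more follow-up requests):

crawl <- function(first_req, next_reqs) {
  queue <- list(first_req)
  resps <- list()

  while (length(queue) > 0) {
    # pop the next request off the queue
    req <- queue[[1]]
    queue <- queue[-1]

    resp <- req_perform(req)
    resps[[length(resps) + 1]] <- resp

    # each response can enqueue any number of new requests
    queue <- c(queue, next_reqs(resp, req))
  }
  resps
}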

hadley commented 11 months ago

Obviously we'd want some higher level wrapper, but maybe the internals could look something like this:

page_size <- 100

# Paginated query
request("https://example.com/paginated") %>%
  req_url_query(per_page = page_size) %>%
  req_parse(function(resp, req) {
    json <- resp_body_json(resp)
    index <- seq_len(ceiling(json$count / page_size))
    pages <- lapply(index, function(i) {
      req %>% 
        req_parse(\(resp) resp_body_json(resp)$data) %>% 
        req_url_param(resp_link, index = i)
    })

    multi_response(
      data = json$data,
      responses = pages
    )
  })

hadley commented 10 months ago

I think I've swung back towards the idea of providing a way to register a parser so that you can call resp_body() and it's parsed lazily and at most once. Something like this maybe:

req <- req_parse_resp(req, resp_body_json)
resp <- req_perform(req)
# actually parses:
resp_body(resp)
# uses cached value:
resp_body(resp)

Alternatively we could just make each of the existing resp_body_*() functions cache its result so that (e.g.) no matter how many times you call resp_body_json() it only ever has to parse the data once. That's appealing because it doesn't require introducing a new pair of functions, but it would make it a little harder for the user to extend with custom parsers (but that seems relatively low priority given that the majority of modern APIs use JSON).
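
Roughly, the caching variant might look like this (assuming, hypothetically, that every response carries a cache environment; that's not how responses work today):

resp_body_json_cached <- function(resp, ...) {
  cache <- resp$cache
  if (is.null(cache$json)) {
    # first call parses the raw body...
    cache$json <- jsonlite::fromJSON(rawToChar(resp$body), simplifyVector = FALSE, ...)
  }
  # ...subsequent calls reuse the cached value
  cache$json
}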


The next change I'd suggest is that instead of providing the callback to generate the next request in req_paginate(), we move it to req_perform_iterative() and allow it to return either a single request or a list of requests:

req <- request("https://example.com/cursor") |>
  req_url_query(per_page = 100) |>
  req_parse_body(resp_body_json) 

# cursor
resps <- req_perform_iterative(req, function(resp, req) {
  body <- resp_body(resp)
  if (is.null(body$next_cursor)) {
    return()
  }

  req |> req_url_param(cursor = body$next_cursor)
})

# indexed
resps <- req_perform_iterative(req, function(resp, req) {
  body <- resp_body(resp)

  index <- seq_len(ceiling(body$count / page_size))[-1]
  lapply(index, \(i) req |> req_url_param(resp_link, index = i))
})

This formulation still gives us the ability to write wrappers for common situations, e.g.:

paginate_cursor <- function(param_name, param_value) {
  check_string(param_name)
  check_function2(param_value, "resp")

  function(resp, req) {
    value <- param_value(resp)
    if (is.null(value)) {
      return()
    }

    req |> req_url_param(!!param_name := value)
  }
}

resps <- req_perform_iterative(req, paginate_cursor("cursor", \(body) body$cursor))

Then req_perform_iterative() would return a list (or maybe a richer object) and we'd provide helpers to interact with it:

resps_combine <- function(resps, data) {
  check_function2(data, "body")

  resps <- resps_responses(resps)
  body <- lapply(resps, resp_body)
  data <- lapply(body, data)
  vctrs::vec_c(!!!data)
}
resps_is_resp <- function(resps) {
  vapply(resps, is_response, logical(1))
}
resps_responses <- function(resps) {
  resps[resps_is_resp(resps)]
}
resps_errors <- function(resps) {
  resps[!resps_is_resp(resps)]
}

That would make error handling much more general because we just put the tools in the hands of the user.
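
Usage might then look something like this (next_req being one of the callbacks sketched above):

resps <- req_perform_iterative(req, next_req)

# split captured errors from successful responses and combine the data
data <- resps_combine(resps, function(body) body$data)
failures <- resps_errors(resps)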


Open questions:

mgirlich commented 10 months ago

Some more questions for req_perform_iterative():

hadley commented 10 months ago

Thanks for taking a look! Since you're on board with the plan I'll start turning it into PRs.

Currently I'd think that req_perform_iterative() would work in serial. But I'll see how it goes when I implement it. If parallel is easy, it seems like it would be worthwhile.

I hadn't thought about path. I like your idea of a simple glue spec, and then maybe it could also take a function that accepts a req if you want something more complex.
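
So, very roughly, something like this (the path argument and the {i} placeholder are just a sketch at this point):

resps <- req_perform_iterative(
  req,
  next_req,                     # e.g. one of the callbacks above
  path = "pages/page-{i}.json"  # "{i}" interpolated with the request number
)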

hadley commented 10 months ago

Started an implementation and it feels really good. Because the next_req callback no longer belongs to the page, it makes implementing paged iteration much easier:

iterate_by_page <- function(param_name, start = 1, offset = 1) {
  check_string(param_name)
  check_number_whole(start)
  check_number_whole(offset, min = 1)

  i <- start
  function(resp, req) {
    old_i <- i
    i <<- i + offset

    req %>% req_url_query(!!param_name := old_i)
  }
}

We couldn't use a technique like this before because the current page state would be shared across all requests.
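
Usage would then look something like this (with the hypothetical req_parse_body() from above):

req <- request("https://example.com/paged") %>%
  req_url_query(per_page = 100) %>%
  req_parse_body(resp_body_json)

# the closure advances the page counter each time the callback is called
resps <- req_perform_iterative(req, iterate_by_page("page"))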