Thanks @boshek
So you want to get the full url with query parameters BEFORE the request is sent? What's the use case? Maybe we could add a fxn to Paginator to give back the full URLs?
The use case, to me, is developing a client to access an API (especially one that might be poorly documented): having a means to check the urls as you iteratively make calls, especially with pagination, is super useful while developing your R package. It more closely connects the code you are writing to the specs of the API.
> full url with query parameters BEFORE the request is sent
To me it doesn't matter if it is before or after, just that I can easily see what is happening.
> Maybe we could add a fxn to Paginator to give back the full URLs
To me the path of least resistance would be to have Paginator behave the same way as HttpClient. So to crib the example above, something as "simple" as this:
```r
cc$url
#> [1] "http://geo.weather.gc.ca/geomet-beta/features/collections/hydrometric-daily-mean/items/?startindex=500"
#> [2] "http://geo.weather.gc.ca/geomet-beta/features/collections/hydrometric-daily-mean/items/?startindex=1000"
#> [3] "http://geo.weather.gc.ca/geomet-beta/features/collections/hydrometric-daily-mean/items/?startindex=1500"
```
That would mimic the original behaviour of HttpClient, which would then make it seamless for a user. Not having experience programming in R6, I can't say for certain how challenging this is.
Thanks.
To be clear, the full URL doesn't come from HttpClient but, after making the request, from the HttpResponse object. So in that case the full url is only available after the request is made.
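For reference, a minimal sketch of what that looks like (httpbin.org is just a stand-in endpoint here): once the verb method has run, the fully constructed URL is available on the HttpResponse object.

```r
library(crul)

cli <- HttpClient$new(url = "https://httpbin.org")
# path and query are supplied at request time; after the request,
# the full URL (path + query string) is on the HttpResponse object
res <- cli$get(path = "get", query = list(startindex = 500))
res$url
#> [1] "https://httpbin.org/get?startindex=500"
```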
Do you know about verbose curl output? Is it too verbose? i.e., you just want the URL / want to be able to control the output? e.g. (just showing the headers):
```r
cc <- HttpClient$new('https://scottchamberlain.info')
cc$get(verbose = TRUE)
> GET / HTTP/1.1
Host: scottchamberlain.info
User-Agent: libcurl/7.54.0 r-curl/3.2 crul/0.6.0
Accept-Encoding: gzip, deflate
Accept: application/json, text/xml, application/xml, */*
< HTTP/1.1 200 OK
< Cache-Control: public, max-age=0, must-revalidate
< Content-Type: text/html; charset=UTF-8
< Date: Mon, 17 Sep 2018 17:09:31 GMT
< Etag: "229c5df55965674706e3ebfbaa3ae0c4-ssl-df"
< Strict-Transport-Security: max-age=31536000
< Content-Encoding: gzip
< Content-Length: 2460
< Age: 100652
< Connection: keep-alive
< Server: Netlify
< Vary: Accept-Encoding
< X-NF-Request-ID: 9749070f-201f-451e-8c75-f394c71a3ea4-12040280
<
* Connection #0 to host scottchamberlain.info left intact
```
So verbose output is definitely full of info, including what I want. That is pretty nice actually. It gets a little out of control with a paginated request in terms of the volume of output, but for this exact use case it is more than sufficient. We can close this unless you intend to implement something a little less verbose.
Thanks!
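As an aside on keeping that output manageable: the verbose flag can also be set once as a curl option on the client (via the `opts` argument to `HttpClient$new()`) rather than per call; a Paginator built on that client makes its requests through it, so the same verbose output should apply to each paginated call. A minimal sketch:

```r
library(crul)

# set curl options once on the client instead of per request;
# a Paginator wrapping this client should pick up the same options
cli <- HttpClient$new(
  url = "https://scottchamberlain.info",
  opts = list(verbose = TRUE)
)
res <- cli$get()
```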
I wanted to see if you were aware of curl options (in particular the verbose option) - BUT, it is very verbose, and is a lot more information than just the URL, so:
I'll try a function to get full URLs before the request is made - however, I just realized an issue with the full url: any additional url paths and the query params are passed in to the HTTP verb function calls (e.g., get), so we don't have all the information needed to construct URLs anyway before the HTTP verb function is called.
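To make that constraint concrete, a small sketch (again with httpbin.org as a stand-in): the client object only knows the base URL; the extra path and the query parameters only arrive as arguments to the verb method.

```r
library(crul)

cli <- HttpClient$new(url = "https://httpbin.org")

# the client alone only knows the base URL...
cli$url
#> [1] "https://httpbin.org"

# ...the path and query params are only supplied when the verb is called,
# which is why the full URL can't be assembled from the client beforehand
res <- cli$get(path = "get", query = list(foo = "bar"))
res$url
#> [1] "https://httpbin.org/get?foo=bar"
```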
Yeah, thank you for pointing out the curl options. That was a 💡 for me.
I think URLs after the call are still useful given it is likely to be challenging to get them before, though I guess if the HTTP verb call fails you won't know what was even tried.
@boshek can you reinstall? see https://github.com/ropensci/crul/blob/master/R/paginator.R#L116-L118
Yep this is exactly it. Works for me for both HttpClient and Paginator. Thanks @sckott
cool, glad it works. need to add some tests and such still to make sure it's working as expected
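For anyone landing on this later, here is a sketch of what checking the URLs up front could look like, assuming the method added in the linked paginator.R change is the `url_fetch()` that appears in later crul documentation; the geomet endpoint and pagination values below are taken from earlier in this thread and are assumptions, not the original code:

```r
library(crul)

cli <- HttpClient$new(url = "http://geo.weather.gc.ca")
cc <- Paginator$new(client = cli, limit_param = "limit",
                    offset_param = "startindex", limit = 1500, chunk = 500)

# construct the URLs that would be requested, without sending anything
cc$url_fetch(path = "geomet-beta/features/collections/hydrometric-daily-mean/items/")
```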
I am trying to see if I can extract the url(s) for the GET request from a paginated request. I think the reprex below illustrates the question:
Created on 2018-09-17 by the reprex package (v0.2.1)
So then I would like to use pagination:
But if I request the link I only get the base url:
The API itself does provide the urls, so this is an example of what I am after, but I would like to get these before they head off via the GET request:
Created on 2018-09-17 by the reprex package (v0.2.1)
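As a rough sketch of the pagination setup being described (the endpoint and the startindex/limit values are inferred from the URLs quoted earlier in the thread, and the Paginator argument names follow the crul docs, so treat this as an approximation rather than the original reprex code):

```r
library(crul)

cli <- HttpClient$new(url = "http://geo.weather.gc.ca")
cc <- Paginator$new(client = cli, limit_param = "limit",
                    offset_param = "startindex", limit = 1500, chunk = 500)

# before any request is made, only the base url is visible on the client;
# the per-page URLs (…?startindex=500, ?startindex=1000, …) are what the
# question is asking to see ahead of the GET requests
cli$url
#> [1] "http://geo.weather.gc.ca"
```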
So my question: is there any way in `crul` to get the urls for a GET request, much like an unpaginated request?

Session Info
```r
Session info ---------------------------------------------------------------
 setting  value
 version  R version 3.5.1 (2018-07-02)
 system   x86_64, mingw32
 ui       RStudio (1.2.992)
 language (EN)
 collate  English_Canada.1252
 tz       America/Los_Angeles
 date     2018-09-17

Packages -------------------------------------------------------------------
 package    * version    date       source
 assertthat   0.2.0      2017-04-11 CRAN (R 3.5.1)
 backports    1.1.2      2017-12-13 CRAN (R 3.5.0)
 base       * 3.5.1      2018-07-02 local
 base64enc    0.1-3      2015-07-28 CRAN (R 3.5.0)
 callr        3.0.0      2018-08-24 CRAN (R 3.5.1)
 clipr        0.4.1      2018-06-23 CRAN (R 3.5.1)
 compiler     3.5.1      2018-07-02 local
 crayon       1.3.4      2017-09-16 CRAN (R 3.5.1)
 crul       * 0.6.0      2018-07-10 CRAN (R 3.5.1)
 curl         3.2        2018-03-28 CRAN (R 3.5.1)
 datasets   * 3.5.1      2018-07-02 local
 devtools   * 1.13.6     2018-06-27 CRAN (R 3.5.1)
 digest       0.6.17     2018-09-12 CRAN (R 3.5.1)
 evaluate     0.11       2018-07-17 CRAN (R 3.5.1)
 fs           1.2.6      2018-08-23 CRAN (R 3.5.1)
 glue         1.3.0      2018-09-04 Github (tidyverse/glue@4e74901)
 graphics   * 3.5.1      2018-07-02 local
 grDevices  * 3.5.1      2018-07-02 local
 htmltools    0.3.6      2017-04-28 CRAN (R 3.5.1)
 httpcode     0.2.0      2016-11-14 CRAN (R 3.5.0)
 jsonlite     1.5        2017-06-01 CRAN (R 3.5.1)
 knitr        1.20       2018-02-20 CRAN (R 3.5.1)
 lobstr     * 0.0.0.9000 2018-07-20 Github (r-lib/lobstr@a80d8f8)
 magrittr     1.5        2014-11-22 CRAN (R 3.5.1)
 memoise      1.1.0      2017-04-21 CRAN (R 3.5.1)
 methods    * 3.5.1      2018-07-02 local
 processx     3.2.0      2018-08-16 CRAN (R 3.5.1)
 ps           1.1.0      2018-08-10 CRAN (R 3.5.1)
 R6           2.2.2      2017-06-17 CRAN (R 3.5.1)
 Rcpp         0.12.18    2018-07-23 CRAN (R 3.5.1)
 reprex       0.2.1      2018-09-16 CRAN (R 3.5.1)
 rlang        0.2.2      2018-08-16 CRAN (R 3.5.1)
 rmarkdown    1.10       2018-06-11 CRAN (R 3.5.1)
 rprojroot    1.3-2      2018-01-03 CRAN (R 3.5.1)
 rstudioapi   0.7        2017-09-07 CRAN (R 3.5.1)
 stats      * 3.5.1      2018-07-02 local
 stringi      1.2.4      2018-07-20 CRAN (R 3.5.1)
 stringr      1.3.1      2018-05-10 CRAN (R 3.5.1)
 testthat   * 2.0.0      2017-12-13 CRAN (R 3.5.1)
 tools        3.5.1      2018-07-02 local
 triebeard    0.3.0      2016-08-04 CRAN (R 3.5.1)
 urltools     1.7.1      2018-08-03 CRAN (R 3.5.1)
 usethis    * 1.4.0      2018-08-14 CRAN (R 3.5.1)
 utils      * 3.5.1      2018-07-02 local
 whisker      0.3-2      2013-04-28 CRAN (R 3.5.1)
 withr        2.1.2      2018-03-15 CRAN (R 3.5.1)
```