ropensci / openalexR

Getting bibliographic records from OpenAlex
https://docs.ropensci.org/openalexR/

Long query URL gives error in `oa_request()` but works in browser #216

Open rkrug opened 7 months ago

rkrug commented 7 months ago

I have an extremely long search query which works in the browser.

But when I run the same URL through oa_request(), it fails with a JSON parsing error:

library(devtools)
#> Loading required package: usethis
library(openalexR)
#> Thank you for using openalexR!
#> To acknowledge our work, please cite the package by calling `citation("openalexR")`.
#> To suppress this message, add `openalexR.message = suppressed` to your .Renviron file.
oa_request(
    query_url = "https://api.openalex.org/works?page=1&filter=title_and_abstract.search:%22Agriculture+reform%22+OR+%22ocean+reform%22+OR+%22energy+reform%22+OR+%22decarbonization%22+OR+%22Eco-friendly+Subsidies%22+OR+%22Green+Subsidies%22+OR+%22Polluter+Pays+Principle%22+OR+%22Environmental+Externalities%22+OR+%22Biodiversity+Offsetting%22+OR+%22Conservation+Finance%22+OR+%22Payment+for+Ecosystem+Services%22+OR+%22Agri-environmental+Schemes%22+OR+%22Cross-compliance%22+OR+%22Eco-taxes%22+OR+%22Sustainable+Agriculture+Incentives%22+OR+%22Carbon+Pricing%22+OR+%22Biodiversity+Credits%22+OR+%22Habitat+Banking%22+OR+%22Rewilding+Incentives%22+OR+%22Green+Bonds%22+OR+%22Ecological+Fiscal+Transfers%22+OR+%22Renewable+Energy+Subsidies%22+OR+%22Water+Quality+Trading%22+OR+%22Sustainable+Fisheries+Subsidies%22+OR+%22Green+Certification+Schemes%22+OR+%22Conservation+Easements%22+OR+%22Environmental+Impact+Bonds%22+OR+%22Climate+Smart+Agriculture%22+OR+%22Natural+Capital+Financing%22+OR+%22Bioenergy%22+OR+%22Forest+Carbon+Credits%22+OR+%22Blue+Carbon+Initiatives%22+OR+%22Green+Public+Procurement%22+OR+%22Integrated+Pest+Management+Incentives%22+%22Wildlife+Corridors+Funding%22+OR+%22Biodiversity+Banking%22+OR+%22Climate+Adaptation+Finance%22+OR+%22Deforestation+Reduction+Programs%22+OR+%22Environmental+Risk+Assessment%22+OR+%22Green+Infrastructure+Investments%22+OR+%22High+Conservation+Value+Incentives%22+OR+%22Landscape+Restoration+Funds%22+OR+%22Marine+Protected+Areas+Support%22+OR+%22Natural+Resource+Management%22+OR+%22Organic+Farming+Subsidies%22+OR+%22Permaculture+Design+Grants%22+OR+%22Pollination+Services+Payments%22+OR+%22Protected+Area+Financing%22+OR+%22Regenerative+Agriculture+Support%22+OR+%22Sustainability+Linked+Loans%22+OR+%22Urban+Greening+Grants%22+OR+%22Wetlands+Restoration+Funding%22+OR+%22Zero+Emission+Vehicle+Incentives%22+OR+%22Adaptive+Management+Practices%22+OR+%22Biodiversity+Informatics%22+OR+%22Climate+Bonds%22+OR+%22Debt-for-Nature+Swap%22+OR+%22Ecosystem-Based+Adaptation%22+OR+%22Forest+Stewardship+Council+Certification%22+OR+%22Greenhouse+Gas+Inventory%22+%22Habitat+Restoration+Grants%22+OR+%22Invasive+Species+Control+Funding%22+OR+%22Land+Degradation+Neutrality+Fund%22+OR+%22Mitigation+Banking%22+OR+%22Non-Timber+Forest+Product+Incentives%22+%22Ocean+Acidification+Research+Grants%22+OR+%22Pollinator+Habitat+Enhancement%22+OR+%22Renewable+Energy+Certificates%22+OR+%22Soil+Health+Improvement+Programs%22+OR+%22Tree+Planting+Campaigns%22+OR+%22Wildlife+Management+Areas%22+OR+%22Biodiversity+Strategy+and+Action+Plans%22+OR+%22Circular+Economy+Initiatives%22+OR+%22Disaster+Risk+Reduction+Funding%22+OR+%22DRR+Funding%22+OR+%22Ecosystem+Valuation%22+OR+%22Fisheries+Improvement+Projects%22+OR+%22Green+Job+Training+Programs%22+OR+%22Holistic+Management+Funding%22+OR+%22Indigenous+Peoples%27+Biodiversity+Conservation%22+OR+%22Landscape+Connectivity+Projects%22+OR+%22Mangrove+Restoration+Initiatives%22+OR+%22Nature-based+Solutions%22+OR+%22Organic+Certification+Cost+Share%22+OR+%22Peatland+Restoration+and+Management%22+OR+%22Quantitative+Easing+for+the+Planet%22+OR+%22Riparian+Buffer+Zones+Support%22+OR+%22Sustainable+Land+Management%22+OR+%22Threatened+Species+Recovery+Plans%22+OR+%22Urban+Biodiversity+Enhancement%22+OR+%22Vertical+Farming+Incentives%22+OR+%22Water+Efficiency+Programs%22+OR+%22Xeriscaping+Rebates%22+OR+%22Youth+Engagement+in+Conservation%22+OR+%22Zero-waste+Strategies%22+OR+%22Agrobiodiversity+Conservation+Subsidies%22+OR+%22Biochar+Production+Incentives%22+OR+%22Climate+Resilience+Building%22+OR+%22Drought+Management+Assistance%22+OR+%22Eco-labeling+Programs%22+OR+%22Functional+Biodiversity+Promotion%22+OR+%22Green+Supply+Chain+Financing%22+OR+%22Hedgerow+Restoration+Support%22+OR+%22Integrated+Water+Resources+Management+Funding%22+OR+%22Jungle+Restoration+Projects%22",
    verbose = TRUE
)
#> Error: lexical error: invalid char in json text.
#>                                        <html>   <head>     <title>Bad 
#>                      (right here) ------^

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       macOS Sonoma 14.4
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Zurich
#>  date     2024-03-08
#>  pandoc   3.1.12.2 @ /opt/homebrew/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.0)
#>  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.1)
#>  curl          5.2.1   2024-03-01 [1] CRAN (R 4.3.1)
#>  devtools    * 2.4.5   2022-10-11 [1] CRAN (R 4.3.0)
#>  digest        0.6.34  2024-01-11 [1] CRAN (R 4.3.1)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.3.0)
#>  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  fs            1.6.3   2023-07-20 [1] CRAN (R 4.3.0)
#>  glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.1)
#>  htmltools     0.5.7   2023-11-03 [1] CRAN (R 4.3.1)
#>  htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.3.1)
#>  httpuv        1.6.14  2024-01-26 [1] CRAN (R 4.3.1)
#>  httr          1.4.7   2023-08-15 [1] CRAN (R 4.3.0)
#>  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.1)
#>  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.1)
#>  later         1.3.2   2023-12-06 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.3.0)
#>  mime          0.12    2021-09-28 [1] CRAN (R 4.3.0)
#>  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
#>  openalexR   * 1.2.3   2023-11-16 [1] CRAN (R 4.3.1)
#>  pkgbuild      1.4.3   2023-12-10 [1] CRAN (R 4.3.1)
#>  pkgload       1.3.4   2024-01-16 [1] CRAN (R 4.3.1)
#>  profvis       0.3.8   2023-05-02 [1] CRAN (R 4.3.0)
#>  promises      1.2.1   2023-08-10 [1] CRAN (R 4.3.0)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.26.0  2024-01-24 [1] CRAN (R 4.3.1)
#>  R.utils       2.12.3  2023-11-18 [1] CRAN (R 4.3.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  Rcpp          1.0.12  2024-01-09 [1] CRAN (R 4.3.1)
#>  remotes       2.4.2.1 2023-07-18 [1] CRAN (R 4.3.0)
#>  reprex        2.1.0   2024-01-11 [1] CRAN (R 4.3.1)
#>  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.26    2024-03-05 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  shiny         1.8.0   2023-11-17 [1] CRAN (R 4.3.1)
#>  stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.1)
#>  stringr       1.5.1   2023-11-14 [1] CRAN (R 4.3.1)
#>  styler        1.10.2  2023-08-29 [1] CRAN (R 4.3.0)
#>  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.3.0)
#>  usethis     * 2.2.3   2024-02-19 [1] CRAN (R 4.3.1)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.1)
#>  withr         3.0.0   2024-01-16 [1] CRAN (R 4.3.1)
#>  xfun          0.42    2024-02-08 [1] CRAN (R 4.3.1)
#>  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.3.0)
#>  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.1)
#> 
#>  [1] /Users/rainerkrug/R/library/aarch64-apple-darwin20/4.3
#>  [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Created on 2024-03-08 with reprex v2.1.0

yjunechoe commented 7 months ago

Wow, this one is really, really weird. The first problem isn't even about the length of the query string. Minimal reprex:

query_substr <- "https://api.openalex.org/works?page=1&filter=title_and_abstract.search:%22Agriculture+reform%22+OR+%22ocean+reform%22"
oa_request(query_substr)
#> Warning in oa_request(query_substr): No records found!
#> list()
httr::GET(query_substr)
#> Response [https://api.openalex.org/works?page=1&filter=title_and_abstract.search:%22Agriculture+reform%22+OR+%22ocean+reform%22]
#>   Date: 2024-03-08 18:53
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 332 kB
#> {"meta":{"count":1717,"db_response_time_ms":222,"page":1,"per_page":25,"groups_count":null},"results":[{"id...

This happens because httr::GET() for some reason mangles the URL when we also pass query = ... (note the %3A and %2B in the response URL below). So with our per-page=1 default:

httr::GET(query_substr, query = list(`per-page` = 1))
#> Response [https://api.openalex.org/works?page=1&filter=title_and_abstract.search%3A%22Agriculture%2Breform%22%2BOR%2B%22ocean%2Breform%22&per-page=1]
#>   Date: 2024-03-08 18:57
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 115 B
#> {"meta":{"count":0,"db_response_time_ms":68,"page":1,"per_page":1,"groups_count":null},"results":[],"group_...

Essentially, GET() sees the " already encoded as %22, so it does not escape it with a backslash.
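Comparing the two response URLs printed above, the ":" has become %3A and every "+" has become %2B. That suggests the query string is decoded and then re-encoded once query = ... is supplied, so the API may be receiving literal plus signs instead of spaces. The base R sketch below only imitates such a round trip with URLdecode() and URLencode(); it is an assumption about what happens inside httr, not its actual code:

```r
# The filter value exactly as it appears in the working URL
filter_value <- "title_and_abstract.search:%22Agriculture+reform%22+OR+%22ocean+reform%22"

# Step 1: percent-decoding leaves "+" untouched (it is not turned into a space)
decoded <- URLdecode(filter_value)

# Step 2: re-encoding every reserved character also encodes ":" and "+"
reencoded <- URLencode(decoded, reserved = TRUE)

print(decoded)
print(reencoded)

# Possible way to sidestep the round trip (untested assumption): append the
# extra parameter to the URL string by hand instead of passing query = ...
query_substr <- paste0("https://api.openalex.org/works?page=1&filter=", filter_value)
manual_url <- paste0(query_substr, "&per-page=1")
print(manual_url)
```

If this reading is right, the zero count would come from the encoded plus signs rather than from the quotes. That would also fit the counts further down: the backslash-escaped URL returns 35789 works where the untouched URL returned 1717, so the two do not seem to be the same search.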

So instead of this url from above:

bad_url <- "https://api.openalex.org/works?page=1&filter=title_and_abstract.search%3A%22Agriculture%2Breform%22%2BOR%2B%22ocean%2Breform%22&per-page=1"

GET() should instead be sending something like this:

good_url <- "https://api.openalex.org/works?page=1&filter=title_and_abstract.search:%5C%22Agriculture+reform%5C%22+OR+%5C%22ocean+reform%5C%22&per-page=1"
httr::GET(good_url)
#> Response [https://api.openalex.org/works?page=1&filter=title_and_abstract.search:%5C%22Agriculture+reform%5C%22+OR+%5C%22ocean+reform%5C%22&per-page=1]
#>   Date: 2024-03-08 19:33
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 9.69 kB
#> {"meta":{"count":35789,"db_response_time_ms":338,"page":1,"per_page":1,"groups_count":null},"results":[{"id...

One hacky way around that is to add the backslash (%5C) and make sure the URL is decoded before GET() sees it:

httr::GET(
  URLdecode(gsub("%22", "%5C%22", bad_url))
)
#> Response [https://api.openalex.org/works?page=1&filter=title_and_abstract.search:\"Agriculture+reform\"+OR+\"ocean+reform\"&per-page=1]
#>   Date: 2024-03-08 19:30
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 9.69 kB
#> {"meta":{"count":35789,"db_response_time_ms":338,"page":1,"per_page":1,"groups_count":null},"results":[{"id...

So for your reprex, you can reformat your URL:

query_url <- "https://api.openalex.org/works?page=1&filter=title_and_abstract.search:%22Agriculture+reform%22+OR+%22ocean+reform%22+OR+%22energy+reform%22+OR+%22decarbonization%22+OR+%22Eco-friendly+Subsidies%22+OR+%22Green+Subsidies%22+OR+%22Polluter+Pays+Principle%22+OR+%22Environmental+Externalities%22+OR+%22Biodiversity+Offsetting%22+OR+%22Conservation+Finance%22+OR+%22Payment+for+Ecosystem+Services%22+OR+%22Agri-environmental+Schemes%22+OR+%22Cross-compliance%22+OR+%22Eco-taxes%22+OR+%22Sustainable+Agriculture+Incentives%22+OR+%22Carbon+Pricing%22+OR+%22Biodiversity+Credits%22+OR+%22Habitat+Banking%22+OR+%22Rewilding+Incentives%22+OR+%22Green+Bonds%22+OR+%22Ecological+Fiscal+Transfers%22+OR+%22Renewable+Energy+Subsidies%22+OR+%22Water+Quality+Trading%22+OR+%22Sustainable+Fisheries+Subsidies%22+OR+%22Green+Certification+Schemes%22+OR+%22Conservation+Easements%22+OR+%22Environmental+Impact+Bonds%22+OR+%22Climate+Smart+Agriculture%22+OR+%22Natural+Capital+Financing%22+OR+%22Bioenergy%22+OR+%22Forest+Carbon+Credits%22+OR+%22Blue+Carbon+Initiatives%22+OR+%22Green+Public+Procurement%22+OR+%22Integrated+Pest+Management+Incentives%22+%22Wildlife+Corridors+Funding%22+OR+%22Biodiversity+Banking%22+OR+%22Climate+Adaptation+Finance%22+OR+%22Deforestation+Reduction+Programs%22+OR+%22Environmental+Risk+Assessment%22+OR+%22Green+Infrastructure+Investments%22+OR+%22High+Conservation+Value+Incentives%22+OR+%22Landscape+Restoration+Funds%22+OR+%22Marine+Protected+Areas+Support%22+OR+%22Natural+Resource+Management%22+OR+%22Organic+Farming+Subsidies%22+OR+%22Permaculture+Design+Grants%22+OR+%22Pollination+Services+Payments%22+OR+%22Protected+Area+Financing%22+OR+%22Regenerative+Agriculture+Support%22+OR+%22Sustainability+Linked+Loans%22+OR+%22Urban+Greening+Grants%22+OR+%22Wetlands+Restoration+Funding%22+OR+%22Zero+Emission+Vehicle+Incentives%22+OR+%22Adaptive+Management+Practices%22+OR+%22Biodiversity+Informatics%22+OR+%22Climate+Bonds%22+OR+%22Debt-for-Nature+Swap%22+OR+%22Ecosystem-Based+Adaptation%22+OR+%22Forest+Stewardship+Council+Certification%22+OR+%22Greenhouse+Gas+Inventory%22+%22Habitat+Restoration+Grants%22+OR+%22Invasive+Species+Control+Funding%22+OR+%22Land+Degradation+Neutrality+Fund%22+OR+%22Mitigation+Banking%22+OR+%22Non-Timber+Forest+Product+Incentives%22+%22Ocean+Acidification+Research+Grants%22+OR+%22Pollinator+Habitat+Enhancement%22+OR+%22Renewable+Energy+Certificates%22+OR+%22Soil+Health+Improvement+Programs%22+OR+%22Tree+Planting+Campaigns%22+OR+%22Wildlife+Management+Areas%22+OR+%22Biodiversity+Strategy+and+Action+Plans%22+OR+%22Circular+Economy+Initiatives%22+OR+%22Disaster+Risk+Reduction+Funding%22+OR+%22DRR+Funding%22+OR+%22Ecosystem+Valuation%22+OR+%22Fisheries+Improvement+Projects%22+OR+%22Green+Job+Training+Programs%22+OR+%22Holistic+Management+Funding%22+OR+%22Indigenous+Peoples%27+Biodiversity+Conservation%22+OR+%22Landscape+Connectivity+Projects%22+OR+%22Mangrove+Restoration+Initiatives%22+OR+%22Nature-based+Solutions%22+OR+%22Organic+Certification+Cost+Share%22+OR+%22Peatland+Restoration+and+Management%22+OR+%22Quantitative+Easing+for+the+Planet%22+OR+%22Riparian+Buffer+Zones+Support%22+OR+%22Sustainable+Land+Management%22+OR+%22Threatened+Species+Recovery+Plans%22+OR+%22Urban+Biodiversity+Enhancement%22+OR+%22Vertical+Farming+Incentives%22+OR+%22Water+Efficiency+Programs%22+OR+%22Xeriscaping+Rebates%22+OR+%22Youth+Engagement+in+Conservation%22+OR+%22Zero-waste+Strategies%22+OR+%22Agrobiodiversity+Conservation+Subsidies%22+OR+%22Biochar+Production+Incentives%22+OR+%22Climate+Resilience+Building%22+OR+%22Drought+Management+Assistance%22+OR+%22Eco-labeling+Programs%22+OR+%22Functional+Biodiversity+Promotion%22+OR+%22Green+Supply+Chain+Financing%22+OR+%22Hedgerow+Restoration+Support%22+OR+%22Integrated+Water+Resources+Management+Funding%22+OR+%22Jungle+Restoration+Projects%22"
query_url2 <- gsub("%22", "%5C%22", query_url)

This still errors though, but now for a different reason - it's just genuinely long:

cat(rawToChar(
  httr::GET(query_url2)$content
))
#> <html>
#>   <head>
#>     <title>Bad Request</title>
#>   </head>
#>   <body>
#>     <h1><p>Bad Request</p></h1>
#>     Request Line is too large (4468 &gt; 4094)
#>   </body>
#> </html>
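Since the limit in that message applies to the whole request line, one possible way forward is to split the OR-joined phrases over several shorter requests and merge the results afterwards. A rough base R sketch follows; chunk_terms() and the 3000-character budget are made up for illustration and are not part of openalexR:

```r
# Hypothetical helper: greedily pack phrases into groups whose encoded,
# OR-joined search string stays within max_chars characters.
chunk_terms <- function(terms, max_chars = 3000) {
  chunks <- list()
  current <- character(0)
  for (term in terms) {
    candidate <- c(current, term)
    joined <- paste0('"', candidate, '"', collapse = " OR ")
    encoded_length <- nchar(URLencode(joined, reserved = TRUE))
    if (length(current) > 0 && encoded_length > max_chars) {
      # Adding this term would exceed the budget: close the current chunk
      chunks[[length(chunks) + 1]] <- current
      current <- term
    } else {
      current <- candidate
    }
  }
  if (length(current) > 0) {
    chunks[[length(chunks) + 1]] <- current
  }
  chunks
}

# Tiny budget so that the split is visible with only three phrases
terms <- c("Agriculture reform", "ocean reform", "energy reform")
chunks <- chunk_terms(terms, max_chars = 50)
str(chunks)
```

Each chunk would then go out as its own request, and the returned works would need to be combined and de-duplicated by their OpenAlex id, since one work can match phrases in more than one chunk. For the same reason the per-request counts would not simply add up.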

Overall I'm completely stumped though. I have no idea why this is an issue and whether this is on our end, OA's end, httr's end, etc.

rkrug commented 7 months ago

Hm. What about using the opportunity to move to httr2? That would exclude one possible culprit.

Also, I could try to get somebody from OA to look at it. Maybe their log files would help?

yjunechoe commented 7 months ago

Switching over to httr2 would indeed be nice but it'll require more than just rewriting code and I currently don't have the bandwidth for this - I'll keep the issue in mind but for now the workaround above should do.

yjunechoe commented 7 months ago

Sorry, just for completeness: what function call generated the long query URL you originally posted? Was it spit out by oa_query() (if so, what were the inputs)?

rkrug commented 7 months ago

I got the URL from the OpenAlex web interface. If I remember correctly, the original search term did not work via openalexR (same symptoms as a too-long URL, but probably something different; by the way, it would be nice to give a warning if the URL might be too long), so I tried the API to find out by how much. But there it worked. So I copied the API call back into the openalexR call, which is where it did not work.
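A warning along those lines could be sketched as below. check_url_length() is a hypothetical helper, not an existing openalexR function, and the 4094 limit is simply taken from the "Request Line is too large (4468 > 4094)" message earlier in the thread; how exactly the server counts the request line is an assumption:

```r
# Hypothetical helper: estimate the HTTP request line length and warn when
# it exceeds the limit reported in the server's "Bad Request" page.
check_url_length <- function(url, limit = 4094) {
  # Approximate the request line as "GET <path?query> HTTP/1.1"
  path_and_query <- sub("^https?://[^/]+", "", url)
  n <- nchar(paste("GET", path_and_query, "HTTP/1.1"))
  if (n > limit) {
    warning(
      "Request line is about ", n, " characters (limit assumed to be ", limit,
      "); the server may reject this URL.",
      call. = FALSE
    )
  }
  invisible(n)
}

short_url <- "https://api.openalex.org/works?page=1"
check_url_length(short_url)

# An artificially long filter value, only to trigger the warning
long_url <- paste0(short_url, "&filter=title_and_abstract.search:", strrep("a", 5000))
check_url_length(long_url)
```

Returning the estimated length invisibly keeps the helper usable both as a silent check and for reporting how far over the limit a URL is.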

rkrug commented 7 months ago

> Switching over to httr2 would indeed be nice but it'll require more than just rewriting code

Could you elaborate? Why do you say that? I agree that a switch to httr2 opens the possibility of making some breaking changes (openalexR2), but why do you say that is necessary?

yjunechoe commented 7 months ago

> Could you elaborate? Why do you say that? I agree that a switch to httr2 opens the possibility of making some breaking changes (openalexR2), but why do you say that is necessary?

Oh - it's not necessary to switch over at all! I just meant that if we were to, it would require quite a bit of work.