Getting bibliographic records from OpenAlex
Long query URL gives error in `oa_request()` but works in browser #216

Open rkrug opened 4 months ago

rkrug commented 4 months ago

I have an extremely long search query which works in the browser.

But when running

#> Loading required package: usethis
#> Thank you for using openalexR!
#> To acknowledge our work, please cite the package by calling `citation("openalexR")`.
#> To suppress this message, add `openalexR.message = suppressed` to your .Renviron file.
oa_request(
    query_url = "",
    verbose = TRUE
)
#> Error: lexical error: invalid char in json text.
#>                                        <html>   <head>     <title>Bad 
#>                      (right here) ------^

Created on 2024-03-08 with reprex v2.1.0

yjunechoe commented 4 months ago

Wow this one is really really weird. The problem isn't even about length of the query string. Minimal reprex:

query_substr <- ""
oa_request(query_substr)
#> Warning in oa_request(query_substr): No records found!
#> list()
httr::GET(query_substr)
#> Response []
#>   Date: 2024-03-08 18:53
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 332 kB
#> {"meta":{"count":1717,"db_response_time_ms":222,"page":1,"per_page":25,"groups_count":null},"results":[{"id...

This happens because httr::GET() for some reason mangles the URL when we specify query = .... So with our per-page=1 default:

httr::GET(query_substr, query = list(`per-page` = 1))
#> Response []
#>   Date: 2024-03-08 18:57
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 115 B
#> {"meta":{"count":0,"db_response_time_ms":68,"page":1,"per_page":1,"groups_count":null},"results":[],"group_...

Essentially, GET() sees the " already encoded as %22, so it does not escape it with a backslash.

So instead of this url from above:

bad_url <- ""

GET() should instead be sending something like this:

good_url <- ""
#> Response []
#>   Date: 2024-03-08 19:33
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 9.69 kB
#> {"meta":{"count":35789,"db_response_time_ms":338,"page":1,"per_page":1,"groups_count":null},"results":[{"id...

One hacky way around that is to add the backslash character (encoded as %5C) and ensure that it is decoded before GET() sees it:

  URLdecode(gsub("%22", "%5C%22", bad_url))
#> Response [\"Agriculture+reform\"+OR+\"ocean+reform\"&per-page=1]
#>   Date: 2024-03-08 19:30
#>   Status: 200
#>   Content-Type: application/json
#>   Size: 9.69 kB
#> {"meta":{"count":35789,"db_response_time_ms":338,"page":1,"per_page":1,"groups_count":null},"results":[{"id...
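To make the two steps of that workaround easier to see, here is a minimal base R sketch. The URL below is a made-up example and not the one from this issue; only gsub() and URLdecode() from the workaround are used, and no request is sent.

```r
# Hypothetical URL for illustration only; not the URL from this issue.
example_url <- "https://api.openalex.org/works?search=%22ocean+reform%22"

# Step 1: put an encoded backslash (%5C) in front of every encoded quote (%22).
escaped <- gsub("%22", "%5C%22", example_url, fixed = TRUE)

# Step 2: decode the percent-escapes, so the string now holds literal \" pairs.
decoded <- URLdecode(escaped)

print(escaped)
print(decoded)
```

Whether the decoded form is what the server actually needs is still the open question in this thread; the sketch only shows what the string looks like after each step.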

So for your reprex, you can reformat your URL:

query_url <- ""
query_url2 <- gsub("%22", "%5C%22", query_url)

This still errors, but now for a different reason: the URL is just genuinely too long.

#> <html>
#>   <head>
#>     <title>Bad Request</title>
#>   </head>
#>   <body>
#>     <h1><p>Bad Request</p></h1>
#>     Request Line is too large (4468 &gt; 4094)
#>   </body>
#> </html>
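A length check before sending the request could catch this case early. The following is only a sketch: check_url_length() is a hypothetical helper, not part of openalexR, and the 4094 limit is simply the number from the error message above. The request line also contains the HTTP method and version, so the real cutoff for the URL itself is somewhat lower.

```r
# Limit taken from the "Request Line is too large (4468 > 4094)" error.
# Assumption: other servers or proxies may enforce a different limit.
max_request_line <- 4094L

# Hypothetical helper: warn if a URL is likely too long to be accepted.
# Returns TRUE (invisibly) when the URL is within the limit, FALSE otherwise.
check_url_length <- function(url, limit = max_request_line) {
  n <- nchar(url, type = "bytes")
  if (n > limit) {
    warning(
      sprintf("URL is %d bytes; the server may reject more than about %d.", n, limit),
      call. = FALSE
    )
  }
  invisible(n <= limit)
}
```

For example, check_url_length(query_url2) would warn for the URL in this reprex, assuming its length is close to the 4468 reported by the server.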

Overall I'm completely stumped though. I have no idea why this is an issue and whether this is on our end, OA's end, httr's end, etc.

rkrug commented 4 months ago

Hm. What about using the opportunity to move to httr2? That would exclude one possible culprit.

Also, I could try to get somebody from OA to look at it. Maybe their log files would help?

yjunechoe commented 4 months ago

Switching over to httr2 would indeed be nice but it'll require more than just rewriting code and I currently don't have the bandwidth for this - I'll keep the issue in mind but for now the workaround above should do.

yjunechoe commented 4 months ago

Sorry, just for completeness - what function call generated the long query URL you originally posted? Was it spit out by oa_query() (if so, what were the inputs)?

rkrug commented 3 months ago

I got the URL from the OpenAlex web interface. If I remember correctly, the original search term did not work via openalexR (same symptoms as too long, but probably something different - by the way, it would be nice to give a warning if the URL might be too long), so I tried the API to find out by how much. But there it worked. So I copied the API call back into the openalexR call, which is where it did not work.

rkrug commented 3 months ago

> Switching over to httr2 would indeed be nice but it'll require more than just rewriting code

Could you elaborate? Why do you say that? I agree that a switch to httr2 opens the possibility of making some breaking changes (openalexR2), but why do you say that is necessary?

yjunechoe commented 3 months ago

> Could you elaborate? Why do you say that? I agree that a switch to httr2 opens the possibility of making some breaking changes (openalexR2), but why do you say that is necessary?

Oh - it's not necessary to switch over at all! I just meant that if we were to, it would require quite a bit of work.