ropensci / oai

OAI-PMH R client
https://docs.ropensci.org/oai

Flow control is ignored -- causing unrecoverable errors #64

Closed maltman closed 1 year ago

maltman commented 2 years ago

Try:

out <- oai::list_records(url = "http://export.arxiv.org/oai2",
  prefix = "arXiv",
  from = "2022-04-01")

This results in:

Service Unavailable (HTTP 503)

No results are returned, even though partial results were collected. So there is no graceful way to resume.

There are three issues here that seem to make the overall interface non-robust:

  1. there is no mechanism for setting a delay between subsequent requests ... so
  2. the arXiv server eventually issues a 503 flow control directive, which the client treats as a permanent failure rather than as an instruction to wait for the specified period and retry. This causes the client to stop(), which ...
  3. aborts the call rather than returning partial results ... so no resumption is possible.

A possible workaround would be an external wrapper that divides the "from" - "until" interval into small chunks and uses purrr:: wrappers to schedule and retry each chunk. This is inelegant. A cleaner solution might be to handle OAI flow control explicitly inside while_oai(), and at least to return partial values and a resumption token on error.
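For illustration, the external-wrapper workaround could look roughly like the sketch below. It uses base R rather than purrr, and the helper names `split_date_range()` and `fetch_in_chunks()` are hypothetical, not part of oai. The `fetch` argument is any function taking `from` and `until` strings; windows that keep failing are left as `NULL`, so partial results survive.

```r
# Hypothetical helper: split [from, until] into consecutive,
# non-overlapping windows of at most `by_days` days (both ends inclusive).
split_date_range <- function(from, until, by_days = 7) {
  from <- as.Date(from)
  until <- as.Date(until)
  stopifnot(from <= until, by_days >= 1)
  starts <- seq(from, until, by = by_days)
  ends <- pmin(starts + (by_days - 1), until)
  data.frame(from = starts, until = ends)
}

# Hypothetical helper: call `fetch(from, until)` for each window, retrying a
# failed window up to `max_tries` times and pausing `wait` seconds in between.
# Windows that never succeed stay NULL, so earlier results are not lost.
fetch_in_chunks <- function(windows, fetch, max_tries = 3, wait = 5) {
  results <- vector("list", nrow(windows))
  for (i in seq_len(nrow(windows))) {
    for (attempt in seq_len(max_tries)) {
      res <- tryCatch(
        fetch(format(windows$from[i]), format(windows$until[i])),
        error = function(e) e
      )
      if (!inherits(res, "error")) {
        results[i] <- list(res)
        break
      }
      if (attempt < max_tries) Sys.sleep(wait)
    }
  }
  results
}

# Possible usage against the query from this report (not run here):
# windows <- split_date_range("2022-04-01", "2022-04-30", by_days = 3)
# out <- fetch_in_chunks(windows, function(from, until) {
#   oai::list_records("http://export.arxiv.org/oai2", prefix = "arXiv",
#                     from = from, until = until)
# })
```

The fixed `wait` is a simplification: it does not read the delay the server asks for, which is why handling this inside while_oai() would be cleaner.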

mbojan commented 2 years ago

Thanks a lot @maltman for this report. Indeed, at this time the mechanism is rather primitive. We'll definitely look into handling this in while_oai().

mbojan commented 2 years ago

Researching this further (CC @sckott)...

It would be natural for while_oai() to switch from httr::GET() to httr::RETRY(), which has built-in functionality for using the Retry-After value in the response header.

Problems/questions:

maltman commented 2 years ago

Thanks for the quick response!

Comment -- yes, a 429 is likely a better choice for a new protocol. However, a 503 with a Retry-After header is documented in both the HTTP/1.1 RFC and the OAI-PMH specification, so this is unlikely to be limited to arXiv.

See http://www.openarchives.org/OAI/openarchivesprotocol.html#StatusCodes and http://www.openarchives.org/OAI/openarchivesprotocol.html#FlowControl in the OAI-PMH specification, and https://datatracker.ietf.org/doc/html/rfc7231#section-6.6.4 in the HTTP/1.1 RFC.

(429 is defined in a different RFC, RFC 6585, "Additional HTTP Status Codes".)
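As a rough sketch of what treating 503 as a flow-control directive could mean in practice: wait for the period given in Retry-After, then repeat the same request. This is not the oai implementation. `retry_after_seconds()`, `with_flow_control()` and `oai_request()` are hypothetical names, and the sketch only understands the delay-in-seconds form of Retry-After, falling back to a default for a missing header or an HTTP-date value.

```r
# Hypothetical helper: read Retry-After as a whole number of seconds.
# A missing header or an HTTP-date value falls back to `default`.
retry_after_seconds <- function(value, default = 10) {
  if (is.null(value) || length(value) != 1L || is.na(value)) return(default)
  value <- trimws(value)
  if (grepl("^[0-9]+$", value)) as.numeric(value) else default
}

# Hypothetical helper: repeat `request()` while it answers 503, sleeping for
# the period the server asked for. `request` is a zero-argument function
# returning a list with `status` and `retry_after` (header value or NULL).
# After `max_tries` attempts the last response is returned as-is.
with_flow_control <- function(request, max_tries = 5, sleep = Sys.sleep) {
  for (attempt in seq_len(max_tries)) {
    resp <- request()
    if (resp$status != 503L) return(resp)
    if (attempt < max_tries) sleep(retry_after_seconds(resp$retry_after))
  }
  resp
}

# Hypothetical adapter: wrap an httr::GET() call in the shape expected above.
oai_request <- function(url, query) {
  function() {
    r <- httr::GET(url, query = query)
    list(
      status = httr::status_code(r),
      retry_after = httr::headers(r)[["retry-after"]],
      response = r
    )
  }
}

# Possible usage (not run here):
# resp <- with_flow_control(oai_request(
#   "http://export.arxiv.org/oai2",
#   list(verb = "ListRecords", metadataPrefix = "arXiv", from = "2022-04-01")
# ))
```

Passing `sleep` as an argument is only there so the loop can be exercised without real waiting.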

mbojan commented 2 years ago

Thanks @maltman for these links.

I'm testing it right now, but your original query just does not want to fail now and I'm getting 200s only... :D

mbojan commented 2 years ago

I just pushed the i64-retrying branch, which replaces GET() with RETRY(). CI is still chewing on it. @maltman, can you please install from that branch and check whether it works for you?

BTW, that OAI query returns quite big chunks of results. You may want to take advantage of a dumper function (see ?dumpers) to save the results incrementally.

mbojan commented 1 year ago

Fixed by https://github.com/ropensci/oai/commit/57ab8908e816d0c86988778fee10f12a695e6614