ropensci / oai

OAI-PMH R client
https://docs.ropensci.org/oai

Recovering from errors #17

Closed mbojan closed 9 years ago

mbojan commented 9 years ago

This is a rather general/thick issue that will have to be split.

At the moment the harvesting functions have no recovery mechanism. This is rather painful with larger requests to OAI-PMH, because there is no good way of splitting a request into chunks other than by resumptionToken.

In general there are the following types of errors:

  1. OAI-PMH errors - currently handled by handle_errors
  2. Network-related errors, such as timeouts. They are thrown by httr::GET, which is used in several places, most importantly in while_oai, the core of the list_* functions.
  3. Malformed XML errors - the downloaded XML may be malformed and impossible to parse with xml2::read_xml. For example, it may contain characters that are illegal in XML or that are not properly escaped.

It would be useful to come up with a way to recover from such errors. For example:

sckott commented 9 years ago

Good ideas

http errors

by adding an optional sleep time between requests in while_oai

Have you seen problems due to making simultaneous requests too quickly? I haven't yet, but doesn't mean they aren't there

All functions should have the ... for passing on curl options. Do you think we need additional infrastructure?
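The optional pause between requests could look roughly like the sketch below. This is not the package's while_oai: `harvest_with_pause()`, `fetch_page` and the `sleep` argument are hypothetical names, and `fetch_page` stands in for whatever performs one request and returns the records plus the next resumptionToken.

```r
# Sketch only: a paging loop with an optional pause between requests.
# `fetch_page` is a hypothetical function(token) that returns
# list(records = <character vector>, token = <next token or NULL>).
harvest_with_pause <- function(fetch_page, sleep = 0) {
  records <- character(0)
  token <- NULL
  repeat {
    page <- fetch_page(token)
    records <- c(records, page$records)
    token <- page$token
    # An absent or empty token means the server has no more pages.
    if (is.null(token) || !nzchar(token)) break
    if (sleep > 0) Sys.sleep(sleep)
  }
  records
}

# A fake three-page "server" so the sketch runs without a network.
fake_fetch <- function(token) {
  pages <- list(
    start = list(records = c("a", "b"), token = "t1"),
    t1    = list(records = c("c", "d"), token = "t2"),
    t2    = list(records = "e",         token = NULL)
  )
  pages[[if (is.null(token)) "start" else token]]
}

res <- harvest_with_pause(fake_fetch, sleep = 0.01)
```

With `sleep = 0` the loop behaves as if no pause had been added, so the option would stay opt-in.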

malformed XML

Sure, it makes sense to use e.g. tryCatch() to read the XML, and if we get an XML read error, we could write the unparsed XML to disk and grep out the resumption token.

I wonder if it makes sense to let the user optionally just stop on XML read errors (with an error message saying so), or maybe that's not helpful
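A minimal sketch of that tryCatch() idea, under some assumptions: `parse_or_salvage()`, `dump_file` and `parser` are hypothetical names, the parser defaults to xml2::read_xml, and the regular expression assumes the token sits in a plain `<resumptionToken>` element without a namespace prefix.

```r
# Sketch only: parse one OAI-PMH response, falling back to a regex when
# the XML cannot be read. `parse_or_salvage` and its arguments are
# hypothetical; `parser` defaults to a thin wrapper around xml2::read_xml.
parse_or_salvage <- function(raw, dump_file = tempfile(fileext = ".xml"),
                             parser = function(x) xml2::read_xml(x)) {
  tryCatch(
    list(ok = TRUE, xml = parser(raw), token = NULL, dump_file = NULL),
    error = function(e) {
      # Keep the unparsed response so it can be inspected later.
      writeLines(raw, dump_file)
      # Pull the token out of the raw text; attributes on the tag are allowed.
      m <- regmatches(
        raw,
        regexec("<resumptionToken[^>]*>([^<]+)</resumptionToken>", raw)
      )[[1]]
      list(
        ok = FALSE,
        xml = NULL,
        token = if (length(m) == 2) m[2] else NULL,
        dump_file = dump_file,
        message = conditionMessage(e)
      )
    }
  )
}

# A response with a mismatched tag, which an XML parser should reject.
bad <- paste0(
  "<OAI-PMH><ListRecords><record><title>broken</record>",
  "<resumptionToken cursor=\"0\">abc123</resumptionToken>",
  "</ListRecords></OAI-PMH>"
)
out <- parse_or_salvage(bad)
```

Here `out$token` carries the salvaged token and `out$dump_file` points at the saved response. A token recovered this way is raw text, so any XML entities in it would still need unescaping before reuse.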

mbojan commented 9 years ago

Have you seen problems due to making simultaneous requests too quickly? I haven't yet, but doesn't mean they aren't there

Yep. After 1.5 hours of list_records harvesting, which comprised about 150-200 requests, I got a "504 gateway timeout", which is probably a result of poor equipment or network configuration on the service's side. To deal with this, I manually saved the resumptionToken and used it in a new call to list_records after an hour or so....
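The manual workaround (keep the resumptionToken, try again later) could be automated along these lines. This is a sketch with hypothetical names (`harvest_with_retry()`, `fetch_page`, `max_tries`, `wait`), not existing package code, and it assumes the request function signals an R error on an HTTP failure.

```r
# Sketch only: retry a failed page request after a wait, keeping the
# current resumptionToken so completed pages are not requested again.
# `fetch_page` is a hypothetical function(token) returning
# list(records = ..., token = ...); it signals an error on HTTP failure.
harvest_with_retry <- function(fetch_page, token = NULL,
                               max_tries = 3, wait = 1) {
  records <- character(0)
  repeat {
    page <- NULL
    for (i in seq_len(max_tries)) {
      page <- tryCatch(fetch_page(token), error = function(e) NULL)
      if (!is.null(page)) break
      if (i < max_tries) Sys.sleep(wait)
    }
    if (is.null(page)) {
      # Give up, but hand back the token so the caller can resume later.
      return(list(records = records, token = token, complete = FALSE))
    }
    records <- c(records, page$records)
    token <- page$token
    if (is.null(token) || !nzchar(token)) break
  }
  list(records = records, token = NULL, complete = TRUE)
}

# Fake server: the second page fails once (like a 504), then succeeds.
failures_left <- 1
flaky_fetch <- function(token) {
  if (identical(token, "t1") && failures_left > 0) {
    failures_left <<- failures_left - 1
    stop("504 Gateway Timeout")
  }
  if (is.null(token)) {
    list(records = c("a", "b"), token = "t1")
  } else {
    list(records = "c", token = NULL)
  }
}

res <- harvest_with_retry(flaky_fetch, wait = 0.01)
```

If every attempt fails, the result has `complete = FALSE` and still carries the last token, which can be passed back in as `token` once the service has recovered.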

I split the HTTP issues out into #23 and the XML issues into #24.

sckott commented 9 years ago

Seems like given #23 and #24, this can now be closed?

mbojan commented 9 years ago

I think so.