Closed mbojan closed 9 years ago
Good ideas
> http errors
> by adding an optional sleep time between requests in `while_oai`
Have you seen problems due to making simultaneous requests too quickly? I haven't yet, but doesn't mean they aren't there
All functions should have the `...` argument for passing on curl options. Do you think we need additional infrastructure?
> malformed XML
Sure, it makes sense to use e.g. `tryCatch()` to read the XML, and if we get an XML read error, we could write the unparsed XML to disk and grep out the resumption token.
I wonder if it makes sense to let the user optionally just stop on XML read errors (with an error message saying so), or maybe that's not helpful.
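A minimal sketch of that idea in base R, not the package's actual implementation. The helper name `parse_or_salvage`, its arguments, and the shape of the returned list are all hypothetical; `parser` stands for whatever function reads the XML (for instance `xml2::read_xml`), and the regular expression assumes the token sits in a plain `<resumptionToken ...>...</resumptionToken>` element.

```r
# Hypothetical helper: try to parse an OAI-PMH response; on a parse error,
# save the raw text to disk and salvage the resumption token with a regex.
# `parser` is any function that takes the XML text, e.g. xml2::read_xml.
parse_or_salvage <- function(raw_xml, parser,
                             dump_file = tempfile(fileext = ".xml")) {
  tryCatch(
    list(ok = TRUE, xml = parser(raw_xml), token = NULL, dump_file = NULL),
    error = function(e) {
      # Keep the unparsed response so it can be inspected later.
      writeLines(raw_xml, dump_file)
      # Match <resumptionToken ...>TOKEN</resumptionToken>; attributes optional.
      m <- regmatches(
        raw_xml,
        regexec("<resumptionToken[^>]*>([^<]+)</resumptionToken>", raw_xml)
      )[[1]]
      list(ok = FALSE, xml = NULL,
           token = if (length(m) == 2) m[2] else NULL,
           dump_file = dump_file,
           message = conditionMessage(e))
    }
  )
}
```

If `ok` is `FALSE` and `token` is not `NULL`, harvesting could continue from the salvaged token; stopping with the stored `message` instead would cover the "simply stop" option.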
> Have you seen problems due to making simultaneous requests too quickly? I haven't yet, but doesn't mean they aren't there
Yep. After 1.5 hr of `list_records` harvesting, which comprised about 150-200 requests, I got a "504 gateway timeout", which is probably a result of poor equipment or network configuration of the service. To deal with this, I manually saved the `resumptionToken` and used it in a new call to `list_records` after an hour or so....
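The manual "save the token, wait, try again" routine could be automated along these lines. This is only a sketch under assumptions: `with_retry` and its arguments are hypothetical names, and `fetch` is assumed to be a zero-argument function that signals an R error when the request fails (for example, a wrapper that turns a 504 response into an error).

```r
# Hypothetical helper: call `fetch()` until it succeeds or `max_tries` is
# reached, pausing `wait` seconds between attempts.
with_retry <- function(fetch, max_tries = 3, wait = 60) {
  stopifnot(max_tries >= 1)
  for (attempt in seq_len(max_tries)) {
    # Capture the error object instead of letting it abort the harvest.
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    # Do not sleep after the final attempt.
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("All ", max_tries, " attempts failed; last error: ",
       conditionMessage(result))
}
```

Here `fetch` would wrap the single request that resumes from the saved `resumptionToken`, so a transient gateway timeout costs one pause rather than the whole harvest.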
I isolated the HTTP issues in #23 and the XML issues in #24.
Seems like given #23 and #24, this can now be closed?
I think so.
This is a rather general/thick issue that will have to be split.
At the moment the harvesting functions do not have any recovery protocols. This is rather painful with larger requests to OAI-PMH because there is no good way of splitting a request into chunks apart from `resumptionToken`. In general there are the following types of errors:
- Errors handled by `handle_errors`.
- HTTP errors from `httr::GET`, which is used in several places, but most importantly in `while_oai`, which is the core of the `list_*` functions.
- Malformed XML that `xml2::read_xml` cannot parse. For example, the response contains UTF-8 characters that are illegal in XML, characters that are not escaped, etc.

It would be useful to come up with a way to recover from such errors. For example:
- Malformed XML: extract the `resumptionToken` without parsing the whole XML with `read_xml`. One option is just to write a regular expression. Perhaps another option is related to https://github.com/hadley/xml2/issues/10 if it gets implemented.
- HTTP errors: deal with them in the calls to `GET`, but also by adding an optional sleep time between requests in `while_oai`.
- `resumptionToken` usually comes with an `expirationDate`, so in order not to overload the server, harvesting could wait some time, no longer than the `expirationDate`, until issuing the next request with the token.
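The waiting rule in the last point could be computed roughly as follows. This is a sketch under assumptions, not existing package code: `token_wait`, `preferred`, and `margin` are hypothetical names, the attribute is read with a regular expression so it works on unparsed text, and the timestamp is assumed to be UTC in the form `2015-06-01T12:00:00Z`.

```r
# Hypothetical helper: seconds to pause before reusing a resumption token.
# Waits `preferred` seconds, but never past the token's expirationDate
# minus a safety `margin`. Falls back to `preferred` when no usable
# expirationDate is found.
token_wait <- function(raw_xml, preferred = 30, margin = 10,
                       now = Sys.time()) {
  m <- regmatches(
    raw_xml,
    regexec("<resumptionToken[^>]*expirationDate=\"([^\"]+)\"", raw_xml)
  )[[1]]
  if (length(m) < 2) return(preferred)  # no expirationDate advertised
  expires <- as.POSIXct(m[2], format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
  if (is.na(expires)) return(preferred)  # unexpected timestamp format
  remaining <- as.numeric(difftime(expires, now, units = "secs")) - margin
  # Never negative, never longer than the preferred pause.
  max(0, min(preferred, remaining))
}
```

A harvesting loop could then call `Sys.sleep(token_wait(response_text))` before issuing the next request, so the pause is polite to the server but the token is still valid when it is used.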