Closed mbojan closed 5 years ago
Yet another option for getting the `resumptionToken` is to use `xml2::read_html`. It seems to be much more tolerant of malformed files (including illegal characters, which seem to be deleted).
It should not be used instead of `read_xml` though, because, among other things, it converts XML tag names to lower case. But it seems OK if we are only interested in getting the token.
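A minimal sketch of the idea, assuming the `xml2` package is installed. The response fragment below is made up for illustration (an unescaped `&` makes it malformed XML); it is not output from a real OAI-PMH server.

```r
library(xml2)

# Hypothetical response fragment: the bare "&" is illegal in XML
raw <- paste0(
  "<OAI-PMH><ListRecords>",
  "<record><title>Smith & Jones</title></record>",
  "<resumptionToken>abc123</resumptionToken>",
  "</ListRecords></OAI-PMH>"
)

# The strict parser is expected to refuse this input
strict <- tryCatch(read_xml(raw), error = function(e) NULL)

# The lenient HTML parser recovers, but element names come back in lower case,
# so the XPath has to ask for "resumptiontoken"
lenient <- read_html(raw)
token <- xml_text(xml_find_first(lenient, "//resumptiontoken"))
```

The lower-cased element name in the XPath is the reason the lenient tree is only used here to fish out the token and not to process the records.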
I am trying the following modification for `while_oai` to (try to) proceed with harvesting even in case of XML faults:
while is.character(token)
  GET()
  try to parse XML with `read_xml` (with optional removal of invalid characters)
  IF parsing is ok
    check for oai-pmh errors
    look for `resumptionToken`
    process the results as determined by `as` and `verb`
    collect the results in `out` and/or pass to dumper
  ELSE (i.e. read_xml fails)
    try to parse with `read_html`; if that fails, dump raw to file and stop()
    check for oai-pmh errors
    look for `resumptionToken`
    IF `as="raw"`
      collect raw results in `out` and/or pass to dumper
    ELSE
      dump raw XML to a file
      warning("bad XML dumped to file")
  IF has `resumptionToken`
    token <- resumptionToken
  ELSE
    token <- 1
The above assumes that the result of parsing with `read_html` is unreliable. So we write the raw XML to a file and try to proceed with the `resumptionToken`, if any.
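The control flow above could be sketched roughly as follows. This is not the actual `while_oai` code: `harvest_tokens` and its `fetch` argument are hypothetical stand-ins (with `fetch` replacing `GET()` and returning the response body as text), and result processing, OAI-PMH error checks, and dumping to file are left out so that only the token logic remains.

```r
library(xml2)

# Hypothetical stand-in for the harvesting loop; `fetch(token)` replaces GET()
harvest_tokens <- function(fetch) {
  token <- ""            # character, so the loop is entered
  seen <- character()    # tokens encountered along the way
  bad_pages <- 0L        # pages that read_xml could not parse
  while (is.character(token)) {
    txt <- fetch(token)
    doc <- tryCatch(read_xml(txt), error = function(e) NULL)
    if (!is.null(doc)) {
      # strict parse worked: the tree can be trusted
      node <- xml_find_first(doc, "//*[local-name()='resumptionToken']")
    } else {
      # fallback: lenient parse, used only to find the token
      bad_pages <- bad_pages + 1L
      doc <- read_html(txt)
      node <- xml_find_first(doc, "//resumptiontoken")
    }
    next_token <- xml_text(node)   # NA when the element is missing
    if (!is.na(next_token) && nzchar(next_token)) {
      token <- next_token
      seen <- c(seen, token)
    } else {
      token <- 1                   # not character, so the loop ends
    }
  }
  list(tokens = seen, bad_pages = bad_pages)
}

# Made-up pages: the second one is malformed (bare "&")
pages <- list(
  "<r><resumptionToken>t1</resumptionToken></r>",
  "<r><t>a & b</t><resumptionToken>t2</resumptionToken></r>",
  "<r><resumptionToken/></r>"
)
i <- 0
fake_fetch <- function(token) {
  i <<- i + 1
  pages[[i]]
}
res <- harvest_tokens(fake_fetch)
```

Setting `token <- 1` works as a stop signal only because the loop condition is `is.character(token)`.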
@mbojan tests are now failing on the `handle_errors` fxn: the class returned is no longer `oai-pmh_error`, but `Rcpp::exception`. Any thoughts?
run the test suite to see what happens
I'm on it.
Looks like OAI-PMH service at pbn.nauka.gov.pl is malfunctioning (certificate problems). Only those tests seem to fail.
Hmm, okay, anything we should do to fail better in those cases?
I'll change the test URLs and see if the tests pass correctly.
What was failing is actually `httr::GET`, not OAI error handling. These tests are supposed to test the correct catching of OAI-PMH errors, on the assumption that the test URLs actually lead to these errors. So I don't think there is a need to modify the tests, apart from coming up with URLs that correctly return OAI-PMH exceptions from a fully functional OAI-PMH server. What do you think?
That's a general problem with testing your system against some external system...
Okay, I'll have a look at the HTTP request error catching
Do you think they deserve a dedicated "net" of tests to catch?
One thing I might add is that the OAI-PMH error handling tests could first check whether the request returns a proper result at all before parsing it to learn what the OAI-PMH exception is.
> Do you think they deserve a dedicated "net" of tests to catch?
we'll see, I'll look into it
> first check whether the request returns a proper result at all before parsing it to learn what the OAI-PMH exception is.
makes sense
Perhaps it makes sense for those tests that rely on contacting some OAI-PMH service to first check whether the service is available and then skip the tests if it is not available?
Inspired by "Skipping a test" here: http://r-pkgs.had.co.nz/tests.html
We would have to write something like `oai_available(url)` though.
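A possible shape for such a helper, as a sketch only. It assumes the `httr` package (and `testthat` for the skip); the body of `oai_available`, the `timeout_sec` argument, and the `skip_if_oai_unavailable` wrapper are all hypothetical, not existing package functions. It treats the service as available when an `Identify` request completes without a connection error or an HTTP error status.

```r
library(httr)

# Hypothetical helper: TRUE if the endpoint answers an Identify request
oai_available <- function(url, timeout_sec = 5) {
  res <- tryCatch(
    GET(url, query = list(verb = "Identify"), timeout(timeout_sec)),
    error = function(e) NULL   # DNS failure, refused connection, bad certificate, timeout
  )
  !is.null(res) && !http_error(res)
}

# Hypothetical wrapper for use at the top of a test_that() block
skip_if_oai_unavailable <- function(url) {
  if (!oai_available(url)) {
    testthat::skip(paste("OAI-PMH service not available:", url))
  }
  invisible(TRUE)
}
```

With this, a test hitting an external server would be reported as skipped, not failed, when the failure is in the HTTP request and not in the OAI-PMH handling.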
yeah, sounds good
closing for now, we can open a new issue if there are still problems along these lines
Extracted from #17
Malformed XML errors - it may happen that the downloaded XML is malformed and cannot be parsed by `xml2::read_xml`. For example, it may contain UTF-8 characters that are illegal in XML, characters that are not escaped, etc. If the XML is malformed (3 in #17) it could be written somewhere without parsing. That would require some fallback mechanism for getting the `resumptionToken` without parsing the whole XML with `read_xml`. One option is just to write a regular expression. Perhaps another option is related to hadley/xml2#10 if it gets implemented.
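The regular-expression option might look something like this in base R. `extract_token` is a hypothetical name, and the pattern is a guess at what is sufficient: it assumes the element is written as `<resumptionToken ...>value</resumptionToken>` without a namespace prefix, and it returns the text as-is (any XML entities such as `&amp;` in the token are not decoded).

```r
# Hypothetical helper: pull the resumptionToken out of raw response text
extract_token <- function(txt) {
  pattern <- "<resumptionToken[^>]*>([^<]*)</resumptionToken>"
  m <- regmatches(txt, regexec(pattern, txt))[[1]]
  # m[1] is the whole match, m[2] the captured token text
  if (length(m) < 2 || !nzchar(m[2])) NA_character_ else m[2]
}

with_attrs <- extract_token(
  '<x><resumptionToken completeListSize="100" cursor="0">abc123</resumptionToken></x>'
)
empty_tok <- extract_token("<x><resumptionToken/></x>")
no_tok <- extract_token("<x>nothing here</x>")
```

An empty or missing token comes back as `NA`, which a caller could treat as the end of the list.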