ropensci / oai

OAI-PMH R client
https://docs.ropensci.org/oai

Handling XML errors #24

Closed mbojan closed 5 years ago

mbojan commented 8 years ago

Extracted from #17

Malformed XML errors: the downloaded XML may be malformed and therefore impossible to parse with xml2::read_xml. For example, it may contain UTF-8 characters that are illegal in XML, or special characters that are not escaped.

If the XML is malformed (3 in #17) it could be written somewhere without parsing. That would require some fallback mechanism of getting the resumptionToken without parsing the whole XML with read_xml. One option is just to write a regular expression. Perhaps another option is related to hadley/xml2#10 if it gets implemented.
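The regular-expression fallback could look roughly like the sketch below. It is only an illustration in base R: `extract_token` is a hypothetical helper name, not a function of the oai package, and it assumes the token sits in a plain `<resumptionToken ...>value</resumptionToken>` element of the raw response text.

```r
# Hypothetical helper: pull the resumptionToken value out of raw response
# text without parsing the document as XML.
extract_token <- function(raw) {
  pattern <- "<resumptionToken[^>]*>([^<]*)</resumptionToken>"
  m <- regmatches(raw, regexec(pattern, raw))[[1]]
  # m[1] is the full match, m[2] the captured token value.
  # No match, or an empty element, is treated as "no further pages".
  if (length(m) < 2 || !nzchar(m[2])) NA_character_ else m[2]
}

raw <- "<ListRecords><record>Smith & Jones</record><resumptionToken cursor=\"0\">abc123</resumptionToken></ListRecords>"
token <- extract_token(raw)
```

Note that the captured value is returned as-is, so any XML entities inside the token (such as `&amp;`) would still need decoding before the token is sent back to the server.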

mbojan commented 8 years ago

Yet another option for getting the resumptionToken is to use xml2::read_html. It seems to be much more tolerant of malformed files (including illegal characters, which seem to be deleted).

It should not be used instead of read_xml though because, among other things, it converts XML tag names to lower case. But it seems OK if we are only interested in getting the token.
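A small sketch of that idea, assuming the xml2 package is installed. The response text is made up and simplified (no namespace, no XML declaration). Its unescaped `&` makes strict parsing fail, while the lenient HTML parser still exposes the token under a lower-cased element name.

```r
library(xml2)

# Made-up, simplified response: the bare "&" makes it malformed XML.
raw <- paste0(
  "<OAI-PMH><ListRecords>",
  "<record>Smith & Jones</record>",
  "<resumptionToken cursor=\"0\">tok123</resumptionToken>",
  "</ListRecords></OAI-PMH>"
)

# Strict parsing is expected to fail on this input.
strict_ok <- !inherits(try(read_xml(raw), silent = TRUE), "try-error")

# The HTML parser recovers; element names come back in lower case,
# so the XPath uses "resumptiontoken" rather than "resumptionToken".
doc <- read_html(raw)
token <- xml_text(xml_find_first(doc, "//resumptiontoken"))
```

Because of the lower-casing, the resulting document is only used here to locate the token, not to process the records themselves.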

mbojan commented 8 years ago

I am trying the following modification for while_oai to (try to) proceed with harvesting even in case of XML faults:

while is.character(token)
  GET()
  try to parse XML with `read_xml` (with optional removal of invalid characters)
  IF parsing is ok
    check for oai-pmh errors
    look for `resumptionToken`
    process the results as determined by `as` and `verb`
    collect the results in `out` and/or pass to dumper
  ELSE (i.e. read_xml fails)
    try to parse with `read_html`; if that fails, dump raw to file and stop()
    check for oai-pmh errors
    look for `resumptionToken`
    IF `as="raw"`
      collect raw results in `out` and/or pass to dumper
    ELSE
      dump raw XML to a file
      warning("bad XML dumped to file")
  IF has `resumptionToken`
    token <- resumptionToken
  ELSE
    token <- 1

mbojan commented 8 years ago

The above assumes that the result of parsing with read_html is unreliable. So we write raw XML to a file and try to proceed with the resumptionToken if any.
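The loop outlined above might be sketched in R roughly as follows. This is not the package's `while_oai`: the names `parse_with_fallback`, `find_token`, `harvest`, and the `fetch` argument are all hypothetical. The HTTP request is replaced by a user-supplied function so the sketch runs offline, and the OAI-PMH error check and the `as`/`verb` handling are left out for brevity.

```r
library(xml2)

# Hypothetical helper: strict XML parsing first, lenient HTML parsing second.
parse_with_fallback <- function(raw) {
  doc <- tryCatch(read_xml(raw), error = function(e) NULL)
  if (!is.null(doc)) return(list(doc = doc, strict = TRUE))
  doc <- tryCatch(read_html(raw), error = function(e) NULL)
  list(doc = doc, strict = FALSE)
}

# Hypothetical helper: token text, or NA when absent or empty.
# read_html lower-cases element names, hence the two XPath variants.
find_token <- function(parsed) {
  xpath <- if (parsed$strict) {
    "//*[local-name()='resumptionToken']"
  } else {
    "//resumptiontoken"
  }
  tok <- xml_text(xml_find_first(parsed$doc, xpath))
  if (is.na(tok) || !nzchar(tok)) NA_character_ else tok
}

# `fetch` is an assumed function(token) returning the raw response text.
harvest <- function(fetch, token = "") {
  good <- list()
  bad <- character()
  while (is.character(token)) {
    raw <- fetch(token)
    parsed <- parse_with_fallback(raw)
    if (is.null(parsed$doc)) {
      # Neither parser coped: keep the raw text and give up.
      path <- tempfile(fileext = ".xml")
      writeLines(raw, path)
      stop("unparseable response dumped to ", path)
    }
    if (parsed$strict) {
      good[[length(good) + 1]] <- parsed$doc
    } else {
      # The read_html result is treated as unreliable: dump raw, move on.
      path <- tempfile(fileext = ".xml")
      writeLines(raw, path)
      bad <- c(bad, path)
      warning("bad XML dumped to file")
    }
    tok <- find_token(parsed)
    token <- if (is.na(tok)) 1 else tok  # a non-character value ends the loop
  }
  list(good = good, bad = bad)
}
```

Passing the fetch step in as a function keeps the control flow testable without a live OAI-PMH server; in real use it would wrap the HTTP request.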

sckott commented 8 years ago

@mbojan tests are now failing on handle_errors fxn, the class returned is no longer oai-pmh_error, but Rcpp::exception - any thoughts?

run the test suite to see what happens

mbojan commented 8 years ago

I'm on it.

mbojan commented 8 years ago

Looks like OAI-PMH service at pbn.nauka.gov.pl is malfunctioning (certificate problems). Only those tests seem to fail.

sckott commented 8 years ago

Hmm, okay, anything we should do to fail better in those cases?

mbojan commented 8 years ago

I'll change the test URLs and see if the tests pass correctly.

What was failing is actually httr::GET, not the OAI error handling. These tests are supposed to test the correct catching of OAI-PMH errors, on the assumption that the test URLs actually lead to these errors. So I don't think there is a need to modify the tests, apart from coming up with URLs that correctly return OAI-PMH exceptions from a fully functional OAI-PMH server. What do you think?

That's a general problem with testing your system against some external system...

sckott commented 8 years ago

Okay, i'll have a look at the http request error catching

mbojan commented 8 years ago

Do you think they deserve a dedicated "net" of tests to catch?

One thing I might add is that the OAI-PMH error handling tests should first check whether the request returns a proper result at all before parsing it to learn what the OAI-PMH exception is.
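One way to picture that ordering is the sketch below, which assumes xml2 is installed. `classify_response` is a hypothetical helper; it takes the HTTP status code and body as plain values so that it can be tried without a network connection. It relies on the fact that OAI-PMH reports protocol errors inside the XML response as an `error` element with a `code` attribute.

```r
# Hypothetical helper: decide what kind of failure (if any) a response is,
# checking the transport level before looking for OAI-PMH errors.
classify_response <- function(status, body) {
  # 1. HTTP-level failure: nothing worth parsing.
  if (status >= 400) return("http_error")

  # 2. The body must parse before OAI-PMH errors can be inspected.
  doc <- tryCatch(xml2::read_xml(body), error = function(e) NULL)
  if (is.null(doc)) return("malformed_xml")

  # 3. Look for an OAI-PMH <error code="..."> element, ignoring namespaces.
  err <- xml2::xml_find_first(doc, "//*[local-name()='error']")
  code <- xml2::xml_attr(err, "code")
  if (is.na(code)) "ok" else paste0("oai_error:", code)
}

bad_verb <- paste0(
  "<OAI-PMH xmlns=\"http://www.openarchives.org/OAI/2.0/\">",
  "<error code=\"badVerb\">Illegal verb</error></OAI-PMH>"
)
result <- classify_response(200, bad_verb)
```

A test could then assert on the OAI-PMH error code only when the result is not `"http_error"`, so that a broken server shows up as a transport problem rather than as a failed error-handling test.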

sckott commented 8 years ago

> Do you think they deserve a dedicated "net" of tests to catch?

we'll see, I'll look into it

> first check whether the request returns a proper result at all before parsing it to learn what the OAI-PMH exception is.

makes sense

mbojan commented 8 years ago

Perhaps it makes sense for those tests that rely on contacting some OAI-PMH service to first check whether the service is available and then skip the tests if it is not available?

Inspired by "Skipping a test" here http://r-pkgs.had.co.nz/tests.html .

We would have to write something like oai_available(url) though.
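A minimal sketch of such a helper, assuming httr is installed. `oai_available` does not exist in the package yet; the `Identify` probe and the `timeout_sec` argument are illustrative choices.

```r
# Hypothetical helper: TRUE if the endpoint answers an Identify request
# with a non-error HTTP status, FALSE on any connection or HTTP failure.
oai_available <- function(url, timeout_sec = 5) {
  resp <- tryCatch(
    httr::GET(url, query = list(verb = "Identify"), httr::timeout(timeout_sec)),
    error = function(e) NULL  # DNS, TLS certificate, timeout, refused connection...
  )
  !is.null(resp) && !httr::http_error(resp)
}

# Possible use inside a testthat test (the URL is a placeholder):
# test_that("OAI-PMH errors are caught", {
#   testthat::skip_if_not(oai_available("https://example.org/oai"),
#                         "OAI-PMH service unavailable")
#   ...
# })
```

With this in place, a certificate problem like the one at pbn.nauka.gov.pl would lead to skipped tests instead of failures in the error-handling tests.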

sckott commented 8 years ago

yeah, sounds good

sckott commented 5 years ago

closing for now, we can open a new issue if there are still problems along these lines