infrae / pyoai

The oaipmh module is a Python implementation of an "Open Archives$ Initiative Protocol for Metadata Harvesting"
http://pypi.python.org/pypi/pyoai
Other
84 stars 53 forks source link

add XMLparser recover option for client; keep default behavior same: recover=False #39

Open sdm7g opened 5 years ago

sdm7g commented 5 years ago

Adds an option in Client and BaseClient to use recover=True option on etree.XMLParser, so that OAI harvesting won't fail on a bad metadata payload. Default is recover=False, so that it doesn't change any existing behavior ( Making the conservative choice here, just in case anyone is relying on catching failures to validate feeds. )

Initially, I tried parsing first with recover=False, catching errors and then retrying with recover=True, but I discovered that was unnecessary.

  1. I though parsing with recover=True might be slower, but when there are no parser errors to recover from, I could not measure any time penalty for using that option.
  2. I wanted to complete the harvest, but I also wanted to catch and log errors so that I could notify upstream maintainers of the feed that there were issues to be fixed. I discovered that self.XMLParser.error_log will contain any errors if they occur, and that lxml has a facility to pass lxml errors to the Python logging error log module by calling tree.use_global_python_log(). Logging should be configured first before using that facility, so to enable, use something like this in the calling program:
from lxml import etree
import logging
etree.use_global_python_log(etree.PyErrorLog())
logging.basicConfig()

That will output a line like this, for example, on an unescaped ampersand: CRITICAL:root:<string>:47:219:FATAL:PARSER:ERR_ENTITYREF_SEMICOL_MISSING: EntityRef: expecting ';'

lxml documents how that message can be changed or filtered, if for example, you want to change those 'CRITICAL' to 'WARNING' , but I didn't think that was worth adding to the base code if it's not already using logging module.

bloomonkey commented 5 years ago

👍 I'm the author and maintainer of oai-harvest and would love to see this PR merged, to avoid having to maintain a fork of the Client class in that project 🙏

interrogator commented 4 years ago

Looks good to me! Mergeable in my opinion @jascoul, would love to bring down the use of forks with patches for ppl like @bloomonkey