jakelever / biotext

Get a nicely-chunked local copy of the biomedical literature (to use for other projects)!
MIT License

Dealing with broken PMC files without xlink namespace #12

Open jakelever opened 2 years ago

jakelever commented 2 years ago

A small number of PMC files use the xlink namespace without declaring it first. For example, the documents include "xlink:href" where the "xlink" prefix hasn't been defined. This breaks the XML parser and raises errors like the one below.

Traceback (most recent call last):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 390, in pmcxml2bioc
    for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 274, in process_pmc_file
    for event, elem in etree.iterparse(source, events=("start", "end", "start-ns", "end-ns")):
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1222, in iterator
    yield from pullparser.read_events()
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1297, in read_events
    raise event
  File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1269, in feed
    self._parser.feed(data)
xml.etree.ElementTree.ParseError: unbound prefix: line 12, column 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "src/convertPMC.py", line 56, in <module>
    for bioc_doc in pmcxml2bioc(io.StringIO(data)):
  File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 450, in pmcxml2bioc
    raise RuntimeError("Parsing error in PMC xml file: %s" % source)
RuntimeError: Parsing error in PMC xml file: <_io.StringIO object at 0x7f04d1099c18>
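
The failure can be reproduced outside the pipeline with a tiny stand-in document. This is only a minimal sketch: the `broken` and `declared` strings and the `try_parse` helper are made up for illustration and are not taken from a real PMC file or from the biotext code.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for a broken PMC file: "xlink" is used but never declared.
broken = '<article><ext-link xlink:href="https://example.org">link</ext-link></article>'

# The same document with the xlink namespace declared on the root element.
declared = (
    '<article xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<ext-link xlink:href="https://example.org">link</ext-link></article>'
)


def try_parse(xml_text):
    """Return the ParseError message for xml_text, or None if it parses cleanly."""
    try:
        for _event, _elem in etree.iterparse(io.StringIO(xml_text), events=("start", "end")):
            pass
    except etree.ParseError as err:
        return str(err)
    return None


print(try_parse(broken))    # an "unbound prefix" message with a line/column position
print(try_parse(declared))  # None
```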

An initial hacky fix was implemented in 63663fedad3de49a1bbc50e859821a0d4ee328cd and e30c3e9934c91bf79f0476e938bb9f6995f59027. These commits tried to fix href-specific cases. This needs to be explored further, as a new file has appeared where the problem is not href-related.
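
One possible direction for a more general fix, sketched here under stated assumptions, is to scan the raw text for any prefix used in a tag or attribute name and declare the missing ones on the root element before parsing. This is not what the commits above do. The function name `add_missing_namespaces`, the `KNOWN_NAMESPACES` table, and the `urn:unbound-prefix:` placeholder URI are all hypothetical. The regex-based scan is deliberately simple: it assumes attribute values contain no `>` and may over-declare a prefix that only appears inside an attribute value, which is harmless.

```python
import re
import xml.etree.ElementTree as etree

# Assumed lookup table; only these standard namespace URIs are filled in.
KNOWN_NAMESPACES = {
    "xlink": "http://www.w3.org/1999/xlink",
    "mml": "http://www.w3.org/1998/Math/MathML",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}

TAG = re.compile(r"<[^!?<>][^<>]*>")  # start/end tags; skips <!...> and <?...?>
PREFIXED_NAME = re.compile(r"(?:^</?|\s)([A-Za-z_][\w.-]*):[A-Za-z_]")
DECLARED = re.compile(r"xmlns:([A-Za-z_][\w.-]*)\s*=")


def add_missing_namespaces(xml_text):
    """Declare every used-but-undeclared prefix on the root element."""
    tags = list(TAG.finditer(xml_text))
    root = next((m for m in tags if not m.group(0).startswith("</")), None)
    if root is None:
        return xml_text

    declared = set(DECLARED.findall(root.group(0)))
    used = set()
    for match in tags:
        used.update(PREFIXED_NAME.findall(match.group(0)))

    # "xml" is predeclared and "xmlns" is reserved, so neither may be bound.
    missing = sorted(used - declared - {"xml", "xmlns"})
    if not missing:
        return xml_text

    declarations = "".join(
        ' xmlns:%s="%s"' % (p, KNOWN_NAMESPACES.get(p, "urn:unbound-prefix:" + p))
        for p in missing
    )
    # Insert the declarations right after the root element's name.
    name_end = root.start() + re.match(r"<[^\s/>]+", root.group(0)).end()
    return xml_text[:name_end] + declarations + xml_text[name_end:]


# Hypothetical input with one href-related and one non-href-related unbound prefix.
broken = (
    '<article><body><ext-link xlink:href="https://example.org">x</ext-link>'
    "<foo:bar>y</foo:bar></body></article>"
)
tree = etree.fromstring(add_missing_namespaces(broken))
```

If this approach were adopted, the repaired text could be produced from `data` before it is wrapped in `io.StringIO` and passed to `pmcxml2bioc`, either always or only as a retry after a ParseError.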