A small number of PMC files use the xlink namespace without declaring it first. For example, the documents include "xlink:href" where the "xlink" prefix hasn't been defined. This breaks the XML parser and gives errors like the one below.
Traceback (most recent call last):
File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 390, in pmcxml2bioc
for pmc_doc in process_pmc_file(source, tag_handlers=tag_handlers):
File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 274, in process_pmc_file
for event, elem in etree.iterparse(source, events=("start", "end", "start-ns", "end-ns")):
File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1222, in iterator
yield from pullparser.read_events()
File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1297, in read_events
raise event
File "/home/jlever/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/xml/etree/ElementTree.py", line 1269, in feed
self._parser.feed(data)
xml.etree.ElementTree.ParseError: unbound prefix: line 12, column 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/convertPMC.py", line 56, in <module>
for bioc_doc in pmcxml2bioc(io.StringIO(data)):
File "/projects/jlever/github/biotext/src/bioconverters/pmcxml.py", line 450, in pmcxml2bioc
raise RuntimeError("Parsing error in PMC xml file: %s" % source)
RuntimeError: Parsing error in PMC xml file: <_io.StringIO object at 0x7f04d1099c18>
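The failure can be reproduced outside the converter. The snippet below is a minimal sketch: `BROKEN_XML` and `try_parse` are hypothetical names, and the document is a made-up two-line example rather than one of the affected PMC files. It uses the same `iterparse` call that appears in the traceback.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical minimal document: xlink:href is used, but xmlns:xlink is never declared.
BROKEN_XML = (
    "<article>\n"
    '<ext-link xlink:href="http://example.org">link</ext-link>\n'
    "</article>"
)


def try_parse(xml_text):
    """Return None if the text parses, otherwise the ParseError message."""
    try:
        events = ("start", "end", "start-ns", "end-ns")
        for _event, _elem in etree.iterparse(io.StringIO(xml_text), events=events):
            pass
    except etree.ParseError as err:
        return str(err)
    return None


error_message = try_parse(BROKEN_XML)
print(error_message)  # an "unbound prefix" message, as in the traceback above
```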
An initial hacky fix was implemented in 63663fedad3de49a1bbc50e859821a0d4ee328cd and e30c3e9934c91bf79f0476e938bb9f6995f59027. It only tried to fix href-specific cases. This needs to be explored further, as a new file with a non-href-related case has appeared.
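One possible more general direction, sketched here and not taken from the commits above, is to pre-process the raw XML text and declare any prefix that is used but never declared on the root element. Everything in this sketch is an assumption: `declare_missing_prefixes` and `KNOWN_NAMESPACES` are hypothetical names, the `urn:unbound-prefix:` placeholder URI is invented for prefixes with no known namespace, and the regex-based scan is approximate (it ignores namespace scoping and would be confused by `>` inside attribute values or by tags inside comments).

```python
import re
import xml.etree.ElementTree as etree

# Well-known namespace URIs; any other prefix gets a placeholder URI (an assumption).
KNOWN_NAMESPACES = {
    "xlink": "http://www.w3.org/1999/xlink",
    "mml": "http://www.w3.org/1998/Math/MathML",
}

START_TAG = re.compile(r"<[A-Za-z_][^>]*>")          # start tags and empty-element tags
QUOTED = re.compile(r'"[^"]*"|\'[^\']*\'')           # attribute values
PREFIXED_NAME = re.compile(r"([A-Za-z_][\w.-]*):[A-Za-z_]")
DECLARED = re.compile(r"xmlns:([A-Za-z_][\w.-]*)\s*=")


def declare_missing_prefixes(xml_text):
    """Return xml_text with undeclared prefixes declared on the root element."""
    declared = set(DECLARED.findall(xml_text)) | {"xml", "xmlns"}
    used = set()
    for tag in START_TAG.findall(xml_text):
        # Blank out attribute values so URLs such as "mailto:x" are not mistaken for prefixes.
        bare = QUOTED.sub('""', tag)
        used.update(PREFIXED_NAME.findall(bare))
    missing = sorted(used - declared)
    if not missing:
        return xml_text
    declarations = "".join(
        ' xmlns:%s="%s"' % (p, KNOWN_NAMESPACES.get(p, "urn:unbound-prefix:" + p))
        for p in missing
    )
    # The first "<name" is the root element; "<?xml" and "<!DOCTYPE" do not match.
    root = re.search(r"<[A-Za-z_][\w.:-]*", xml_text)
    return xml_text[: root.end()] + declarations + xml_text[root.end():]


broken = '<article><ext-link xlink:href="http://example.org/a:b">x</ext-link></article>'
fixed_root = etree.fromstring(declare_missing_prefixes(broken))
```

To avoid touching well-formed files, a step like this could be applied only after a `ParseError` whose message mentions an unbound prefix, followed by a single retry of the parse.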