Open sophiatabchouri opened 4 years ago
Each parser has a detect() method to determine whether it should be the one to parse a given file. My guess is that the US patent XML parser isn't registering your file. Note a comment left in its detect() method:
if b'us-patent-grant' in fstring:
    return True
# TODO: Other DTDs
So you probably have to make a subclass of UsptoXmlReader and override the detect() method to accept your file, then pass an instance of that subclass into the readers parameter of Document.from_file().
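A minimal sketch of that suggestion, not the library's own code. The class name RelaxedUsptoXmlReader and the extra marker b'us-patent-application' are assumptions for illustration, so check which root element your file really uses. The detect(fstring, fname=None) signature is assumed to match the library's, and the stand-in base class is there only so the snippet runs without chemdataextractor installed.

```python
try:
    from chemdataextractor.reader import UsptoXmlReader
except ImportError:
    # Stand-in so the sketch runs without the library installed.
    # It copies only the check quoted above.
    class UsptoXmlReader(object):
        def detect(self, fstring, fname=None):
            return b'us-patent-grant' in fstring


class RelaxedUsptoXmlReader(UsptoXmlReader):
    """Hypothetical subclass that also accepts other patent markers."""

    # Assumed markers; replace with whatever your file actually contains.
    extra_markers = (b'us-patent-application',)

    def detect(self, fstring, fname=None):
        # Keep the parent's behaviour, then fall back to the extra markers.
        if super().detect(fstring, fname=fname):
            return True
        return any(marker in fstring for marker in self.extra_markers)


def parse_patent(path):
    """Parse `path`, offering only the relaxed reader to Document.from_file()."""
    from chemdataextractor import Document  # lazy import; needs the library
    with open(path, 'rb') as f:
        return Document.from_file(f, readers=[RelaxedUsptoXmlReader()])
```

Keep in mind that detect() only decides which reader gets picked. Parsing may still fail if the reader's root selector doesn't match the file.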
I am trying to parse an XML file using the generic XMLReader and I am also getting this error. When I use the function lxml.etree.fromstring directly, it parses fine. My XML isn't a US patent, so I can't use the specific reader for this.
It seems that when I change the root_css query from html to :root, my document can be successfully parsed.
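A rough illustration of why that change likely helps, assuming the traceback line `self._css(self.root_css, root)[0]` is the failing step: selecting html in a document with no html element gives an empty list, and indexing it raises IndexError, while the document root always exists. The first part uses only the standard library as an analogy (it is not the library's CSS machinery), and the sample XML is made up. The names first_match, make_root_reader and RootXmlReader are hypothetical, and XmlReader is assumed to be importable from chemdataextractor.reader.

```python
import xml.etree.ElementTree as ET


def first_match(xml_bytes, tag=None):
    """Return the first element named `tag`, or the root if tag is None."""
    root = ET.fromstring(xml_bytes)
    if tag is None:
        return root  # comparable to selecting ':root'
    # Comparable to selecting 'html': raises IndexError when nothing matches.
    return list(root.iter(tag))[0]


sample = b"<record><title>Example</title></record>"  # made-up non-HTML XML


def make_root_reader():
    """Build an XmlReader variant that selects the document root.

    Needs chemdataextractor; imported lazily so the rest runs without it.
    """
    from chemdataextractor.reader import XmlReader

    class RootXmlReader(XmlReader):  # hypothetical name
        root_css = ':root'

    return RootXmlReader()
```

The reader returned by make_root_reader() could then be passed as readers=[...] to Document.from_file().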
We had similar issues: valid .xml but an IndexError on parsing. Inspired by the unit tests, we wrote a manual parser for PMC (NlmXmlReader); for other formats you can change the reader to suit your use case. See here: http://chemdataextractor.org/docs/reading
import io
from chemdataextractor import Document
from chemdataextractor.reader import NlmXmlReader

def read_xml_file(fname: str) -> Document:
    """Read an XML file manually with NlmXmlReader."""
    r = NlmXmlReader()
    # Read the raw bytes; readstring() parses them into a Document
    with io.open(fname, 'rb') as body:
        content = body.read()
    return r.readstring(content)

fname = 'Your/Path/file.xml'
doc = read_xml_file(fname=fname)
USpatenttest.xml.zip
I'm having trouble reading in this XML file with the generic XMLReader. It was downloaded from the WIPO PATENTSCOPE site.
I run:
from chemdataextractor import Document
f = open('USpatenttest.xml', 'rb')
doc = Document.from_file(f)
And I get the error
File "/home/ubuntu/miniconda3/envs/reverie_env/lib/python3.6/site-packages/chemdataextractor/reader/markup.py", line 208, in parse
    root = self._css(self.root_css, root)[0]
IndexError: list index out of range
Any advice is greatly appreciated! Thanks