Unable to read in xml file

mcs07 / ChemDataExtractor

Automatically extract chemical information from scientific documents

http://chemdataextractor.org

MIT License

Unable to read in xml file #33

Open sophiatabchouri opened 4 years ago

sophiatabchouri commented 4 years ago

USpatenttest.xml.zip

I'm having trouble reading in this XML file with the generic XMLReader. It was downloaded from the WIPO PATENTSCOPE site.

I run:

        from chemdataextractor import Document
        f = open('USpatenttest.xml', 'rb')
        doc = Document.from_file(f)

And I get the error:

        File "/home/ubuntu/miniconda3/envs/reverie_env/lib/python3.6/site-packages/chemdataextractor/reader/markup.py", line 208, in parse
            root = self._css(self.root_css, root)[0]
        IndexError: list index out of range

Any advice is greatly appreciated! Thanks

maddenfederico commented 4 years ago

Each parser has a detect() method that determines whether it should be the one to parse a given file. My guess is that the US patent XML parser isn't recognizing your file. Note the comment left in its detect() method:

        if b'us-patent-grant' in fstring:
            return True
        # TODO: Other DTDs

So you probably have to make a subclass of UsptoXmlReader and override the detect() method to accept your file, then pass that subclass into the readers parameter of Document.from_file().
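For anyone landing here, this is a minimal sketch of the detect() override pattern. To keep it runnable without the library installed, it uses stand-in classes rather than ChemDataExtractor's own: PatentReaderStandIn, PermissivePatentReader, first_matching_reader and the b'my-patent-root' marker are all hypothetical names made up for illustration.

```python
class PatentReaderStandIn:
    """Stand-in for a reader class; NOT ChemDataExtractor's UsptoXmlReader."""

    def detect(self, fstring, fname=None):
        # Mirrors the quoted check: only claim files carrying this marker.
        return b'us-patent-grant' in fstring

    def readstring(self, fstring):
        # Placeholder for real parsing; just reports which reader ran.
        return type(self).__name__


class PermissivePatentReader(PatentReaderStandIn):
    """Subclass that also accepts a second, hypothetical marker."""

    # Hypothetical marker; inspect your own file to find a real one.
    EXTRA_MARKER = b'my-patent-root'

    def detect(self, fstring, fname=None):
        if self.EXTRA_MARKER in fstring:
            return True
        # Otherwise fall back to the parent's check.
        return super().detect(fstring, fname=fname)


def first_matching_reader(fstring, readers, fname=None):
    """Use the first reader whose detect() accepts the input."""
    for reader in readers:
        if reader.detect(fstring, fname=fname):
            return reader.readstring(fstring)
    raise ValueError('Unable to read document')


sample = b'<my-patent-root><claims/></my-patent-root>'
result = first_matching_reader(
    sample, [PatentReaderStandIn(), PermissivePatentReader()]
)
```

With the real library, the subclass would presumably derive from UsptoXmlReader, and an instance of it would go into the readers argument of Document.from_file(), as suggested above.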

lameturkey commented 4 years ago

I am trying to parse an XML file using the generic XMLReader and I am getting this error too. When I use lxml.etree.fromstring directly, the file parses fine. My XML isn't a US patent, so I can't use the patent-specific reader.

It seems that when I change the root_css query from html to :root, my document parses successfully.
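A rough illustration of why that change helps, assuming the reader looks up an html element and takes the first match (which is what the traceback line suggests). This sketch uses the standard library's ElementTree instead of the library's lxml and CSS selector machinery, and the sample XML is made up.

```python
import xml.etree.ElementTree as ET

# Made-up XML whose root element is not <html>.
xml_bytes = b'<article><body><p>Benzene melts at 5.5 C.</p></body></article>'
root = ET.fromstring(xml_bytes)

# Search the whole tree for <html> elements, as an `html` selector would.
html_matches = [el for el in root.iter() if el.tag == 'html']

# Taking the first match of an empty result reproduces the reported error.
try:
    html_matches[0]
    outcome = 'found html'
except IndexError as exc:
    outcome = 'IndexError: ' + str(exc)

# The document root always exists, whatever its tag is,
# which is why selecting the root directly succeeds.
root_tag = root.tag
```

Here html_matches is empty because the document has no html element, while the root element is always available.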

fmoorhof commented 1 year ago

We had similar issues: valid XML, but an IndexError on parsing. Inspired by the unit tests, we wrote a small manual reader function for PMC files using NlmXmlReader. For other formats, you can swap in the reader that fits your use case (see http://chemdataextractor.org/docs/reading):

import io

from chemdataextractor.reader import NlmXmlReader


def read_xml_file(fname: str):
    """Read an XML file with NlmXmlReader and return the parsed document."""
    reader = NlmXmlReader()
    # Open in binary mode: the raw bytes are passed to readstring().
    with io.open(fname, 'rb') as f:
        content = f.read()
    return reader.readstring(content)


fname = 'Your/Path/file.xml'
doc = read_xml_file(fname=fname)