titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Parsers cannot read the xml file. #40

Closed grabear closed 7 years ago

grabear commented 7 years ago

Below I've copied my Python session. I'm trying to parse MEDLINE data. I've tried both your pubmed and medline parsers on the machine listed below as well as on an Ubuntu server, with the same error. I've also generated a file using the R programming language; if you are familiar with it, the package I used is called easyPubMed, and I used its batch_pubmed_download() function.

Anyway, I'd really like to use your code, especially since it links authors with their affiliated institutions. I'm new to XML parsing, so I have no idea what I'm doing in that respect.

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32

>>> import pubmed_parser as pp
>>> pp.parse_pubmed_xml('C:\\Users\\Work\\Downloads\\medline16n0902.xml')

Error: it was not able to read a path, a file-like object, or a string as an XML
Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\site-packages\pubmed_parser-0.1-py3.6.egg\pubmed_parser\utils.py", line 14, in read_xml
    tree = etree.parse(path)
  File "src\lxml\lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81101)
  File "src\lxml\parser.pxi", line 1811, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:117832)
  File "src\lxml\parser.pxi", line 1837, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:118179)
  File "src\lxml\parser.pxi", line 1741, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:117091)
  File "src\lxml\parser.pxi", line 1138, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:111637)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
  File "src\lxml\parser.pxi", line 633, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105612)
OSError: Error reading file 'medline16n0902.xml': failed to load external entity "medline16n0902.xml"
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Program Files\Python36\lib\site-packages\pubmed_parser-0.1-py3.6.egg\pubmed_parser\medline_parser.py", line 354, in parse_medline_xml
    tree = read_xml(path)
  File "C:\Program Files\Python36\lib\site-packages\pubmed_parser-0.1-py3.6.egg\pubmed_parser\utils.py", line 17, in read_xml
    tree = etree.fromstring(path)
  File "src\lxml\lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src\lxml\lxml.etree.c:78994)
  File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:118325)
  File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:116883)
  File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:110870)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105093)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106801)
  File "src\lxml\parser.pxi", line 635, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:105655)
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

titipata commented 7 years ago

Hi @robear22890, thanks for the report and sorry for the late reply! It seems the problem comes from reading the XML file (etree.fromstring(path)). Pubmed parser uses this snippet to read XML files. Can you quickly check whether lxml can read the example file, or the file you are having a problem with?

From the error, it also seems you are using the wrong function to parse MEDLINE XML. For MEDLINE files, you have to use the parse_medline_xml function instead of parse_pubmed_xml; parse_pubmed_xml is for PubMed Open-Access subset XML files. Let me know if this solves the problem.
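If it is unclear which kind of file you have, a quick look at the root element can help. The sketch below is only an illustration, not part of pubmed_parser: `root_tag`, `suggest_parser`, and the `PARSER_FOR_ROOT` mapping are hypothetical names, and the root tags listed are an assumption about how MEDLINE and Open-Access files are typically structured. It uses the standard library's `xml.etree.ElementTree` instead of lxml so it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from root element name to the pubmed_parser function to try.
# The tag names are an assumption about typical files, not taken from the library.
PARSER_FOR_ROOT = {
    "MedlineCitationSet": "parse_medline_xml",  # assumed root of older MEDLINE files
    "PubmedArticleSet": "parse_medline_xml",    # assumed root of newer MEDLINE files
    "article": "parse_pubmed_xml",              # assumed root of Open-Access files
}


def root_tag(path):
    """Return the root element's tag, reading only the start of the file."""
    with open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("start",)):
            # Drop a "{namespace}" prefix if the document declares one.
            return elem.tag.rsplit("}", 1)[-1]
    return None


def suggest_parser(path):
    """Return the suggested function name, or None if the root tag is unknown."""
    return PARSER_FOR_ROOT.get(root_tag(path))
```

With pubmed_parser installed, the returned name could then be looked up on the module, for example `getattr(pp, suggest_parser(path))(path)`, once `suggest_parser` returns something other than `None`.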

daniel-acuna commented 7 years ago

@robear22890 @titipata It seems the problem is that it cannot find the file. As @titipata mentions, pubmed_parser first tries to read the given string as a file path and, if that fails, tries to parse it as an XML string. So it first fails to open the file (the OSError in the first traceback) and then fails to parse the path itself as XML (the XMLSyntaxError in the second). Can you please check that the file exists at that location?
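To make the two-step failure concrete, here is a minimal sketch of the same fallback pattern. It is a simplified stand-in, not the library's actual `read_xml`: `read_xml_sketch` and `diagnose` are hypothetical names, and it uses the standard library's `xml.etree.ElementTree` rather than lxml, so the exception types differ slightly (`ParseError` instead of `XMLSyntaxError`).

```python
import os
import xml.etree.ElementTree as ET


def read_xml_sketch(path):
    """Try `path` as a file first, then fall back to parsing it as XML text."""
    try:
        return ET.parse(path).getroot()
    except OSError:
        # The file could not be opened, so the argument is treated as XML.
        # For a plain path string this raises a parse error, which hides
        # the real cause (the missing file).
        return ET.fromstring(path)


def diagnose(path):
    """Report whether the path is missing, malformed, or readable."""
    if not os.path.isfile(path):
        return "missing file: " + os.path.abspath(path)
    try:
        read_xml_sketch(path)
    except ET.ParseError as exc:
        return "file exists but is not well-formed XML: %s" % exc
    return "ok"
```

Running `diagnose` on the path from the report should tell you whether the file is actually where Python is looking for it before any parser is involved.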

daniel-acuna commented 7 years ago

I am closing this for now.