titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

XMLSyntaxError #55

Closed deakkon closed 6 years ago

deakkon commented 6 years ago

In [41]: pp.parse_medline_xml('/home/docClass/files/pubmed/pubmed18n1040.xml.gz')
Error: it was not able to read a path, a file-like object, or a string as an XML
File "", line 1
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Source: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed18n1040.xml.gz

daniel-acuna commented 6 years ago

I don't think parse_medline_xml parses .gz files. You need to uncompress it first.

deakkon commented 6 years ago

Hi, are you sure? For example, pp.parse_medline_xml('pubmed18n0364.xml.gz') (source: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed18n0364.xml.gz) gives back a list of dicts.

daniel-acuna commented 6 years ago

Can you try uncompressing it first? The file works for me.
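A minimal sketch of that diagnostic step, using only the Python standard library. The helper name gunzip is hypothetical and is not part of pubmed_parser. If the archive is damaged, the decompression itself should raise an error, which points at the download rather than the parser.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a .gz file and return the path of the decompressed copy.

    Hypothetical helper for illustration; not part of pubmed_parser.
    """
    src = Path(src)
    # By default, write next to the source with the trailing ".gz" dropped.
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest


# Usage (assumes pubmed_parser is installed and imported as pp):
# xml_path = gunzip("pubmed18n1040.xml.gz")
# records = pp.parse_medline_xml(str(xml_path))
```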

deakkon commented 6 years ago

Sorry, my mistake! The issue was that the file was not properly downloaded (I'm performing a batch download and no error was printed).

I redownloaded it manually, and it now parses directly from the .gz path, without uncompressing it first.

Best, J.
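For anyone hitting the same error after a batch download: one way to catch incomplete files before parsing is to check that each archive decompresses cleanly to the end. The sketch below assumes only the standard library; is_valid_gzip is a hypothetical helper, not something pubmed_parser provides.

```python
import gzip
import zlib
from pathlib import Path


def is_valid_gzip(path, chunk_size=1 << 20):
    """Return True if `path` is a non-empty gzip file that decompresses fully.

    Hypothetical helper for illustration; not part of pubmed_parser.
    """
    if Path(path).stat().st_size == 0:
        return False  # an empty file is a failed download, not a valid archive
    try:
        with gzip.open(path, "rb") as fh:
            # Read to the end so truncation and CRC errors surface.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers "not a gzipped file" and CRC failures,
        # EOFError covers truncated streams, zlib.error covers corrupt data.
        return False
    return True
```

Running this over the downloaded files and redownloading any that return False avoids passing a partial file to parse_medline_xml.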