"can only parse strings" while reading PMC nxml

titipata / pubmed_parser

:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

http://titipata.github.io/pubmed_parser/

MIT License


"can only parse strings" while reading PMC nxml #43

Closed deakkon closed 4 years ago

deakkon commented 7 years ago

For several PMC files I get an error while reading in the content. The affected files are listed below:

PMC4569614.nxml 
PMC5362956.nxml 
PMC4162892.nxml 
PMC4569628.nxml 
PMC5348996.nxml 
PMC5362810.nxml 
PMC5352161.nxml 
PMC4522714.nxml 
PMC5352154.nxml 
PMC5363022.nxml 
PMC4522719.nxml 
PMC5346358.nxml

P.S. I'll post the traceback when I run it again.

deakkon commented 7 years ago

Traceback:

Error: it was not able to read a path, a file-like object, or a string as an XML
Traceback (most recent call last):
  File "solr_pipeline.py", line 253, in getArticle
    items = pp.parse_pubmed_xml(content)
  File "build/bdist.linux-x86_64/egg/pubmed_parser/pubmed_oa_parser.py", line 81, in parse_pubmed_xml
    tree = read_xml(path)
  File "build/bdist.linux-x86_64/egg/pubmed_parser/utils.py", line 17, in read_xml
    tree = etree.fromstring(path)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77697)
  File "src/lxml/parser.pxi", line 1818, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116475)
ValueError: can only parse strings

daniel-acuna commented 7 years ago

It seems that read_xml tries to read the path as a file path first and then tries it as a string. I am wondering why it failed to parse it as a path in the first place.
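The try-path-then-string pattern described above can be sketched as follows. This is only an illustration, not the library's actual read_xml: it uses the standard-library xml.etree.ElementTree as a stand-in for lxml, and the name read_xml_with_fallback is hypothetical. It shows one way the error from the final attempt can hide whatever made the first attempt fail, and how exception chaining keeps both visible.

```python
import io
import xml.etree.ElementTree as ET


def read_xml_with_fallback(source):
    """Try `source` as a path or file object first, then as an XML string.

    Hypothetical helper for illustration; stdlib ElementTree stands in for lxml.
    """
    try:
        # Accepts a filename or an open file object.
        return ET.parse(source).getroot()
    except Exception as first_error:
        try:
            # Only accepts str/bytes, so a file object fails here as well.
            return ET.fromstring(source)
        except Exception as second_error:
            # Chain the errors so the original failure is not lost.
            raise second_error from first_error


if __name__ == "__main__":
    malformed = io.BytesIO(b"<a><b></a>")  # mismatched tag
    try:
        read_xml_with_fallback(malformed)
    except Exception as err:
        print("final error:   ", repr(err))
        print("original error:", repr(err.__cause__))
```

If something similar happens here, the "can only parse strings" message would come from the fallback branch, and the reason the first attempt failed would have to be read from the chained exception or by calling the parser on the path directly.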

Do you have a code snippet reproducing the error?

deakkon commented 7 years ago

Code snippet:

def getArticle(file):
    # print file
    _, extension = splitext(file)
    items = None

    try:
        if extension == '.gz':
            content = gzip.open(file)
            items = pp.parse_medline_xml(content)
        elif extension == '.nxml':
            content = open(file, 'rb')
            items = pp.parse_pubmed_xml(content)
            if isinstance(items, dict):
                temp = [items]
                items = temp

    except Exception as e:
        logger_deubg.debug('{}\n{}\n{}\n{}'.format('getArticle', file, e, traceback.print_exc()))

    return items

P.S. My pipeline is set up so that all PubMed articles are read from the gz files, while the PMC articles are read from the nxml files. Without this if/elif block it didn't work at all.

daniel-acuna commented 7 years ago

You have to pass the file path rather than a file pointer. So, call it as pp.parse_pubmed_xml(file).
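A minimal sketch of that suggestion, assuming both parser functions accept a path. The helper name parse_article is hypothetical, and the parsers are passed in as arguments so the dispatch logic can be tried without pubmed_parser installed.

```python
from os.path import splitext


def parse_article(path, parse_nxml, parse_medline):
    """Dispatch on file extension, handing the path itself to the parser.

    `parse_nxml` and `parse_medline` are injected callables (hypothetical
    parameters); no file is opened here.
    """
    _, extension = splitext(path)
    if extension == '.nxml':
        items = parse_nxml(path)
        # Wrap a single article dict so callers always get a list.
        return [items] if isinstance(items, dict) else items
    if extension == '.gz':
        return list(parse_medline(path))
    return None  # unknown extension
```

With the library imported as pp, the call would look like parse_article(file, pp.parse_pubmed_xml, pp.parse_medline_xml), assuming parse_medline_xml also takes a path to the gz file.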

deakkon commented 7 years ago

As I recall, I initially passed the file path (to one of the two functions, depending on the source), but there were issues with that. That's why I implemented this solution, which has worked for all PMC/PubMed files except the ones mentioned above.

I can give it a try if you want, just to rule that out (or confirm it).

titipata commented 7 years ago

Thanks @deakkon and @daniel-acuna! Is this issue solved now?

@deakkon, can you open a pull request with the more reliable read_xml function that you made, or paste the code snippet here?