Closed deakkon closed 4 years ago
Traceback:
Error: it was not able to read a path, a file-like object, or a string as an XML
Traceback (most recent call last):
  File "solr_pipeline.py", line 253, in getArticle
    items = pp.parse_pubmed_xml(content)
  File "build/bdist.linux-x86_64/egg/pubmed_parser/pubmed_oa_parser.py", line 81, in parse_pubmed_xml
    tree = read_xml(path)
  File "build/bdist.linux-x86_64/egg/pubmed_parser/utils.py", line 17, in read_xml
    tree = etree.fromstring(path)
  File "src/lxml/lxml.etree.pyx", line 3213, in lxml.etree.fromstring (src/lxml/lxml.etree.c:77697)
  File "src/lxml/parser.pxi", line 1818, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116475)
ValueError: can only parse strings
It seems that read_xml tries to read its argument as a file path first and then falls back to parsing it as a string. I am wondering why it failed to parse it as a path in the first place.
Do you have a code snippet reproducing the error?
Code snippet:
def getArticle(file):
    #print file
    _, extension = splitext(file)
    items=None
    try:
        if extension == '.gz':
            content=gzip.open(file)
            items = pp.parse_medline_xml(content)
        elif extension == '.nxml':
            content=open(file,'rb')
            items = pp.parse_pubmed_xml(content)
        if isinstance(items, dict):
            temp = [items]
            items=temp
    except Exception as e:
        logger_deubg.debug('{}\n{}\n{}\n{}'.format('getArticle',file,e, traceback.print_exc()))
    return items
P.S. My pipeline is set up so that all PubMed articles are read from the .gz files, while the PMC articles are read from the .nxml files. Without this if/elif block it didn't work at all.
You have to pass the file path rather than a file pointer. So, call it as pp.parse_pubmed_xml(file).
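A minimal sketch of that suggestion, assuming both parser functions accept a file path (the reply above only states this for parse_pubmed_xml). The parsers are passed in as arguments so the sketch runs without pubmed_parser installed; get_article and its parameter names are illustrative and not part of the library.

```python
from os.path import splitext


def get_article(path, parse_medline, parse_pubmed):
    """Dispatch on the file extension and hand the *path* to the parser.

    parse_medline and parse_pubmed are callables that take a path, for
    example pp.parse_medline_xml and pp.parse_pubmed_xml (assumption:
    both accept a path rather than an open file object).
    """
    _, extension = splitext(path)
    if extension == '.gz':
        items = parse_medline(path)
    elif extension == '.nxml':
        items = parse_pubmed(path)
    else:
        # Unknown extension: nothing to parse.
        return None
    # Wrap a single record in a list so callers always receive a list.
    if isinstance(items, dict):
        items = [items]
    return items
```

With pubmed_parser imported as pp, this would be called as get_article(file, pp.parse_medline_xml, pp.parse_pubmed_xml).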
As I recall, I initially passed the file path (to one of the two functions, depending on the source), but there were issues with that. That's why I implemented this solution, which works (or worked) for all PMC/PubMed files besides the ones mentioned above.
I can give it a try if you want just to rule that out (or confirm it).
Thanks @deakkon and @daniel-acuna! Is this issue solved now?
@deakkon, can you open a pull request with the more reliable read_xml function that you made, or paste the code snippet here?
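For illustration only, here is one way a more tolerant reader could look. This is not the function @deakkon wrote (that code is not shown in this thread) and not the library's implementation. It is a sketch that uses the standard-library xml.etree.ElementTree instead of lxml, and read_xml_flexible is a hypothetical name.

```python
import os
import xml.etree.ElementTree as ET


def read_xml_flexible(source):
    """Return the root element for a path, a file-like object, or XML text."""
    # File-like object, e.g. the result of open() or gzip.open().
    if hasattr(source, 'read'):
        return ET.parse(source).getroot()
    # Path to an existing file on disk.
    if isinstance(source, str) and os.path.isfile(source):
        return ET.parse(source).getroot()
    # Otherwise treat the input as XML text (str or bytes).
    if isinstance(source, (str, bytes)):
        return ET.fromstring(source)
    raise TypeError('expected a path, a file-like object, or an XML string')
```

Checking for a read attribute first means an open file handle never reaches the string parser, which is the step that raised "can only parse strings" in the traceback above.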
For several PMC files I get an error while reading in the content. List below:
P.S. I'll post the traceback when I run it again.