titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

It couldn't recognize the xml file I downloaded from pubmed #148

Open wildwhip opened 4 months ago

wildwhip commented 4 months ago

I downloaded an XML file from https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/ and then tried to parse it with pp.parse_pubmed_xml, which raised this error:

---------------------------------------------------------------------------
XPathEvalError                            Traceback (most recent call last)
Cell In[14], line 3
      1 import pubmed_parser as pp
      2 path_xml = pp.list_xml_path("...\xml")
----> 3 pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output
      4 print(pubmed_dict)

File ......\pubmed_parser\pubmed_oa_parser.py:182, in parse_pubmed_xml(path, include_path, nxml)
    179     subjects = ""
    181 # create affiliation dictionary
--> 182 affil_id = tree.xpath(".//aff[@id]/@id")
    183 if len(affil_id) > 0:
    184     affil_id = list(map(str, affil_id))

File src\\lxml\\etree.pyx:2342, in lxml.etree._ElementTree.xpath()

File src\\lxml\\xpath.pxi:342, in lxml.etree.XPathDocumentEvaluator.__call__()

File src\\lxml\\xpath.pxi:210, in lxml.etree._XPathEvaluatorBase._handle_result()

XPathEvalError: Error in xpath expression
Michael-E-Rose commented 2 months ago

Which XML file specifically do you mean? Also, which version are you using?

wildwhip commented 2 months ago

From https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/, the latest version.


Michael-E-Rose commented 2 months ago

All of them, or a particular XML file? The website has several hundred XML files.

titipata commented 2 months ago

I think you might need to use pp.parse_medline_xml instead. pp.parse_pubmed_xml is for the PubMed Open Access corpus.
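As a rough sketch of how to tell the two kinds of file apart before parsing: MEDLINE/PubMed baseline files have a `PubmedArticleSet` root element, while PubMed Open Access articles have an `article` root. The helper names below (`open_xml`, `detect_root_tag`, `suggest_parser`, `parse_any`) are hypothetical and not part of pubmed_parser; only `pp.parse_medline_xml` and `pp.parse_pubmed_xml` come from the library.

```python
import gzip
import xml.etree.ElementTree as ET


def open_xml(path):
    """Open a plain or gzip-compressed XML file in binary mode."""
    return gzip.open(path, "rb") if str(path).endswith(".gz") else open(path, "rb")


def detect_root_tag(source):
    """Return the root element's tag (namespace stripped) from a file object."""
    for _event, elem in ET.iterparse(source, events=("start",)):
        # The first "start" event always belongs to the root element.
        return elem.tag.rsplit("}", 1)[-1]
    return None


def suggest_parser(root_tag):
    """Map a root tag to the name of the matching pubmed_parser function."""
    if root_tag == "PubmedArticleSet":  # MEDLINE / PubMed baseline file
        return "parse_medline_xml"
    if root_tag == "article":  # PubMed Open Access article
        return "parse_pubmed_xml"
    return None


def parse_any(path):
    """Hypothetical convenience wrapper: pick the parser from the root tag."""
    import pubmed_parser as pp  # imported lazily; requires pubmed_parser

    with open_xml(path) as fh:
        name = suggest_parser(detect_root_tag(fh))
    if name is None:
        raise ValueError(f"Unrecognized XML root element in {path}")
    return getattr(pp, name)(path)
```

For a baseline file, `parse_any` would end up calling `pp.parse_medline_xml(path)`, which returns a list of dictionaries (one per article), whereas `pp.parse_pubmed_xml(path)` returns a single dictionary.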

wildwhip commented 2 months ago

> All of them, or a particular XML file? The website has several hundred XML files.

Yes, all of them.

wildwhip commented 2 months ago

> I think you might need to use pp.parse_medline_xml instead. The PubMed one is for the PubMed Open Access corpus.

Where can I find this tool?