titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Inconsistent PMID and DOI when using parse_xml_web for an XML file. #135

Closed JIBSN closed 2 months ago

JIBSN commented 2 months ago

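Describe the bug

parse_xml_web returns a DOI that does not match the requested PMID. For PMID 32145645 the output is: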

{'title': 'Discovery of PROTAC BCL-X',
 'abstract': 'Anti-apoptotic protein BCL-XL plays a key role in tumorigenesis and cancer chemotherapy resistance, rendering it an attractive target for cancer treatment. However, BCL-XL inhibitors such as ABT-263 cannot be safely used in the clinic because platelets solely depend on BCL-XL to maintain their viability. To reduce the on-target platelet toxicity associated with the inhibition of BCL-XL, we designed and synthesized PROTAC BCL-XL degraders that recruit CRBN or VHL E3 ligase because both of these enzymes are poorly expressed in human platelets compared to various cancer cell lines. We confirmed that platelet-toxic BCL-XL/2 dual inhibitor ABT-263 can be converted into platelet-sparing CRBN/VHL-based BCL-XL specific degraders. A number of BCL-XL degraders are more potent in killing cancer cells than their parent compound ABT-263. Specifically, XZ739, a CRBN-dependent BCL-XL degrader, is 20-fold more potent than ABT-263 against MOLT-4 T-ALL cells and has >100-fold selectivity for MOLT-4\xa0cells over human platelets. Our findings further demonstrated the utility of PROTAC technology to achieve tissue selectivity through recruiting differentially expressed E3 ligases.',
 'journal': 'European journal of medicinal chemistry',
 'affiliation': 'Department of Medicinal Chemistry, College of Pharmacy, University of Florida, 1333 Center Drive, Gainesville, FL, 32610, United States.; Department of Pharmacodynamics, College of Pharmacy, University of Florida, 1333 Center Drive, Gainesville, FL, 32610, United States.; Department of Pharmacodynamics, College of Pharmacy, University of Florida, 1333 Center Drive, Gainesville, FL, 32610, United States.; Department of Medicinal Chemistry, College of Pharmacy, University of Florida, 1333 Center Drive, Gainesville, FL, 32610, United States.; Department of Medicinal Chemistry, College of Pharmacy, University of Florida, 1333 Center Drive, Gainesville, FL, 32610, United States.; Department of Pharmacodynamics, College of Pharmacy, University of Florida, 1333 Center Drive, Gainesville, FL, 32610, United States.; Department of Pharmacodynamics, College of Pharmacy, University of Florida, 1333 Center Drive, Gainesville, FL, 32610, United States.; Department of Pharmacodynamics, College of Pharmacy, University of Florida, 1333 Center Drive, Gainesville, FL, 32610, United States. Electronic address: zhoudaohong@cop.ufl.edu.; Department of Medicinal Chemistry, College of Pharmacy, University of Florida, 1333 Center Drive, Gainesville, FL, 32610, United States. Electronic address: zhengg@cop.ufl.edu.',
 'authors': 'Xuan Zhang; Dinesh Thummuri; Xingui Liu; Wanyi Hu; Peiyi Zhang; Sajid Khan; Yaxia Yuan; Daohong Zhou; Guangrong Zheng',
 'keywords': 'D000814:Aniline Compounds;D000970:Antineoplastic Agents;D001792:Blood Platelets;D049109:Cell Proliferation;D002470:Cell Survival;D018360:Crystallography, X-Ray;D004305:Dose-Response Relationship, Drug;D055808:Drug Discovery;D004354:Drug Screening Assays, Antitumor;D006801:Humans;D008958:Models, Molecular;D015394:Molecular Structure;D059748:Proteolysis;D013329:Structure-Activity Relationship;D013449:Sulfonamides;D014407:Tumor Cells, Cultured;D051020:bcl-X Protein',
 'doi': '10.1021/acs.jmedchem.9b01530',
 'year': '2020',
 'language': 'eng',
 'pmid': '32145645'}

To Reproduce

import pubmed_parser as pp

# Fetch and parse the PubMed record for this PMID without saving the raw XML
dict_out = pp.parse_xml_web("32145645", save_xml=False)

Expected behavior

The correct DOI for PMID 32145645 is 10.1016/j.ejmech.2020.112186.


Michael-E-Rose commented 2 months ago

The problem is this loop in parse_pubmed_web_tree():
https://github.com/titipata/pubmed_parser/blob/15c477a579bda06642a436e56c149fbf89546ba6/pubmed_parser/pubmed_web_parser.py#L135-L140

The loop traverses all references (!) and returns the last DOI it finds, so the reported DOI belongs to a cited paper rather than to the article itself.

Instead, the doi field should be read directly from the article's own <ELocationID EIdType="doi" ValidYN="Y"></ELocationID> tag. See https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=32145645 for the XML of the document in question.
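
A minimal sketch of that idea, using only the standard library. This is not the library's code: the function name extract_article_doi and the trimmed-down sample XML are made up for illustration, and the element layout (MedlineCitation/Article/ELocationID for the article, ReferenceList/Reference/ArticleIdList/ArticleId for cited papers) is an assumption based on the efetch output linked above. The pii value in the sample is a placeholder, not the real one.

```python
import xml.etree.ElementTree as ET

# Heavily trimmed stand-in for an efetch response (assumed layout).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>32145645</PMID>
      <Article>
        <ELocationID EIdType="pii" ValidYN="Y">placeholder-pii</ELocationID>
        <ELocationID EIdType="doi" ValidYN="Y">10.1016/j.ejmech.2020.112186</ELocationID>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ReferenceList>
        <Reference>
          <ArticleIdList>
            <ArticleId IdType="doi">10.1021/acs.jmedchem.9b01530</ArticleId>
          </ArticleIdList>
        </Reference>
      </ReferenceList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>
""".strip()


def extract_article_doi(root):
    """Return the article's own DOI, ignoring DOIs of cited references.

    Hypothetical helper: looks only at ELocationID elements that sit
    directly under MedlineCitation/Article and are marked as valid DOIs.
    """
    for eloc in root.findall(".//MedlineCitation/Article/ELocationID"):
        if eloc.get("EIdType") == "doi" and eloc.get("ValidYN", "Y") == "Y":
            return (eloc.text or "").strip()
    return ""  # no DOI recorded for this article


root = ET.fromstring(SAMPLE_XML)

# Taking the last DOI-typed ArticleId anywhere in the tree picks up a reference.
all_article_id_dois = [
    el.text for el in root.iter("ArticleId") if el.get("IdType") == "doi"
]
last_doi_in_tree = all_article_id_dois[-1]

article_doi = extract_article_doi(root)
print(last_doi_in_tree)
print(article_doi)
```

Restricting the search path to the Article element is the key design choice here: the DOIs of cited papers live under ReferenceList and are never visited, so the order of references no longer affects the result.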