titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License
559 stars, 164 forks

parse_pubmed_table() and parse_pubmed_references() returning None #119

Closed (octotus closed 1 month ago)

octotus commented 1 year ago

Describe the bug

I am using pubmed_parser to extract table information from XML files. I tested with two papers: PMCID 535340 and 535341. Both papers have tables; PMC535340 has figures too. My attempts at extracting the tables are unsuccessful.

To Reproduce

import pubmed_parser as pp

file='./PMC000xxxxxx/PMC535340.xml'
data_xml_parse = pp.parse_pubmed_xml(file)  # this works well
data_figure_caption = pp.parse_pubmed_caption(file)  # this works well too; so far so good
data_table_content = pp.parse_pubmed_table(file)  # this fails (returns None); toggling the option to return XML table values does not help
data_references = pp.parse_pubmed_references(file)  # this fails too (returns None)

Expected behavior

I expected pp.parse_pubmed_table(file) to return the list of tables with their content, and pp.parse_pubmed_references(file) to return a list of references.

Screenshots

No error message is produced when calling these functions; both simply return None.
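Because the calls fail without any message, one way to narrow this down is to check whether the XML actually contains the elements a parser would look for. The sketch below is not part of pubmed_parser: `count_tags` is a hypothetical helper, the sample document is made up, and the tag names `table-wrap`, `ref-list` and `ref` are assumed from the JATS/NLM article markup that PMC files generally follow.

```python
import xml.etree.ElementTree as ET


def count_tags(xml_text, tags=("table-wrap", "ref-list", "ref")):
    """Count how often each tag occurs anywhere in the document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # iter(tag) walks the whole tree, so nesting depth does not matter
    return {tag: sum(1 for _ in root.iter(tag)) for tag in tags}


# Tiny made-up document standing in for a real PMC file
sample = """<article>
  <body><sec><table-wrap><label>Table 1</label></table-wrap></sec></body>
  <back><ref-list><ref id="B1"/><ref id="B2"/></ref-list></back>
</article>"""

print(count_tags(sample))  # {'table-wrap': 1, 'ref-list': 1, 'ref': 2}
```

If the counts for a real file are non-zero while the parser still returns None, the elements are probably sitting at a location the parser does not search. For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, assuming the standard-library parser accepts the file's DOCTYPE and entities.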

Dependencies

Windows 11, Jupyter, Python 3.9, encoding = utf-8

Additional context

None.

File added here. PMC535340.zip

Michael-E-Rose commented 1 month ago

I confirm the bug still exists with pubmed_parser 4.0.

I'm not sure what exactly in https://github.com/titipata/pubmed_parser/blob/master/pubmed_parser/pubmed_oa_parser.py#L507 triggers this. Maybe the XML path changed. In any case, I think we should use more try-except clauses instead of if-else checks.
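To make the try-except versus if-else point concrete, here is a minimal sketch of the two styles for reading an optional child element. It uses the standard-library ElementTree rather than the library's own code, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def label_if_else(table):
    """Look-before-you-leap: test for the node before using it."""
    node = table.find("label")
    if node is not None:
        return node.text
    else:
        return None


def label_try_except(table):
    """Try the access directly and handle the failure case."""
    try:
        # find() returns None when there is no <label>; None.text raises AttributeError
        return table.find("label").text
    except AttributeError:
        return None


with_label = ET.fromstring("<table-wrap><label>Table 1</label></table-wrap>")
without_label = ET.fromstring("<table-wrap/>")

print(label_if_else(with_label), label_try_except(with_label))
print(label_if_else(without_label), label_try_except(without_label))
```

Both variants behave the same here. The try-except form mainly pays off when several chained lookups can fail, since one handler covers all of them; it could also log or re-raise with the file path, which would avoid the silent None reported above.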