titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

parse_pubmed_caption() failing on some papers #125

Closed by oblodgett 1 month ago

oblodgett commented 1 year ago

When parsing certain files for image captions:

import pubmed_parser as pp

pubmed_figuredata = pp.parse_pubmed_caption("PMC9539395.nxml")

The call fails with the following error:

_process.py     
Traceback (most recent call last):
  File "test_process.py", line 18, in <module>
    pubmed_figuredata = pp.parse_pubmed_caption(paper_path)
  File "venv_sentence_parsing/lib/python3.8/site-packages/pubmed_parser/pubmed_oa_parser.py", line 425, in parse_pubmed_caption
    fig_label = stringify_children(fig.find("label"))
  File "venv_sentence_parsing/lib/python3.8/site-packages/pubmed_parser/utils.py", line 51, in stringify_children
    [node.text]
AttributeError: 'NoneType' object has no attribute 'text'

I would expect this to parse correctly. Also, when parsing image captions, the subpoints under the caption label are not available in the output; see the same paper, PMC9539395, as an example.
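
The traceback shows that fig.find("label") returned None and that stringify_children then tried to read node.text on it. That would happen for any <fig> element that has no <label> child, which is probably the case for at least one figure in this paper. Below is a minimal sketch of a None-safe workaround. It uses the standard-library xml.etree.ElementTree on a made-up XML fragment rather than pubmed_parser itself; safe_text and SAMPLE are hypothetical names for illustration, not part of the library and not necessarily how the issue was fixed upstream.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for an .nxml fragment: the second <fig> has no <label>.
SAMPLE = """
<body>
  <fig id="f1"><label>Figure 1</label><caption><p>First <italic>caption</italic>.</p></caption></fig>
  <fig id="f2"><caption><p>Unlabelled figure.</p></caption></fig>
</body>
""".strip()


def safe_text(node):
    """Hypothetical helper: all text under `node`, or "" when `node` is None."""
    if node is None:
        # find() returns None when the child element is absent.
        return ""
    # itertext() walks the element and its descendants in document order.
    return "".join(node.itertext()).strip()


root = ET.fromstring(SAMPLE)
figures = [
    {
        "fig_id": fig.get("id"),
        "fig_label": safe_text(fig.find("label")),
        "fig_caption": safe_text(fig.find("caption")),
    }
    for fig in root.iter("fig")
]

for entry in figures:
    print(entry)
```

With this guard, a figure without a label yields an empty fig_label instead of raising AttributeError. Whether an empty string or None is the better placeholder is a design choice for the maintainers.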