titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

#125 Create exception for empty figures and parse figure sub-points. #139

Closed nils-herrmann closed 1 month ago

nils-herrmann commented 2 months ago

Create exception for empty figures and parse figure sub-points. Additionally, add one test for every parsed field of a figure.

nils-herrmann commented 2 months ago

The output for the reported article in the bug now looks like this:

[{'pmid': '36094679',
  'pmc': '9539395',
  'fig_caption': 'Aerosol delivery of sACE22.v2.4‐IgG1 alleviates lung injury and improves survival of SARS‐CoV‐2 gamma variant infected K18‐hACE2 transgenic mice \n\n',
  'fig_id': 'emmm202216109-fig-0001',
  'fig_label': 'Figure 1',
  'fig_subpoints': [('A', 'K18‐hACE2 transgenic mice were inoculated with SARS‐CoV‐2 isolate /Japan/TY7‐503/2021 (gamma variant) at 1\u2009×\u2009104 PFU. sACE22.v2.4‐IgG1 (7.5\u2009ml at 8.3\u2009mg/ml in PBS) was delivered to the mice by a nebulizer in 25\u2009min at 12\u2009h, 48\u2009h, and 84\u2009h postinoculation. PBS was aerosol delivered as control.'),
                    ('B, C', 'Survival (B) and weight loss (C). N\u2009=\u200910 mice for each group. The P‐value of the survival curve by the Gehan–Breslow–Wilcoxon test is shown. Error bars for mouse weight are centered on the mean and show SEM.'),
                    ...},
 ...]

I noticed that the captions and sub-points contain Unicode escape sequences for the thin space (\u2009, e.g. in "12\u2009h" and "N\u2009=\u200910"). Should we leave them?
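
For context, \u2009 is U+2009 THIN SPACE. The escapes most likely show up only because the list was printed through repr(), which escapes space separators other than the ASCII space; the strings themselves hold the real character. Below is a minimal sketch of that behaviour, plus one possible way to normalize such spaces if we decide not to leave them. `normalize_spaces` is a hypothetical helper, not part of pubmed_parser, and the sample string is shortened from the sub-points in the output.

```python
import unicodedata

# Sample text shortened from the sub-points in the output (assumption: the
# parser returns the real thin-space character, not a literal backslash-u).
caption = "12\u2009h, 48\u2009h; N\u2009=\u200910 mice"

# repr() is what printing a list of dicts uses for each string; it escapes
# the thin space because Python treats it as non-printable.
print(repr(caption))

# Printing the string directly emits the actual thin-space character.
print(caption)


def normalize_spaces(text: str) -> str:
    """Hypothetical helper: map every Unicode space separator (category Zs)
    to a plain ASCII space, leaving all other characters untouched."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in text
    )


print(normalize_spaces(caption))
```

Leaving the strings as they are keeps the text faithful to the source XML; normalizing makes downstream tokenization and string comparison simpler. This sketch only touches space separators, so characters such as the hyphen (‐) and the multiplication sign (×) are preserved.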