titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License
559 stars 164 forks

parse_pubmed_paragraph() function seems to miss some paragraphs sometimes. #111

Open zhao-zy15 opened 2 years ago

zhao-zy15 commented 2 years ago

Describe the bug

I was preparing a dataset that requires paragraph-level parsing of PMC_OA articles. However, when I parse this article (PMC ID PMC8075838), parse_pubmed_paragraph() returns only 7 paragraphs, although the article actually contains 12. Any ideas why? (I have checked the original XML file on my laptop, and no paragraphs are missing from it.)
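To make the mismatch easier to reproduce, here is a minimal sketch that counts paragraphs directly in the XML using only the standard library, independent of pubmed_parser. The helper name `count_body_paragraphs` is hypothetical. It assumes the article follows the usual JATS layout, with `<p>` elements under `<body>` and bibliography citations marked as `<xref ref-type="bibr">`. It also reports how many paragraphs contain at least one such citation, on the unverified guess that uncited paragraphs may be the ones being dropped.

```python
import xml.etree.ElementTree as ET


def count_body_paragraphs(xml_text):
    """Return (total <p> elements in <body>, those containing a bibliography <xref>).

    Hypothetical diagnostic helper, not part of pubmed_parser.
    Assumes JATS-style markup without a default XML namespace.
    """
    root = ET.fromstring(xml_text)
    body = root.find(".//body")
    if body is None:
        return 0, 0
    # iter("p") walks every <p> below <body>, including ones nested in
    # sections, boxes, or table captions.
    paragraphs = list(body.iter("p"))
    cited = [
        p for p in paragraphs
        if any(x.get("ref-type") == "bibr" for x in p.iter("xref"))
    ]
    return len(paragraphs), len(cited)


# Tiny made-up sample, only to show the shape of the result.
sample = (
    "<article><front/><body><sec>"
    '<p>Intro <xref ref-type="bibr" rid="B1">1</xref>.</p>'
    "<p>No citation here.</p>"
    "</sec></body></article>"
)
print(count_body_paragraphs(sample))  # (2, 1)
```

For a real file, read it with `open(path, encoding="utf-8").read()` and pass the text in. If the second number matches what parse_pubmed_paragraph() returns, it may be worth checking whether the installed version offers an option to include paragraphs without citations.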
