titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Extracting sections at a time rather than paragraphs #97

Closed (krishmatta closed this issue 3 years ago)

krishmatta commented 3 years ago

Hey! Thank you for the project. Is there any way I can extract an entire section (e.g. the entire Introduction section) rather than paragraph by paragraph (which may include "subsections")?

titipata commented 3 years ago

Hi @krishxmatta, thanks so much for opening this issue. Yes, this is definitely possible. After you parse a PubMed OA article into paragraphs, you can group the resulting list of paragraph dictionaries by the value of their section key. I'd try something like the following:

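from itertools import groupby
from operator import itemgetter

# Materialize each group while iterating: a group iterator is invalidated once groupby advances.
# groupby only merges consecutive items, so keep the paragraphs in document order.
grouped_paragraphs = [(section, list(group))
                      for section, group in groupby(paragraphs, key=itemgetter('section'))]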

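You may still need to check that the parser gives consistent values for the section key. For example, paragraphs inside a subsection may carry the subsection's title rather than the parent section's, in which case they would end up in a separate group.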
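
To make the grouping idea concrete, here is a minimal sketch that merges consecutive paragraphs sharing a section into one block of text. The sample list and the `join_sections` helper are hypothetical and only stand in for the parser output; the sketch assumes each paragraph is a dict with `section` and `text` keys, so adjust the key names if your output differs.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample standing in for the parser output (real dicts may carry more keys).
paragraphs = [
    {"section": "Introduction", "text": "First intro paragraph."},
    {"section": "Introduction", "text": "Second intro paragraph."},
    {"section": "Methods", "text": "Only methods paragraph."},
]


def join_sections(paragraphs):
    """Return (section, full_text) pairs, merging consecutive paragraphs with the same section."""
    return [
        # Each group is consumed inside the comprehension, before groupby advances.
        (section, "\n\n".join(p["text"] for p in group))
        for section, group in groupby(paragraphs, key=itemgetter("section"))
    ]


sections = join_sections(paragraphs)
for section, text in sections:
    print(f"== {section} ==")
    print(text)
```

Because `groupby` only merges neighbouring items, a section name that reappears later in the article yields a second, separate pair rather than being folded into the first one.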