titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

turn parse_medline_xml into an iterator to save memory #120

Closed: seandavi closed this pull request 1 year ago

seandavi commented 1 year ago

This PR makes two main changes:

  1. Adds grant parsing to the primary parser and embeds the results in the main PubMed record.
  2. Converts parse_medline_xml into an iterator. Parsing the largest pubmednXX file takes about 7 GB of memory, so yielding records one at a time instead of building the full list reduces memory usage considerably.
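
To illustrate the two changes above, here is a rough sketch of the general idea. It is not the code in this PR: iter_medline_records and the output keys (pmid, title, grants) are hypothetical names chosen for the example, and it uses the standard library's xml.etree.ElementTree.iterparse rather than whatever pubmed_parser uses internally. The element names follow the MEDLINE XML layout, with grants listed under GrantList/Grant.

```python
import io
import xml.etree.ElementTree as ET


def iter_medline_records(source):
    """Yield one dict per MedlineCitation element (hypothetical helper).

    `source` is a filename or a binary file object. Records are produced
    lazily, so the caller never holds the full list in memory.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "MedlineCitation":
            continue
        # Embed grant information directly in the record.
        grants = [
            {
                "grant_id": grant.findtext("GrantID", default=""),
                "agency": grant.findtext("Agency", default=""),
            }
            for grant in elem.iterfind(".//GrantList/Grant")
        ]
        yield {
            "pmid": elem.findtext("PMID", default=""),
            "title": elem.findtext("Article/ArticleTitle", default=""),
            "grants": grants,
        }
        # Drop the children and text of the element just converted.
        # An empty shell stays attached to its parent.
        elem.clear()


# Tiny made-up sample, only for demonstration.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>1</PMID>
    <Article><ArticleTitle>First</ArticleTitle>
      <GrantList><Grant><GrantID>R01</GrantID><Agency>NIH</Agency></Grant></GrantList>
    </Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>2</PMID>
    <Article><ArticleTitle>Second</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

for record in iter_medline_records(io.BytesIO(SAMPLE)):
    print(record["pmid"], record["title"], len(record["grants"]))
```

Callers that still want a list can wrap the call in list(...), at the cost of the original memory footprint.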

I haven't added tests or detailed examples, and this is a significant change to the API and internal behavior, so feel free to ignore the PR.

Finally, thanks for making a really useful piece of software available to the rest of us!

titipata commented 1 year ago

Thanks @seandavi! I'll check the PR later this week.