titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

Pubmed Parser is a Python library for parsing the PubMed Open-Access (OA) subset, MEDLINE XML repositories, and Entrez Programming Utilities (E-utils). It uses the lxml library to parse this information into a Python dictionary that can be used directly in research workflows such as text mining and natural language processing pipelines.

For the available APIs and details about the datasets, see our wiki page or documentation page. Below, we list some of the core functionalities with code examples.

Available Parsers

Below, we list available parsers from pubmed_parser.

Parse PubMed OA XML information

We created a simple parser for the PubMed Open Access Subset. Pass an XML path or string to parse_pubmed_xml and it returns a dictionary of article metadata (see the example output further below). Within that dictionary, author_list and affiliation_list are nested lists that share affiliation keys.

author_list:

 [['last_name_1', 'first_name_1', 'aff_key_1'],
  ['last_name_1', 'first_name_1', 'aff_key_2'],
  ['last_name_2', 'first_name_2', 'aff_key_1'], ...]

affiliation_list:

 [['aff_key_1', 'affiliation_1'],
  ['aff_key_2', 'affiliation_2'], ...]

import pubmed_parser as pp
dict_out = pp.parse_pubmed_xml(path)
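
Because authors and affiliations are linked only through affiliation keys, a small join is often the first post-processing step. The following is a minimal sketch, assuming the parsed dictionary exposes author_list and affiliation_list in the nested-list shape shown above. The helper name authors_with_affiliations is hypothetical and not part of pubmed_parser, and the input is a hand-written stand-in rather than real parser output.

```python
def authors_with_affiliations(parsed):
    """Group affiliation names under each (last_name, first_name) pair."""
    # Map each affiliation key to its affiliation text.
    aff_by_key = {key: name for key, name in parsed["affiliation_list"]}
    grouped = {}
    for last_name, first_name, aff_key in parsed["author_list"]:
        # An unknown key yields None instead of raising an error.
        grouped.setdefault((last_name, first_name), []).append(
            aff_by_key.get(aff_key)
        )
    return grouped


# Hand-written stand-in for the output of pp.parse_pubmed_xml(path).
parsed = {
    "author_list": [
        ["last_name_1", "first_name_1", "aff_key_1"],
        ["last_name_1", "first_name_1", "aff_key_2"],
        ["last_name_2", "first_name_2", "aff_key_1"],
    ],
    "affiliation_list": [
        ["aff_key_1", "affiliation_1"],
        ["aff_key_2", "affiliation_2"],
    ],
}

grouped = authors_with_affiliations(parsed)
for author, affiliations in grouped.items():
    print(author, affiliations)
```

An author with several affiliations appears once per affiliation in author_list, which is why the sketch groups rows by name instead of assuming one row per author.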

Parse PubMed OA citation references

The function parse_pubmed_references processes a PubMed Open Access XML file and returns a list of dictionaries, one for each reference the article cites.

dicts_out = pp.parse_pubmed_references(path) # returns a list of dictionaries

Parse PubMed OA images and captions

The function parse_pubmed_caption parses image captions from a given path to an XML file. Each result carries a reference index that you can use to refer back to the actual images. The function returns a list of dictionaries.

dicts_out = pp.parse_pubmed_caption(path) # returns a list of dictionaries

Parse PubMed OA Paragraph

If you are interested in the text surrounding a citation, the library provides that functionality as well. Use parse_pubmed_paragraph to parse paragraph text together with the PMIDs of the references it cites. This function returns a list of dictionaries.

These IDs can be merged with the output of parse_pubmed_references.

dicts_out = pp.parse_pubmed_paragraph('data/6605965a.nxml', all_paragraph=False)

Parse PubMed OA Table [WIP]

You can use parse_pubmed_table to parse tables from a PubMed OA XML file. This function returns a list of dictionaries.

dicts_out = pp.parse_pubmed_table('data/6605965a.nxml', return_xml=False)

Parse MEDLINE XML

MEDLINE XML has a different format than PubMed Open Access. The structure of the XML files is defined in the MEDLINE/PubMed DTD. Use the function parse_medline_xml to parse that format. This function returns a list of dictionaries, one per article.

Because records get updated, you may find more than one XML record for the same paper. In that case you can drop the record marked as deleted, since an updated version replaces it.

dicts_out = pp.parse_medline_xml('data/medline16n0902.xml.gz',
                                 year_info_only=False,
                                 nlm_category=False,
                                 author_list=False,
                                 reference_list=False) # return list of dictionary

To extract month and day information from PubDate, set year_info_only=False; with year_info_only=True, only the year is kept. The parser also handles structured abstracts, and the nlm_category argument controls how each section or label is displayed.
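
To make the PubDate option more concrete, here is a minimal sketch of the distinction that year_info_only draws. It is not the library's implementation: it uses the standard-library xml.etree.ElementTree instead of lxml, the helper pub_date and its year_only flag are hypothetical names, and the record is a hand-written toy with a made-up PMID that follows the MEDLINE element names only loosely.

```python
import xml.etree.ElementTree as ET

# Toy MEDLINE-style record; the values are invented for illustration.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <Journal>
        <JournalIssue>
          <PubDate><Year>2011</Year><Month>08</Month><Day>02</Day></PubDate>
        </JournalIssue>
        <Title>BMC Microbiology</Title>
      </Journal>
      <ArticleTitle>A toy record</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def pub_date(article, year_only=True):
    """Return the year, or all available date parts joined by '-'."""
    pubdate = article.find(".//PubDate")
    if pubdate is None:
        return ""
    parts = [pubdate.findtext(tag) for tag in ("Year", "Month", "Day")]
    parts = [part for part in parts if part]  # drop missing elements
    if not parts:
        return ""
    return parts[0] if year_only else "-".join(parts)


article = ET.fromstring(SAMPLE)
record = {
    "pmid": article.findtext(".//PMID"),
    "title": article.findtext(".//ArticleTitle"),
    "journal": article.findtext(".//Journal/Title"),
    "pubdate_year": pub_date(article, year_only=True),
    "pubdate_full": pub_date(article, year_only=False),
}
print(record)
```

Real MEDLINE records are less regular than this toy (for example, some carry a free-text date instead of separate year, month, and day elements), so the sketch should not be read as a substitute for parse_medline_xml.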

Parse MEDLINE Grant ID

Use parse_grant_id to parse MEDLINE grant IDs from an XML file. It returns a list of dictionaries.

If no grant ID is found, it returns None.

Parse MEDLINE XML from eutils website

You can use Pubmed Parser to parse XML retrieved from E-Utilities with parse_xml_web. Provide a single PMID as input to get back a dictionary.

dict_out = pp.parse_xml_web(pmid, save_xml=False)

Parse MEDLINE XML citations from website

The function parse_citation_web lets you get the citations to a given PubMed ID or PubMed Central ID. It returns a dictionary.

dict_out = pp.parse_citation_web(doc_id, id_type='PMC')

Parse Outgoing XML citations from website

The function parse_outgoing_citation_web lets you get the articles that a given article cites, given a PubMed ID or PubMed Central ID. It returns a dictionary.

dict_out = pp.parse_outgoing_citation_web(doc_id, id_type='PMID')

Identifiers should be passed as strings. PubMed Central IDs are the default and should be passed without the 'PMC' prefix. If no citations are found, or if no article matching doc_id is found in the indicated database, the function returns None.
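
The identifier rules above are easy to get wrong when IDs come from mixed sources. The sketch below shows one way to normalize an ID and to guard against a None result. Both helpers (normalize_doc_id and lookup_citations) are hypothetical and not part of pubmed_parser. A stand-in function replaces the network-backed call so the example runs offline, and the keys of its return value are invented for illustration.

```python
def normalize_doc_id(doc_id):
    """Return doc_id as a string with any leading 'PMC' prefix removed."""
    text = str(doc_id).strip()
    if text.upper().startswith("PMC"):
        text = text[3:]
    return text


def lookup_citations(fetch, doc_id, id_type="PMC"):
    """Call fetch (for example pp.parse_citation_web) and map None to {}."""
    result = fetch(normalize_doc_id(doc_id), id_type=id_type)
    return result if result is not None else {}


def fake_fetch(doc_id, id_type="PMC"):
    """Offline stand-in: returns None for the ID '0', a toy dict otherwise."""
    if doc_id == "0":
        return None
    return {"doc_id": doc_id, "id_type": id_type}


print(normalize_doc_id("PMC3166277"))
print(lookup_citations(fake_fetch, "PMC0"))
print(lookup_citations(fake_fetch, 21810267, id_type="PMID"))
```

In real use, fake_fetch would be replaced by pp.parse_citation_web or pp.parse_outgoing_citation_web, both of which accept doc_id and id_type as shown in the calls above.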

Installation

You can install the most up-to-date version of the package directly from the repository

pip install git+https://github.com/titipata/pubmed_parser.git

or install the latest release from PyPI using

pip install pubmed-parser

or clone the repository and install using pip

git clone https://github.com/titipata/pubmed_parser
pip install ./pubmed_parser

You can test your installation by running pytest --cov=pubmed_parser tests/ --verbose in the root of the repository.

Example snippet to parse PubMed OA dataset

The following example lists the XML files under a directory and parses the first one

import pubmed_parser as pp
path_xml = pp.list_xml_path('data') # list all xml paths under directory
pubmed_dict = pp.parse_pubmed_xml(path_xml[0]) # dictionary output
print(pubmed_dict)

{'abstract': "Background Despite identical genotypes and ...",
 'affiliation_list':
  [['I1', 'Department of Biological Sciences, ...'],
   ['I2', 'Biology Department, Queens College, and the Graduate Center ...']],
 'author_list':
  [['Dennehy', 'John J', 'I1'],
   ['Dennehy', 'John J', 'I2'],
   ['Wang', 'Ing-Nang', 'I1']],
 'full_title': 'Factors influencing lysis time stochasticity in bacteriophage \u03bb',
 'journal': 'BMC Microbiology',
 'pmc': '3166277',
 'pmid': '21810267',
 'publication_year': '2011',
 'publisher_id': '1471-2180-11-174',
 'subjects': 'Research Article'}
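
Once articles are parsed into dictionaries like the one above, a common next step is to flatten the scalar fields into a table. The following is a minimal sketch using only the standard library. The helper to_csv and the FIELDS selection are illustrative choices, not part of pubmed_parser, and the record is an abbreviated copy of the example output.

```python
import csv
import io

# Scalar fields to keep; nested fields such as author_list are skipped.
FIELDS = ["pmid", "pmc", "journal", "publication_year", "full_title"]

# Abbreviated copy of the example output shown above.
records = [
    {
        "full_title": "Factors influencing lysis time stochasticity in bacteriophage \u03bb",
        "journal": "BMC Microbiology",
        "pmc": "3166277",
        "pmid": "21810267",
        "publication_year": "2011",
        "author_list": [["Dennehy", "John J", "I1"]],
    }
]


def to_csv(records, fields=FIELDS):
    """Serialize the selected fields of each record as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells instead of raising KeyError.
        writer.writerow({name: record.get(name, "") for name in fields})
    return buffer.getvalue()


csv_text = to_csv(records)
print(csv_text)
```

Nested fields are left out here because a flat CSV has no natural place for them; a format such as JSON Lines would preserve them if they are needed downstream.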

Example Usage with PySpark

This snippet parses the entire PubMed Open Access subset using PySpark 2.1

import os
import pubmed_parser as pp
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate() # reuse the active session or start one

path_all = pp.list_xml_path('/path/to/xml/folder/')
path_rdd = spark.sparkContext.parallelize(path_all, numSlices=10000)
parse_results_rdd = path_rdd.map(lambda x: Row(file_name=os.path.basename(x),
                                               **pp.parse_pubmed_xml(x)))
pubmed_oa_df = parse_results_rdd.toDF() # Spark dataframe
pubmed_oa_df_sel = pubmed_oa_df[['full_title', 'abstract', 'doi',
                                 'file_name', 'pmc', 'pmid',
                                 'publication_year', 'publisher_id',
                                 'journal', 'subjects']] # select columns
pubmed_oa_df_sel.write.parquet('pubmed_oa.parquet', mode='overwrite') # write dataframe

See the scripts folder for more information.
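
The PySpark snippet maps a parser over a list of paths and attaches file_name to each row. For a corpus small enough to process on one machine, the same pattern can be sketched as a plain loop that also records which files failed to parse. The helper parse_all is hypothetical and not part of pubmed_parser, and fake_parser is a stand-in for pp.parse_pubmed_xml so that the example runs without any data files.

```python
import os


def parse_all(paths, parser):
    """Apply parser to each path; return (records, failures)."""
    records, failures = [], []
    for path in paths:
        try:
            record = dict(parser(path))
        except Exception as error:
            # Keep going and report the failing file afterwards.
            failures.append((path, repr(error)))
            continue
        record["file_name"] = os.path.basename(path)
        records.append(record)
    return records, failures


def fake_parser(path):
    """Stand-in for pp.parse_pubmed_xml; fails on one file on purpose."""
    if path.endswith("broken.nxml"):
        raise ValueError("malformed XML")
    return {"full_title": "title of " + path}


records, failures = parse_all(["data/a.nxml", "data/broken.nxml"], fake_parser)
print(records)
print(failures)
```

Collecting failures instead of stopping at the first one matters for large dumps, where a single malformed file would otherwise discard hours of work.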

Core Members

and contributors

Dependencies

Citation

If you use Pubmed Parser, please cite it from JOSS as follows

Achakulvisut et al., (2020). Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset. Journal of Open Source Software, 5(46), 1979, https://doi.org/10.21105/joss.01979

or using BibTeX

@article{Achakulvisut2020,
  doi = {10.21105/joss.01979},
  url = {https://doi.org/10.21105/joss.01979},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {46},
  pages = {1979},
  author = {Titipat Achakulvisut and Daniel Acuna and Konrad Kording},
  title = {Pubmed Parser: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset XML Dataset},
  journal = {Journal of Open Source Software}
}

Contributions

We welcome contributions from anyone who would like to improve Pubmed Parser. You can create GitHub issues to discuss questions or problems relating to the repository. We suggest reading our Contributing Guidelines before creating issues, reporting bugs, or making a contribution to the repository.

Acknowledgement

This package is developed in Konrad Kording's Lab at the University of Pennsylvania. We would like to thank reviewers and the editor from JOSS including tleonardi, timClicks, and majensen. They made our repository much better!

License

MIT License Copyright (c) 2015-2020 Titipat Achakulvisut, Daniel E. Acuna