titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Create function to parse reference list #11

Closed by titipata 8 years ago

titipata commented 8 years ago

I've attached code below that parses the reference list, but I haven't cleaned it up yet. @davidbrandfonbrener, can you take a look?

import pubmed_parser as pp
from lxml import etree


def first_text(node, path):
    """Return the text of the first element matching path, or '' if absent."""
    found = node.find(path)
    return found.text if found is not None and found.text else ''


path_xml = pp.list_xml_path('data/')
# tree = etree.parse('data/pntd.0002065.nxml')
tree = etree.parse(path_xml[0])
references = tree.xpath('//ref-list/ref[@id]')
dict_refs = list()
for r in references:
    ref_id = r.attrib['id']
    for rc in r:
        # Skip children such as <label> that are not citations
        if 'publication-type' not in rc.attrib:
            continue
        # Read the attribute by name; values()[0] depends on attribute order
        journal_type = rc.attrib['publication-type']
        names = list()
        for n in rc.findall('name'):
            # Reverse the child order (typically surname, given-names) and skip empty text
            name = ' '.join([t.text for t in n if t.text][::-1])
            names.append(name)
        dict_ref = {'ref_id': ref_id,
                    'name': names,
                    'article_title': first_text(rc, 'article-title'),
                    'journal': first_text(rc, 'source'),
                    'journal_type': journal_type,
                    'pmid': first_text(rc, 'pub-id[@pub-id-type="pmid"]')}
        dict_refs.append(dict_ref)
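
For anyone who wants to try the idea without a data directory, here is a rough, self-contained sketch of the same loop wrapped in a function. It is not the code that ended up in the library. It swaps lxml for the standard library's xml.etree.ElementTree so it runs with no extra dependencies, and the names `parse_references`, `first_text`, and `SAMPLE` are made up for illustration. The sample XML is a hand-written fragment, not a real article.

```python
import xml.etree.ElementTree as ET


def first_text(node, path):
    """Return the stripped text of the first match for path, or ''."""
    found = node.find(path)
    if found is None or found.text is None:
        return ''
    return found.text.strip()


def parse_references(root):
    """Collect one dict per citation under ref-list/ref elements that have an id.

    Hypothetical helper: field names mirror the snippet above.
    """
    refs = []
    for ref in root.findall('.//ref-list/ref[@id]'):
        for cit in ref:
            pub_type = cit.get('publication-type')
            if pub_type is None:
                continue  # e.g. <label>, which is not a citation
            names = []
            for n in cit.findall('name'):
                # Build "given-names surname", skipping whichever part is missing
                parts = [n.findtext('given-names'), n.findtext('surname')]
                names.append(' '.join(p.strip() for p in parts if p))
            refs.append({
                'ref_id': ref.get('id'),
                'name': names,
                'article_title': first_text(cit, 'article-title'),
                'journal': first_text(cit, 'source'),
                'journal_type': pub_type,
                'pmid': first_text(cit, "pub-id[@pub-id-type='pmid']"),
            })
    return refs


# Hand-written sample input (an assumption, not taken from a real .nxml file)
SAMPLE = """<article><back><ref-list>
<ref id="B1"><label>1</label>
<element-citation publication-type="journal">
<name><surname>Doe</surname><given-names>J</given-names></name>
<article-title>A title</article-title>
<source>Some Journal</source>
<pub-id pub-id-type="pmid">12345</pub-id>
</element-citation></ref>
<ref id="B2"><mixed-citation publication-type="book">
<source>A Book</source>
</mixed-citation></ref>
</ref-list></back></article>"""

refs = parse_references(ET.fromstring(SAMPLE))
for ref in refs:
    print(ref['ref_id'], ref['journal_type'], ref['name'], ref['pmid'])
```

Like the snippet above, this only picks up `name` elements that are direct children of the citation, so names nested inside a grouping element would be missed.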

titipata commented 8 years ago

Added. See pm_parser.py.