Closed titipata closed 8 years ago
I attach code to do that here but still haven't cleaned it up. @davidbrandfonbrener can you take a look?
import pubmed_parser as pp
from lxml import etree


def join(l):
    return ' '.join(l)


path_xml = pp.list_xml_path('data/')
# tree = etree.parse('data/pntd.0002065.nxml')
tree = etree.parse(path_xml[0])
references = tree.xpath('//ref-list/ref[@id]')
dict_refs = list()
for r in references:
    ref_id = r.attrib['id']
    for rc in r:
        # only citation elements carry a publication-type attribute
        if 'publication-type' in rc.attrib.keys():
            # read the attribute by name instead of taking the first value
            journal_type = rc.attrib['publication-type']
            names = list()
            for n in rc.findall('name'):
                # children are reversed so the name reads given name, then surname
                name = join([t.text or '' for t in n][::-1])
                names.append(name)
            try:
                article_title = rc.findall('article-title')[0].text
            except IndexError:
                article_title = ''
            try:
                journal = rc.findall('source')[0].text
            except IndexError:
                journal = ''
            try:
                pmid = rc.findall('pub-id[@pub-id-type="pmid"]')[0].text
            except IndexError:
                pmid = ''
            dict_ref = {'ref_id': ref_id,
                        'name': names,
                        'article_title': article_title,
                        'journal': journal,
                        'journal_type': journal_type,
                        'pmid': pmid}
            dict_refs.append(dict_ref)
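For anyone who wants to try the reference extraction above without a local `data/` folder, here is a minimal sketch that runs on a small inline sample. It is not the code that went into the library. It swaps lxml for the standard library's `xml.etree.ElementTree` so it has no dependencies, and the sample XML, the `citation` element name, and the helpers `first_text` and `parse_references` are all hypothetical names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, loosely shaped like a PMC <ref-list>; not real data.
SAMPLE_XML = """<article>
  <back>
    <ref-list>
      <ref id="B1">
        <citation publication-type="journal">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
          <name><surname>Roe</surname><given-names>Richard</given-names></name>
          <article-title>An example title</article-title>
          <source>Example Journal</source>
          <pub-id pub-id-type="pmid">12345</pub-id>
        </citation>
      </ref>
      <ref id="B2">
        <citation publication-type="book">
          <source>Example Book</source>
        </citation>
      </ref>
    </ref-list>
  </back>
</article>"""


def first_text(element, path):
    """Return the text of the first match for `path`, or '' if there is none."""
    found = element.find(path)
    if found is None or found.text is None:
        return ''
    return found.text


def parse_references(xml_string):
    """Collect one dict per citation element found under <ref-list>/<ref id=...>."""
    root = ET.fromstring(xml_string)
    refs = []
    for ref in root.findall('.//ref-list/ref[@id]'):
        for citation in ref:
            # Skip children that are not citations (e.g. a <label>).
            if 'publication-type' not in citation.attrib:
                continue
            names = []
            for name in citation.findall('name'):
                # Children come as surname, given-names; reverse for display order.
                parts = [child.text or '' for child in name]
                names.append(' '.join(parts[::-1]))
            refs.append({
                'ref_id': ref.attrib['id'],
                'name': names,
                'article_title': first_text(citation, 'article-title'),
                'journal': first_text(citation, 'source'),
                'journal_type': citation.attrib['publication-type'],
                'pmid': first_text(citation, 'pub-id[@pub-id-type="pmid"]'),
            })
    return refs


if __name__ == '__main__':
    for reference in parse_references(SAMPLE_XML):
        print(reference)
```

Returning `''` from one small helper replaces the repeated try/except blocks, so a missing title or PMID never raises.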
Added. See pm_parser.py.