titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Processing deleted Medline "citations" in NLM XML records #17

Closed daniel-acuna closed 8 years ago

daniel-acuna commented 8 years ago

How should we process deleted citations?

Sometimes an update XML file comes with "deleted" citations (like this example), and it would be good to know which PMIDs were deleted.

For example, the stats for the update file medline16n0906.xml (available at ftp://ftp.nlm.nih.gov/nlmdata/.medlease/gz/medline16n0906_stats.html) say that there are 8809 citations and 353 delete citations. If we process the XML with pubmed_parser, we correctly get 8809 - 353 = 8456 records. Use the code below to test this:

# adapted from http://stackoverflow.com/questions/18772703/read-a-file-in-buffer-from-ftp-python
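from ftplib import FTP
import gzip
import io  # Python 3: io.BytesIO replaces the Python 2 StringIO module

def open_ftp_data(server, path, binary=True):
    """Download a file over anonymous FTP into an in-memory bytes buffer."""
    ftp = FTP(server)
    ftp.login()  # anonymous login

    data_io = io.BytesIO()
    def handle_line(line):
        # retrlines passes each line as str with the line ending stripped
        data_io.write(line.encode(ftp.encoding) + b"\n")
    if binary:
        ftp.retrbinary("RETR " + path, callback=data_io.write)
    else:
        ftp.retrlines("RETR " + path, callback=handle_line)
    data_io.seek(0)  # Go back to the start
    ftp.close()
    return data_io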

import pubmed_parser as pp
binary_file = open_ftp_data('ftp.nlm.nih.gov', 'nlmdata/.medlease/gz/medline16n0906.xml.gz')
zippy = gzip.GzipFile(fileobj=binary_file)
medline_xml = zippy.read()
dict_records = pp.parse_medline_xml(medline_xml)
print("pubmed_parser records processed: {}".format(len(dict_records)))

Output

pubmed_parser records processed: 8456

We can find the deleted citations simply by querying the XML directly:

from lxml import etree
root = etree.fromstring(medline_xml)
print("Delete citations {}".format(len(root.xpath('//DeleteCitation/PMID'))))

Output

Delete citations 353

titipata commented 8 years ago

Should parse_medline_xml have an option that lets the user choose whether to remove them or not? I would set the default to not removing delete citations. We should also explain in the documentation or the function docstring what "delete citations" means.

daniel-acuna commented 8 years ago

The delete citations may refer to records in other XML files. I would say, let's add an option to parse_medline_xml, say return_deleted, that makes it return two results: the first is the usual list of dicts and the second is the list of PMIDs that are listed as deleted in the XML. By default, let's set return_deleted to False.

To sum up, if return_deleted = False, the behavior is the same as now. If it is True, we return what we are returning now plus a list of PMIDs that are listed as deleted.
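A minimal sketch of what that option could look like, written as a wrapper so it runs on its own. Everything here is an assumption made for illustration: parse_with_deleted, list_deleted_pmids, toy_parse_records and SAMPLE_XML are hypothetical names, the sample XML is a tiny hand-made stand-in for a real update file, and the standard-library ElementTree is used in place of lxml. In real use, parse_records would be pp.parse_medline_xml.

```python
import xml.etree.ElementTree as ET

# Hand-made stand-in for a MEDLINE update file (not real data).
SAMPLE_XML = b"""<MedlineCitationSet>
  <MedlineCitation><PMID Version="1">111</PMID></MedlineCitation>
  <MedlineCitation><PMID Version="1">222</PMID></MedlineCitation>
  <DeleteCitation>
    <PMID Version="1">333</PMID>
    <PMID Version="1">444</PMID>
  </DeleteCitation>
</MedlineCitationSet>"""


def list_deleted_pmids(xml_bytes):
    """Return the PMIDs listed under any DeleteCitation element."""
    root = ET.fromstring(xml_bytes)
    return [pmid.text.strip() for pmid in root.findall(".//DeleteCitation/PMID")]


def toy_parse_records(xml_bytes):
    """Toy stand-in for the real parser: one dict per MedlineCitation."""
    root = ET.fromstring(xml_bytes)
    return [{"pmid": c.findtext("PMID")} for c in root.findall(".//MedlineCitation")]


def parse_with_deleted(xml_bytes, parse_records, return_deleted=False):
    """Hypothetical wrapper: same result as parse_records by default,
    or a (records, deleted_pmids) tuple when return_deleted is True."""
    records = parse_records(xml_bytes)
    if not return_deleted:
        return records
    return records, list_deleted_pmids(xml_bytes)


records, deleted = parse_with_deleted(SAMPLE_XML, toy_parse_records, return_deleted=True)
print("records: {}, deleted PMIDs: {}".format(len(records), deleted))
```

One consequence of this design is that the return type changes with the flag, so callers have to know which form they asked for.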

titipata commented 8 years ago

@daniel-acuna Can you apply changes to the function?

daniel-acuna commented 8 years ago

Of course!

titipata commented 8 years ago

BTW, how about just adding one more field to the output dictionary instead, delete: True or delete: False? In my opinion, that would make the output more consistent.

daniel-acuna commented 8 years ago

I was just thinking about this and I think you are right. I'll add a field.
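For reference, a rough sketch of the field-based approach, assuming each delete citation becomes its own record carrying only its PMID. This is not the code that ended up in pubmed_parser: parse_with_delete_flag and SAMPLE_XML are hypothetical, the "pmid" key and the sample XML are illustrative, and the standard-library ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET

# Hand-made stand-in for a MEDLINE update file (not real data).
SAMPLE_XML = b"""<MedlineCitationSet>
  <MedlineCitation><PMID Version="1">111</PMID></MedlineCitation>
  <MedlineCitation><PMID Version="1">222</PMID></MedlineCitation>
  <DeleteCitation>
    <PMID Version="1">333</PMID>
    <PMID Version="1">444</PMID>
  </DeleteCitation>
</MedlineCitationSet>"""


def parse_with_delete_flag(xml_bytes):
    """Return one dict per PMID, marking delete citations with delete=True."""
    root = ET.fromstring(xml_bytes)
    records = []
    # Regular citations keep delete=False.
    for citation in root.findall(".//MedlineCitation"):
        records.append({"pmid": citation.findtext("PMID"), "delete": False})
    # Each PMID under DeleteCitation becomes a record flagged delete=True.
    for pmid in root.findall(".//DeleteCitation/PMID"):
        records.append({"pmid": pmid.text.strip(), "delete": True})
    return records


records = parse_with_delete_flag(SAMPLE_XML)
# Downstream code can split the two kinds of records with a simple filter.
active = [r for r in records if not r["delete"]]
deleted_pmids = {r["pmid"] for r in records if r["delete"]}
print("active: {}, deleted: {}".format(len(active), sorted(deleted_pmids)))
```

Because the delete citations may refer to records in other XML files, the set of deleted PMIDs would typically be applied to records collected from earlier files, not only to the file being parsed.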