titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Parsing DOI Medline XML #68

Closed: sakrifor closed this issue 4 years ago

sakrifor commented 4 years ago

The DOI is not always under ELocationID in MEDLINE XML files. In some cases it is in ArticleIdList, as in the following:

<ArticleIdList>
        <ArticleId IdType="pubmed">225</ArticleId>
        <ArticleId IdType="doi">10.1055/s-0028-1106478</ArticleId>
</ArticleIdList>

Alongside the DOI, other IDs such as pubmed and pmc may also appear in this list. The example above is from pubmed19n0001.xml.

If that's okay, I may be able to open a PR by the end of the week that checks ELocationID first and then falls back to ArticleIdList.
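The fallback order described above could look roughly like this. It is a minimal sketch using the standard library's xml.etree.ElementTree, not the code in pubmed_parser. The function name `extract_doi` is hypothetical, and the sketch assumes that ELocationID marks DOIs with an `EIdType="doi"` attribute and that the article's own ArticleIdList sits under PubmedData.

```python
import xml.etree.ElementTree as ET


def extract_doi(article):
    """Return the DOI for one article element, or "" if none is found.

    Hypothetical helper: checks ELocationID first, then ArticleIdList.
    """
    # First choice: <ELocationID EIdType="doi">...</ELocationID>
    for eloc in article.iter("ELocationID"):
        if eloc.get("EIdType") == "doi" and eloc.text:
            return eloc.text.strip()
    # Fallback: <ArticleId IdType="doi"> in the article's own ArticleIdList.
    # The path is anchored at PubmedData (an assumption about the layout)
    # so that ID lists nested elsewhere in the record are not picked up.
    for article_id in article.findall("PubmedData/ArticleIdList/ArticleId"):
        if article_id.get("IdType") == "doi" and article_id.text:
            return article_id.text.strip()
    return ""


# The ArticleIdList from this issue, wrapped in a minimal record.
sample = ET.fromstring(
    "<PubmedArticle><PubmedData><ArticleIdList>"
    '<ArticleId IdType="pubmed">225</ArticleId>'
    '<ArticleId IdType="doi">10.1055/s-0028-1106478</ArticleId>'
    "</ArticleIdList></PubmedData></PubmedArticle>"
)
print(extract_doi(sample))  # 10.1055/s-0028-1106478
```

Returning an empty string when nothing is found is just one choice here; returning None would work equally well if callers prefer it.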

titipata commented 4 years ago

Thanks @sakrifor, yes, that would be great if you can make the PR! Otherwise, I can check it by next week.

sakrifor commented 4 years ago

@titipata I ran into a problem while trying to fix this. It seems that the DTD has changed in recent years and the MedlineCitationSet element is no longer used. It has been replaced by PubmedArticleSet -> PubmedArticle. So the change has to begin where parse_medline_xml() starts walking the tree, and then all the other field parsers need to be checked for issues. I'll try to make this change, but I am not sure how long it will take to get a PR ready.

Good news: they are not going to change the DTD next year (NLM Data News).
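To illustrate what the change in root element means for the tree walk, here is a small sketch that handles both layouts. It is not how parse_medline_xml() is written; `iter_citations` is a hypothetical name, and the assumption is that older files place MedlineCitation directly under the root while newer files wrap it in PubmedArticle.

```python
import xml.etree.ElementTree as ET


def iter_citations(root):
    """Yield (pmid, record) pairs for either document layout."""
    if root.tag == "PubmedArticleSet":
        # Newer layout: PubmedArticleSet -> PubmedArticle -> MedlineCitation
        records = root.findall("PubmedArticle")
        pmid_path = "MedlineCitation/PMID"
    else:
        # Older layout (assumed): MedlineCitationSet -> MedlineCitation
        records = root.findall("MedlineCitation")
        pmid_path = "PMID"
    for record in records:
        yield record.findtext(pmid_path, default="").strip(), record


new_style = ET.fromstring(
    "<PubmedArticleSet><PubmedArticle><MedlineCitation>"
    "<PMID>225</PMID></MedlineCitation></PubmedArticle></PubmedArticleSet>"
)
old_style = ET.fromstring(
    "<MedlineCitationSet><MedlineCitation>"
    "<PMID>225</PMID></MedlineCitation></MedlineCitationSet>"
)
for root in (new_style, old_style):
    print(root.tag, [pmid for pmid, _ in iter_citations(root)])
```

Because the record element differs between the two layouts, any field parser that uses paths relative to the record would need the same kind of adjustment.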

titipata commented 4 years ago

Ah, thanks @sakrifor, and thank you so much for the information. I got the DTD. I'll take a look and fix it this week!

titipata commented 4 years ago

@sakrifor I couldn't find a DOI in this format in pubmed19n0001.xml. Could you give me a line number in the file, or a PMID from pubmed19n0001.xml, for which the DOI cannot be retrieved?

sakrifor commented 4 years ago

@titipata Searching for ArticleId IdType="doi" in pubmed19n0001.xml, I find 1747 instances. The first occurrence is on line 34765, and the PMID for that article is 225 (Title: [Adverse effects of anti-epileptic drugs].). The second occurrence is on line 35053, with PMID 227 (Title: Human and monkey prolactin and growth hormone: separation of polymorphic forms by isoelectric focusing.), and so on. I hope this helps.
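A similar check can be scripted rather than done by text search. The sketch below streams a file with ElementTree's iterparse and yields the PMID and DOI of each record whose ArticleIdList carries a DOI. The name `pmids_with_articleid_doi` is hypothetical, and the paths assume the PubmedArticleSet layout. It only looks at each record's own ArticleIdList, so its count need not match a plain text search.

```python
import io
import xml.etree.ElementTree as ET


def pmids_with_articleid_doi(source):
    """Yield (pmid, doi) for records whose ArticleIdList contains a DOI.

    `source` is a file path or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID", default="").strip()
        for article_id in elem.findall("PubmedData/ArticleIdList/ArticleId"):
            if article_id.get("IdType") == "doi" and article_id.text:
                yield pmid, article_id.text.strip()
                break
        # Drop the record's children to limit memory use on large files.
        elem.clear()


# Tiny in-memory stand-in for a real file: one record with a DOI
# (from this issue) and one placeholder record without a DOI.
demo = (
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation><PMID>225</PMID></MedlineCitation>"
    b"<PubmedData><ArticleIdList>"
    b'<ArticleId IdType="pubmed">225</ArticleId>'
    b'<ArticleId IdType="doi">10.1055/s-0028-1106478</ArticleId>'
    b"</ArticleIdList></PubmedData></PubmedArticle>"
    b"<PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation>"
    b"<PubmedData><ArticleIdList>"
    b'<ArticleId IdType="pubmed">1</ArticleId>'
    b"</ArticleIdList></PubmedData></PubmedArticle>"
    b"</PubmedArticleSet>"
)
for pmid, doi in pmids_with_articleid_doi(io.BytesIO(demo)):
    print(pmid, doi)
```

For a real file, passing the path (for example the uncompressed pubmed19n0001.xml) in place of the BytesIO object should work the same way.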

titipata commented 4 years ago

@sakrifor thanks a ton! I made a PR in #69. Can you try out the branch and check whether it works for most of the MEDLINE XML files?

sakrifor commented 4 years ago

@titipata Yes, it seems to parse all the available DOIs now. I tested it on several XML files. Thank you very much!