Closed sakrifor closed 4 years ago
Thanks @sakrifor, yes, that would be great if you can make the PR! Otherwise, I can check it by next week.
@titipata I ran into an issue while trying to fix this. It seems that the DTD has changed in recent years and the MedlineCitationSet element is no longer used. It has been replaced by PubmedArticleSet -> PubmedArticle. So the change has to begin where the tree is first parsed, in the parse_medline_xml() method, and then all the other info parsers need to be checked for issues. I'll try to make the change, but I am not sure how long a PR will take.
Good news: they are not going to change the DTD next year (NLM Data News).
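For anyone following along, here is a minimal sketch of what handling both root elements could look like. It uses only the standard library rather than the project's own parser, the helper name iter_citations is hypothetical, and the two tiny XML strings are made-up placeholders, not excerpts from a real MEDLINE file.

```python
import xml.etree.ElementTree as ET

# Made-up placeholder documents, only to illustrate the two root layouts.
NEW_STYLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>1</PMID></MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation><PMID>2</PMID></MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

OLD_STYLE = """<MedlineCitationSet>
  <MedlineCitation><PMID>3</PMID></MedlineCitation>
</MedlineCitationSet>"""


def iter_citations(root):
    """Yield MedlineCitation elements for either root layout (hypothetical helper)."""
    if root.tag == "PubmedArticleSet":
        # Newer layout: PubmedArticleSet -> PubmedArticle -> MedlineCitation
        for article in root.iter("PubmedArticle"):
            citation = article.find("MedlineCitation")
            if citation is not None:
                yield citation
    elif root.tag == "MedlineCitationSet":
        # Older layout: MedlineCitation sits directly under the root
        yield from root.findall("MedlineCitation")
    else:
        raise ValueError(f"Unexpected root element: {root.tag}")


pmids_new = [c.findtext("PMID") for c in iter_citations(ET.fromstring(NEW_STYLE))]
pmids_old = [c.findtext("PMID") for c in iter_citations(ET.fromstring(OLD_STYLE))]
```

Dispatching on the root tag keeps the per-field parsers unchanged, since they still receive a MedlineCitation element in both cases. Whether the actual fix should keep supporting the old layout is a separate decision.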
Ah, thanks @sakrifor. Thank you so much for the information. I got a DTD. I'll take a look and fix it by this week!
@sakrifor I couldn't find the following format of DOI in pubmed19n0001.xml. Could you give a sample location in the file, or a PMID from pubmed19n0001.xml, for which you cannot retrieve the DOI?
@titipata Searching for ArticleId IdType="doi" in pubmed19n0001.xml, I find 1747 instances. The first occurrence is on line 34765, and the PMID for that article is 225 (Title: [Adverse effects of anti-epileptic drugs].). The second occurrence is on line 35053, with PMID 227 (Title: Human and monkey prolactin and growth hormone: separation of polymorphic forms by isoelectric focusing.), and so on. I hope this helps.
@sakrifor thanks a ton! I made a PR in #69. Can you try out the branch to see if it works for most of the Medline XML files?
@titipata Yes, it seems to parse all the available DOIs now. Thank you very much! Tested in several XMLs.
DOI is not always under ELocationID in MEDLINE XML files. In some cases it is in ArticleIdList like the following:
Alongside the DOI, other IDs such as pubmed and pmc may also exist in that list. The example above is from pubmed19n0001.xml.
I may be able to PR a fix by the end of the week which will first check for ELocationID and then for ArticleIdList if that's okay.
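A rough sketch of the proposed lookup order (ELocationID first, then ArticleIdList) is below. It is not the code from the eventual PR: it uses the standard library's ElementTree, the function name extract_doi is hypothetical, and the sample XML with its DOI values is a made-up placeholder, not content from pubmed19n0001.xml.

```python
import xml.etree.ElementTree as ET

# Made-up placeholder records; the DOI strings are not real identifiers.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ELocationID EIdType="doi">10.0000/from-elocation</ELocationID>
      </Article>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">1</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article/>
    </MedlineCitation>
    <PubmedData>
      <ArticleIdList>
        <ArticleId IdType="pubmed">2</ArticleId>
        <ArticleId IdType="doi">10.0000/from-article-id-list</ArticleId>
        <ArticleId IdType="pmc">PMC0000000</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_doi(article):
    """Return the DOI of one PubmedArticle element, or "" if none is found."""
    # First choice: ELocationID with EIdType="doi" inside the citation.
    elocation = article.find("MedlineCitation/Article/ELocationID[@EIdType='doi']")
    if elocation is not None and elocation.text:
        return elocation.text.strip()
    # Fallback: ArticleId with IdType="doi" in the article's own ArticleIdList.
    # The explicit path avoids matching any nested ArticleIdList elsewhere.
    article_id = article.find("PubmedData/ArticleIdList/ArticleId[@IdType='doi']")
    if article_id is not None and article_id.text:
        return article_id.text.strip()
    return ""


root = ET.fromstring(SAMPLE)
dois = {
    article.findtext("MedlineCitation/PMID"): extract_doi(article)
    for article in root.iter("PubmedArticle")
}
```

Filtering on the IdType attribute is what keeps the pubmed and pmc entries in the same list from being mistaken for the DOI.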