titipata / pubmed_parser

:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

medline xml equivalent for "parse_pubmed_references"? #64

Closed huminpurin closed 4 years ago

huminpurin commented 5 years ago

I get "AttributeError: 'NoneType' object has no attribute 'find'" when using parse_pubmed_references on XML files from the MEDLINE/PubMed data (https://www.nlm.nih.gov/databases/download/pubmed_medline.html).

parse_medline_xml can parse the XML files but does not return the references. I checked the XML files and I'm sure the reference data is in there. Is there any way to get something like "parse_medline_references"?

titipata commented 5 years ago

Hi @huminpurin, can you point me to a sample of the XML file you're working from? I will have more time next week to fix the library.

huminpurin commented 5 years ago

Thanks @titipata. I got the file from the official database of the National Library of Medicine: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/

titipata commented 5 years ago

@huminpurin, I see now. We do not currently have an implementation for parsing MEDLINE references. I am checking now whether I can get the references from MEDLINE. Having such a function (parse_medline_xml) would be great for the library!

titipata commented 5 years ago

@huminpurin, I actually do not see any reference data in the MEDLINE dataset. If you can point me to a specific file that has reference data, I can try to implement it for you.

huminpurin commented 5 years ago

@titipata Yes, it would be great to have a (parse_medline_xml) function; after all, all files in the NLM database are MEDLINE XML. To answer your question, there is a <ReferenceList> tag in the XML files which lists the references of a paper. Here are some example lines from an actual XML file:

Proc Natl Acad Sci U S A. 2012 Apr 10;109(15):5850-5 22454498
ACS Synth Biol. 2017 Jul 21;6(7):1296-1304 28274123
......
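
For anyone hitting the same problem, here is a minimal sketch of pulling those entries out with the standard library alone. It is not pubmed_parser's implementation. The element layout (`PubmedArticle` > `PubmedData` > `ReferenceList` > `Reference`, with `Citation` and `ArticleIdList/ArticleId` children) is an assumption based on how the update files appear to be structured; the function name `extract_references`, the output keys, and the citing PMID `0` in the sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The two citations are the ones quoted above;
# the surrounding element layout and the citing PMID "0" are assumptions.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation><PMID>0</PMID></MedlineCitation>
    <PubmedData>
      <ReferenceList>
        <Reference>
          <Citation>Proc Natl Acad Sci U S A. 2012 Apr 10;109(15):5850-5</Citation>
          <ArticleIdList><ArticleId IdType="pubmed">22454498</ArticleId></ArticleIdList>
        </Reference>
        <Reference>
          <Citation>ACS Synth Biol. 2017 Jul 21;6(7):1296-1304</Citation>
          <ArticleIdList><ArticleId IdType="pubmed">28274123</ArticleId></ArticleIdList>
        </Reference>
      </ReferenceList>
    </PubmedData>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_references(xml_text):
    """Return one dict per <Reference>: citing PMID, citation text, cited PMID."""
    root = ET.fromstring(xml_text)
    rows = []
    for article in root.iter("PubmedArticle"):
        # PMID of the citing article; None if the element is missing.
        pmid = article.findtext("MedlineCitation/PMID")
        for ref in article.iter("Reference"):
            cited_pmid = None
            # A reference may carry several IDs; keep only the PubMed one.
            for article_id in ref.iter("ArticleId"):
                if article_id.get("IdType") == "pubmed":
                    cited_pmid = article_id.text
                    break
            rows.append({
                "pmid": pmid,
                "citation": ref.findtext("Citation"),
                "pmid_cited": cited_pmid,
            })
    return rows


references = extract_references(SAMPLE)
```

Articles without a `<ReferenceList>` simply contribute no rows, so the same function can be run over files that only partly contain references.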

titipata commented 5 years ago

@huminpurin, ah nice, thanks a lot! I had not noticed that it exists. It seems the references are not available for all of the XML files. I sampled a few publications but still didn't see a ReferenceList. Can you tell me specifically which file you got this example from?

I will take a look and update you soon!

huminpurin commented 5 years ago

@titipata Oh, I see where the problem is. Not all the files in ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/ contain a reference list; only some of them do. I just checked the latest one, pubmed19n1117.xml.gz, and it has a reference list. Maybe the older data doesn't have reference lists.
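
A quick way to check a given file is to stream through it and count how many articles carry a reference list. This is only a sketch: the function name `count_articles_with_references` is hypothetical, and it assumes each record is a `<PubmedArticle>` element with an optional `<ReferenceList>` somewhere beneath it.

```python
import gzip
import xml.etree.ElementTree as ET


def count_articles_with_references(path):
    """Return (articles_with_reference_list, total_articles) for a .xml.gz file."""
    with_refs = 0
    total = 0
    with gzip.open(path, "rb") as handle:
        # iterparse streams the file, so the whole document is never held in memory.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == "PubmedArticle":
                total += 1
                if elem.find(".//ReferenceList") is not None:
                    with_refs += 1
                # Drop the finished article's children to keep memory use low.
                elem.clear()
    return with_refs, total


# Usage, assuming the file has been downloaded locally:
# with_refs, total = count_articles_with_references("pubmed19n1117.xml.gz")
# print(f"{with_refs} of {total} articles have a reference list")
```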

titipata commented 4 years ago

@huminpurin, sorry for the late reply. I think I understand it now. I will update you in a new PR.

titipata commented 4 years ago

Fixed in #69.