titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License

Pubdate not returning correct year #58

Closed: MatthewDeitz closed this issue 4 years ago

MatthewDeitz commented 5 years ago

Some of the pubdate fields in the output are wrong. Instead of pulling the year, the parser splits the text on " " and takes the first chunk, so you end up with pubdate values like ["Summer","Winter"]. Example PMIDs where this happens are [28599031, 28599032, 28599033, etc.]. Could you please match on a regular expression such as "\d{4}" instead of splitting on whitespace and taking the first chunk?
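To illustrate the difference, here is a minimal sketch comparing the two approaches. The function names and the sample strings are hypothetical and are not taken from the parser or from the PMIDs listed above; the sample strings only assume that some date strings start with a season name.

```python
import re

# Matches the first run of four digits, as suggested in the comment above.
YEAR_PATTERN = re.compile(r"\d{4}")


def first_chunk(pubdate_text):
    """Hypothetical stand-in for the reported behaviour: take the first chunk."""
    return pubdate_text.split(" ")[0]


def extract_year(pubdate_text):
    """Return the first four-digit run in the text, or '' if there is none."""
    match = YEAR_PATTERN.search(pubdate_text)
    return match.group(0) if match else ""


# Hypothetical date strings, for illustration only.
for text in ["Summer 2017", "2017 Jun 9", "Winter 2016"]:
    print(repr(text), "->", repr(first_chunk(text)), "vs", repr(extract_year(text)))
```

Note that a bare `\d{4}` matches any four consecutive digits, so it relies on the date string not containing another four-digit number before the year.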

titipata commented 5 years ago

Hello @MatthewDeitz, first, thanks for the issue! Is there any way you can share the XML files for those PMIDs with me? And yes, I can update the parser to use a regular expression instead of splitting on whitespace. Alternatively, if you have already fixed the parser, feel free to send a pull request.

titipata commented 5 years ago

@MatthewDeitz, I updated the parser at de61d61. Let me know if this solves the issue.

kaustubhn commented 5 years ago

Hi @titipata, I am parsing PubMed files from ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/. I have extracted the XML files to a folder and am calling pp.parse_medline_xml() with year_info_only=False, but I only get the year and month. The parser does not return the day even when the day is present in the XML file.

Could you tell me whether this is the expected behaviour? If not, could you point me to where the problem might be? I will try to fix it.

Thank you!

titipata commented 5 years ago

Hi @kaustubhn, thanks so much! The parser should return the full date, including the day, whenever it is available. The function that does this is at https://github.com/titipata/pubmed_parser/blob/master/pubmed_parser/medline_parser.py#L241-L242. Feel free to fix it and open a PR!
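For anyone picking this up, below is a rough sketch of how a date with an optional day could be assembled from a MEDLINE `PubDate` element. It is not the code in `medline_parser.py`: the function name `pubdate_to_string` and the `-` separator are assumptions made for illustration, and it uses the standard-library `xml.etree.ElementTree` rather than whatever the parser itself uses. It only covers the case where `PubDate` has `Year`, `Month`, and `Day` children.

```python
import xml.etree.ElementTree as ET


def pubdate_to_string(pubdate, year_info_only=False):
    """Join Year, Month and Day from a PubDate element, stopping at the first gap.

    Hypothetical helper: the name and the '-' separator are illustrative only.
    """
    parts = []
    for tag in ("Year", "Month", "Day"):
        value = (pubdate.findtext(tag) or "").strip()
        if not value:
            break  # a missing part ends the date; later parts are ignored
        parts.append(value)
        if year_info_only:
            break  # the caller asked for the year only
    return "-".join(parts)


# Hand-written sample elements, not copied from the baseline files.
full = ET.fromstring(
    "<PubDate><Year>2017</Year><Month>Jun</Month><Day>09</Day></PubDate>"
)
no_day = ET.fromstring("<PubDate><Year>2017</Year><Month>Jun</Month></PubDate>")

print(pubdate_to_string(full))                       # 2017-Jun-09
print(pubdate_to_string(no_day))                     # 2017-Jun
print(pubdate_to_string(full, year_info_only=True))  # 2017
```

If the day is dropped for an element like `full`, the code at the lines linked above is the place to compare against.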