titipata / pubmed_parser

:clipboard: A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset
http://titipata.github.io/pubmed_parser/
MIT License
564 stars 164 forks source link

parse publication date #57

Closed yak1r closed 5 years ago

titipata commented 5 years ago

Thanks @yak1r for the PR! What's the purpose of adding the . in xpath? It's more consistent this way?

yak1r commented 5 years ago

My bad. I meant to open this PR in our fork of your repo, before attempting to push it into yours. Understandably, the . in xpath is debatable. I will get back to you when the forked version will work for us and is ready to be discussed.

titipata commented 5 years ago

No worry at all. Just to make sure! Let me know if you made the changes, I'll check again later. :)

yak1r commented 5 years ago

OK my version of the parser is ready so I can finally discuss those changes First, about the . in xpath, when I don't have it, I get an error when trying to find elements:

SyntaxError: cannot use absolute path on element

The xmls I'm running on are pmc articles from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/ (the xml.gz files of course). It might be that those xmls are different from the xmls you used when writing the parser, but if it works with both of them than perhaps relative path are better.

Second thing I added is the nxml argument. That's because the parser allows receiving xml content in a string, but then there is no way to tell it that this is an nxml and the namespace needs to be removed.

Third, today I noticed not all pmc xmls have day and month in the pub-date section, so I added 01 as the default (If the month or day aren't there, I assume any day within the relevant time frame is OK)

titipata commented 5 years ago

@yak1r thanks for the clarification! I think the PR looks good and I will do the squash and merge!