Closed yak1r closed 5 years ago
My bad. I meant to open this PR in our fork of your repo, before attempting to push it into yours.
Understandably, the .
in xpath
is debatable. I will get back to you when the forked version will work for us and is ready to be discussed.
No worry at all. Just to make sure! Let me know if you made the changes, I'll check again later. :)
OK my version of the parser is ready so I can finally discuss those changes
First, about the .
in xpath
, when I don't have it, I get an error when trying to find elements:
SyntaxError: cannot use absolute path on element
The xmls I'm running on are pmc articles from ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/
(the xml.gz files of course). It might be that those xmls are different from the xmls you used when writing the parser, but if it works with both of them than perhaps relative path are better.
Second thing I added is the nxml argument. That's because the parser allows receiving xml content in a string, but then there is no way to tell it that this is an nxml and the namespace needs to be removed.
Third, today I noticed not all pmc xmls have day and month in the pub-date section, so I added 01 as the default (If the month or day aren't there, I assume any day within the relevant time frame is OK)
@yak1r thanks for the clarification! I think the PR looks good and I will do the squash and merge!
Thanks @yak1r for the PR! What's the purpose of adding the
.
inxpath
? It's more consistent this way?