#112 Parse epub instead of mix of ppub and epub

titipata / pubmed_parser

A Python Parser for PubMed Open-Access XML Subset and MEDLINE XML Dataset

http://titipata.github.io/pubmed_parser/

MIT License


#112 Parse epub instead of mix of ppub and epub #141

Open nils-herrmann opened 1 month ago

nils-herrmann commented 1 month ago

The XML looks like this:

<pub-date pub-type="ppub">
    <month>9</month>
    <year>2005</year>
</pub-date>

<pub-date pub-type="epub">
    <day>31</day>
    <month>5</month>
    <year>2005</year>
</pub-date>

The old code mixed fields from both elements. The new implementation parses only the epub element.
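To illustrate the difference, here is a minimal sketch of reading all date fields from one `<pub-date>` element at a time. It is not the code from this PR: the helper name `read_pub_date` is hypothetical, the `<article-meta>` wrapper is only there to make the sample well-formed, and it uses the standard-library `xml.etree.ElementTree` for self-containment.

```python
import xml.etree.ElementTree as ET

# Sample taken from the XML above, wrapped in a root element so it parses.
SAMPLE = """
<article-meta>
  <pub-date pub-type="ppub">
    <month>9</month>
    <year>2005</year>
  </pub-date>
  <pub-date pub-type="epub">
    <day>31</day>
    <month>5</month>
    <year>2005</year>
  </pub-date>
</article-meta>
"""


def read_pub_date(root, pub_type):
    """Hypothetical helper: read year/month/day from a single <pub-date>."""
    node = root.find(f".//pub-date[@pub-type='{pub_type}']")
    if node is None:
        return None
    # All lookups are relative to `node`, so a missing <day> stays None
    # instead of being filled in from a sibling <pub-date> element.
    return {part: node.findtext(part) for part in ("year", "month", "day")}


root = ET.fromstring(SAMPLE)
epub = read_pub_date(root, "epub")
ppub = read_pub_date(root, "ppub")
```

Because every lookup is scoped to the selected element, the ppub result has no day, rather than borrowing `31` from the epub element.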

Michael-E-Rose commented 1 month ago

In general, the print publication date is more relevant. Otherwise, authors whose articles were published in the 1970s suddenly appear to still be publishing.

But it would be great to have a new attribute: epublication_date.
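As a rough sketch of what that could look like, the record below keeps the print date and the electronic date under separate keys. Only the name `epublication_date` comes from this thread; the `format_date` helper, the `publication_date` key, and the `YYYY-MM-DD` output format are assumptions made for illustration.

```python
def format_date(parts):
    """Hypothetical helper: join year/month/day strings as YYYY-MM-DD.

    Stops at the first missing part, so a date without a day becomes YYYY-MM.
    """
    if not parts or not parts.get("year"):
        return None
    out = [parts["year"]]
    for key in ("month", "day"):
        value = parts.get(key)
        if not value:
            break
        out.append(value.zfill(2))
    return "-".join(out)


# Values from the XML sample in this thread.
record = {
    "publication_date": format_date({"year": "2005", "month": "9"}),
    "epublication_date": format_date({"year": "2005", "month": "5", "day": "31"}),
}
```

Keeping the two dates in separate attributes lets downstream users choose the print date by default while still having the electronic date available.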

Also thanks for already updating the tests!

nils-herrmann commented 1 month ago

The new commit parses the collection date if ppub is missing, and uses a try-except to get pub_year.
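Roughly, the fallback can be sketched as follows. This is an illustration under assumptions, not the committed code: the helper names `find_pub_date` and `get_pub_year` are hypothetical, and it uses the standard-library `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def find_pub_date(root, pub_types=("ppub", "collection")):
    """Hypothetical helper: return the first <pub-date> matching, in order."""
    for pub_type in pub_types:
        node = root.find(f".//pub-date[@pub-type='{pub_type}']")
        if node is not None:
            return node
    return None


def get_pub_year(root):
    """Hypothetical helper: return the year as int, or None if unavailable."""
    node = find_pub_date(root)
    try:
        return int(node.findtext("year"))
    except (AttributeError, TypeError, ValueError):
        # AttributeError: no matching <pub-date>; TypeError: no <year> child;
        # ValueError: <year> text is not a number.
        return None


only_collection = ET.fromstring(
    "<article-meta>"
    "<pub-date pub-type='collection'><year>2005</year></pub-date>"
    "</article-meta>"
)
no_dates = ET.fromstring("<article-meta/>")
```

Here `get_pub_year(only_collection)` falls back to the collection date, while `get_pub_year(no_dates)` returns `None` instead of raising.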