mloesch / sickle

Sickle: OAI-PMH for Humans

Record Range Pull Inconsistency #46

Closed csrajath closed 3 years ago

csrajath commented 4 years ago

I am using the Python library Sickle to harvest metadata records from 'http://export.arxiv.org/oai2', and I only want records published between 2020-01-01 and 2020-01-10.

Below is my code block.

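from sickle import Sickle

# Connect to the arXiv OAI-PMH endpoint
sickle = Sickle('http://export.arxiv.org/oai2')
# 'from' is a Python keyword, so the arguments are passed as an unpacked dict;
# ignore_deleted is a boolean flag, not the string 'True'
records = sickle.ListRecords(**{'metadataPrefix': 'oai_dc', 'from': '2020-01-01', 'until': '2020-01-10', 'ignore_deleted': True})
# Print the metadata of the first harvested record only
for i in records:
    metadata = i.get_metadata()
    title = metadata.get('title')[0]
    print(metadata)
    break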

Yet the first record in the output was published to arXiv on 2007-05-14, well outside the requested range. This is confusing. Can you please help?

mloesch commented 3 years ago

The dates in the record headers are in the expected range, but the dates in the metadata differ. Maybe the metadata dates are the original publication dates, whereas the dates in the headers are when the records were published to arXiv.

That is beyond the scope of the library and should be clarified with arXiv directly.
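The difference between the two kinds of dates can be checked by reading them side by side. The following is a minimal sketch, not part of Sickle itself: `datestamp_in_range`, `describe`, and `first_record` are hypothetical helper names, and it assumes the `oai_dc` metadata exposes a `date` field. In OAI-PMH, `from` and `until` select on the header datestamp, so that is the value compared against the range. The sample record is a made-up stand-in (placeholder identifier and header datestamp; only the 2007-05-14 metadata date comes from the report above) so the helpers can be tried without a network call.

```python
from datetime import date
from types import SimpleNamespace


def datestamp_in_range(datestamp, start, end):
    """Return True if an OAI-PMH datestamp falls within [start, end]."""
    # Datestamps are YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ; the first
    # ten characters are the day in both cases.
    day = date.fromisoformat(datestamp[:10])
    return date.fromisoformat(start) <= day <= date.fromisoformat(end)


def describe(record):
    """Collect the header datestamp and the metadata dates of a record."""
    return {
        "identifier": record.header.identifier,
        "header_datestamp": record.header.datestamp,
        # Assumption: oai_dc metadata carries a 'date' field.
        "metadata_dates": record.metadata.get("date", []),
    }


def first_record(start="2020-01-01", end="2020-01-10"):
    """Fetch the first record in the range from arXiv (needs network)."""
    # Imported here so the helpers above work without sickle installed.
    from sickle import Sickle

    sickle = Sickle("http://export.arxiv.org/oai2")
    records = sickle.ListRecords(
        **{"metadataPrefix": "oai_dc", "from": start, "until": end},
        ignore_deleted=True,
    )
    for record in records:
        return describe(record)
    return None


# Made-up stand-in for a harvested record: the identifier and header
# datestamp are placeholders, the metadata date is the one reported above.
sample = SimpleNamespace(
    header=SimpleNamespace(
        identifier="oai:example:placeholder", datestamp="2020-01-06"
    ),
    metadata={"date": ["2007-05-14"]},
)
info = describe(sample)
header_ok = datestamp_in_range(info["header_datestamp"], "2020-01-01", "2020-01-10")
metadata_ok = [
    datestamp_in_range(d, "2020-01-01", "2020-01-10")
    for d in info["metadata_dates"]
]
```

Calling `first_record()` against the live endpoint returns the same dictionary shape for a real record. To keep only records whose metadata date is in the range, one option under these assumptions is to filter client-side with `datestamp_in_range` on the values in `metadata_dates`.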