openstate / open-cultuur-data

The back- and front-end code that powers the Open Cultuur Data API
http://opencultuurdata.nl/

Make extractors remember and use the last harvest date #20

Open coret opened 10 years ago

coret commented 10 years ago

Harvesting big collections over and over isn't the most efficient approach. Most APIs and protocols have a means to get records which have been added, changed or deleted after a specific datetime. To use this feature, when the source has it, the OCD backend has to 'remember' the last time the source was harvested.

I propose that extractors of 'since capable' sources check the getLastHarvested(indexname) function (from misc/lastharvested.py). If this function returns a datetime (not None), then the extractor's query to the source has to be extended with the source-specific 'since' query.

After the harvest has been completed, the setLastHarvested(indexname) function (from misc/lastharvested.py) has to be called. This function inserts or updates a 'record' for the specified indexname with the current datetime. I think a simple flat text file will suffice as a database (I'll leave this to capable Python programmers).
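To make the proposal concrete, here is a rough sketch of what misc/lastharvested.py could look like. It assumes Python 3 and a JSON file as the "flat text file". The file name, the `path` and `now` parameters and the `_load` helper are my own placeholders and not existing OCD code.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Hypothetical location of the flat file; the proposal does not specify one.
STATE_FILE = "lastharvested.json"


def _load(path):
    """Return the stored {indexname: ISO 8601 string} mapping (empty if no file yet)."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def getLastHarvested(indexname, path=STATE_FILE):
    """Return the last harvest datetime for indexname, or None if never harvested."""
    value = _load(path).get(indexname)
    return datetime.fromisoformat(value) if value else None


def setLastHarvested(indexname, path=STATE_FILE, now=None):
    """Insert or update the record for indexname (defaults to the current UTC time)."""
    state = _load(path)
    now = now or datetime.now(timezone.utc)
    state[indexname] = now.isoformat()
    # Write to a temporary file and rename it afterwards, so a crash
    # mid-write cannot leave a half-written state file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)
    return now
```

An extractor would call getLastHarvested(indexname) before querying the source and setLastHarvested(indexname) once the harvest has finished. One open design question: passing the time the harvest started as `now`, rather than the time it finished, might avoid missing records that change during a long harvest.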

In the list of defined extractors (in main and pull requests) there are currently two flavors of 'since capable' sources:

The extractor for the Rijksmuseum uses the 'search API', which doesn't seem to provide a mechanism to fetch only new or updated records. The Rijksmuseum also offers an OAI-PMH API, and that API is 'since capable'.

The Arts Holland extractor uses SPARQL; I'm not sure if there's an RDF attribute which can be queried for the add/update date.
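For the OAI-PMH route, the 'since' query would be the protocol's standard `from` argument on a ListRecords request. A minimal sketch of how an extractor might build that request, assuming Python 3: the endpoint URL is a placeholder (not the real Rijksmuseum address) and the function name is made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Placeholder endpoint, NOT the actual Rijksmuseum OAI-PMH URL.
BASE_URL = "https://example.org/oai"


def build_list_records_url(base_url, last_harvested=None, metadata_prefix="oai_dc"):
    """Build a ListRecords URL; add 'from' only when a previous harvest is known."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_harvested is not None:
        # OAI-PMH datestamps are in UTC. A naive datetime would be
        # interpreted as local time by astimezone().
        utc = last_harvested.astimezone(timezone.utc)
        params["from"] = utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return base_url + "?" + urlencode(params)
```

With None (first harvest) the whole collection is requested; with a datetime only records changed since then are returned. Seconds granularity is optional in OAI-PMH, so a repository that only supports day granularity would need `%Y-%m-%d` instead; the Identify response reports which one applies.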