Harvesting a big collection over and over isn't the most efficient approach. Most APIs and protocols have a means to get only the records which have been added, changed or deleted after a specific datetime. To use this feature, where the source offers it, the OCD backend has to 'remember' the last time the source was harvested.
I propose that extractors of 'since capable' sources check the getLastHarvested(indexname) function (from misc/lastharvested.py). If this function returns a datetime (not None), then the extractor's query to the source has to be extended with the source-specific 'since' query.
After the harvest has been completed, the setLastHarvested(indexname) function (from misc/lastharvested.py) has to be called. This function inserts or updates a 'record' for the specified indexname with the current datetime. I think a simple flat text file will suffice as a database (I'll leave this to capable Python programmers).
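To make the proposal concrete, here is a minimal sketch of what misc/lastharvested.py could look like. It is only an illustration: the JSON file format, the STATE_FILE name and the extra path and now parameters are my assumptions, not something that exists in the code base.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

# Hypothetical location of the flat file; the real path is up to the implementer.
STATE_FILE = "lastharvested.json"


def _load(path):
    """Return the stored {indexname: ISO timestamp} mapping, or {} if absent."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def getLastHarvested(indexname, path=STATE_FILE):
    """Return the last harvest time as a datetime, or None if never harvested."""
    stamp = _load(path).get(indexname)
    return datetime.fromisoformat(stamp) if stamp else None


def setLastHarvested(indexname, path=STATE_FILE, now=None):
    """Insert or update the record for indexname (defaults to the current UTC time)."""
    records = _load(path)
    now = now or datetime.now(timezone.utc)
    records[indexname] = now.isoformat()
    # Write to a temporary file first, then swap it in, so a crash mid-write
    # does not leave a truncated state file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)
    os.replace(tmp, path)
    return now
```

An extractor would call getLastHarvested(indexname) before building its query and setLastHarvested(indexname) once the harvest has finished. The optional now argument is there because it may be safer to store the time the harvest started, so that records changed during a long harvest are not missed next time; that is a design choice I leave open.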
In the list of defined extractors (in main and pull requests) there are currently two flavors of 'since capable' sources:
OpenSearch (Nationaal Archief, Archief Eemland)
See for example: http://www.gahetna.nl/beeldbank-api/opensearch/description-document: ' Query role="example" searchTerms="timestamp:["2011-02-09T00:00:00Z" TO *]" title="Show all records modified since Januari 9th, 2011" '
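As a rough sketch, the 'since' term from that example could be built from the stored datetime like this. The function name since_term is hypothetical, and how the term is combined with the rest of the searchTerms depends on the source, so that part is left out.

```python
from datetime import datetime, timezone


def since_term(last_harvested):
    """Return the timestamp range term, or None when there is no previous harvest."""
    if last_harvested is None:
        return None
    # Convert to UTC and use the same format as the example in the
    # description document. A naive datetime is treated as local time.
    stamp = last_harvested.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return 'timestamp:["%s" TO *]' % stamp
```

When since_term returns None the extractor simply runs its normal full query.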
The extractor for the Rijksmuseum uses the 'search API', which doesn't seem to provide a mechanism to fetch only new or updated records. The Rijksmuseum also offers an OAI-PMH API, which is 'since capable'.
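For OAI-PMH the 'since' query is the from argument of the ListRecords verb. A minimal sketch of building such a request URL follows; the function name and the base URL in the test are placeholders, not the real Rijksmuseum endpoint. I use day granularity (YYYY-MM-DD) because every OAI-PMH repository has to support it, while seconds granularity is optional. The cost is that records from the last harvest day may be fetched twice.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix, last_harvested=None):
    """Build a ListRecords URL, adding 'from' when a previous harvest is known."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_harvested is not None:
        # OAI-PMH datestamps are expressed in UTC.
        utc = last_harvested.astimezone(timezone.utc)
        params["from"] = utc.strftime("%Y-%m-%d")
    return base_url + "?" + urlencode(params)
```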
The Arts Holland extractor uses SPARQL; I'm not sure if there's an RDF attribute which can be queried for the add/update date.