Open pvgenuchten opened 3 weeks ago
How structured should the harvester sources table be?
When adding a new harvester, there should be no need to create a new table
A further benefit is that harvesting becomes a separate process from the analysis of the records. From this, one could expect a very basic structure:
hash | source-repository | content (xml or json document) | timestamp |
---|---|---|---|
uA4xf3h7 | bonares | <gmd:Md_Metadata .... | 2024-06-03 |
with source-repository linking to a table
source | type | endpoint | filter | harvest-interval | last-successful |
---|---|---|---|---|---|
bonares | csw | https://bonares.org/csw | | weekly | 2024-06-03 |
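
To make the two-table proposal concrete, here is a minimal sketch of the schema. It uses Python's built-in `sqlite3` as a stand-in so it runs without a server; the issue itself targets postgres. The table names `sources` and `records`, the underscore column names, and the `TEXT` column types are assumptions for illustration, not an agreed design.

```python
import sqlite3

# Hypothetical schema: one generic records table plus a sources table,
# so a new harvester only needs a new row in `sources`, not a new table.
SCHEMA = """
CREATE TABLE sources (
    source            TEXT PRIMARY KEY,
    type              TEXT NOT NULL,      -- e.g. csw
    endpoint          TEXT NOT NULL,
    "filter"          TEXT,               -- optional, may be empty
    harvest_interval  TEXT,
    last_successful   TEXT
);
CREATE TABLE records (
    hash       TEXT PRIMARY KEY,          -- hash of the harvested document
    source     TEXT NOT NULL REFERENCES sources(source),
    content    TEXT NOT NULL,             -- raw xml or json document
    timestamp  TEXT NOT NULL
);
"""

def create_db(path=":memory:"):
    """Create the two tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # enforce records.source -> sources.source
    conn.executescript(SCHEMA)
    return conn

conn = create_db()
# Register the example source from the table above (filter left empty).
conn.execute(
    "INSERT INTO sources VALUES (?, ?, ?, ?, ?, ?)",
    ("bonares", "csw", "https://bonares.org/csw", None, "weekly", "2024-06-03"),
)
conn.commit()
```

With the foreign key in place, a record can only be stored for a source that has been registered first.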
The benefit of using a hash as the identifier is that it makes it easy to check whether a record has been harvested before, and it also captures history when records are modified.
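
A rough sketch of how the hash could serve both purposes. The hash function is not specified in this issue; SHA-256 over the raw document text is an assumption here, and `record_hash` and `store_if_new` are hypothetical helper names. An in-memory dict stands in for the records table.

```python
import hashlib

def record_hash(content: str) -> str:
    """Hash the raw document; SHA-256 is an assumption, not a decision."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def store_if_new(seen: dict, source: str, content: str, timestamp: str) -> bool:
    """Store the record unless identical content was harvested before.

    Returns True if a new row was added, False if the hash already exists.
    """
    key = record_hash(content)
    if key in seen:
        return False  # unchanged record: nothing to do
    seen[key] = {"source": source, "content": content, "timestamp": timestamp}
    return True

seen = {}
first = store_if_new(seen, "bonares", "<record>v1</record>", "2024-06-03")
again = store_if_new(seen, "bonares", "<record>v1</record>", "2024-06-10")
changed = store_if_new(seen, "bonares", "<record>v2</record>", "2024-06-10")
```

Here the second call is skipped because the content is unchanged, while the modified document gets its own row, so the earlier version stays available as history. Two caveats of this sketch: whitespace-only differences produce a new hash unless the document is normalised first, and identical content from two different sources would map to the same row.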
Metadata harvests run as CI-CD tasks using owslib, oaiharvest or sparql
Initial work indicates that harvester sources are best stored in a postgres database prior to loading them into the triple store.
Some of the cleaning (deduplication, translation, augmentation) is best managed in the database.
This task aims to set up a harvester-sources database, with routines to populate it from the harvesters and to export it to the triple store.
DoD