harvester sources database

pvgenuchten commented 3 weeks ago

Initial work indicates that harvester sources can best be stored in a PostgreSQL database before they are loaded into the triple store.

some of the cleaning (deduplication, translation, augmentation) can best be managed in the database

This task aims to set up a harvester-sources database, plus routines to populate it from the harvesters and to export it to the triple store.

flowchart LR
    A[Cordis] -->|SPARQL| HS(Harvester sources)
    B[OpenAire] -->|DOI| HS
    c[Bonares] -->|CSW| HS
    HS --> MH(Harmonised)
    MA[Augmentation] --> MH
    MH --> MA
    MH --> Catalogue
    MH --> 3S[Triple Store]
    MH --> LLM

DoD (Definition of Done)

[x] Database schema
[x] Created database
[x] Harvest Cordis to database
[x] Augment DOIs from CORDIS with metadata from OpenAire (and crossref.org?)
[x] Harvest Bonares to database
[ ] Merge records from various sources to harmonised database
[ ] Export metadata to triple store
[ ] Optimise metadata to fit pycsw expectations

pvgenuchten commented 3 weeks ago

How structured should the harvester sources table be?

When adding a new harvester, there should be no need to create a new table

Another benefit is that harvesting becomes a separate process from the analysis of the records. From this, one could expect a very basic structure:

| hash | source-repository | content (xml or json document) | timestamp |
| --- | --- | --- | --- |
| uA4xf3h7 | bonares | `<gmd:Md_Metadata ....` | 2024-06-03 |

with source-repository linking to a table

| source | type | endpoint | filter | harvest-interval | last-successful |
| --- | --- | --- | --- | --- | --- |
| bonares | csw | https://bonares.org/csw | | weekly | 2024-06-03 |
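A minimal sketch of the two tables proposed above, to make the structure concrete. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server. The table names `sources` and `harvest_items`, the column types, and the constraints are assumptions, not the schema actually used in the project; hyphens in the column names are replaced by underscores.

```python
import sqlite3

# Assumed schema: names follow the proposal, types and constraints are guesses.
SCHEMA = """
CREATE TABLE sources (
    source           TEXT PRIMARY KEY,
    type             TEXT NOT NULL,   -- e.g. csw, sparql, doi
    endpoint         TEXT NOT NULL,
    "filter"         TEXT,            -- optional harvest filter
    harvest_interval TEXT,            -- e.g. weekly
    last_successful  TEXT             -- ISO 8601 date of last good harvest
);
CREATE TABLE harvest_items (
    hash              TEXT PRIMARY KEY,
    source_repository TEXT NOT NULL REFERENCES sources(source),
    content           TEXT NOT NULL,  -- raw XML or JSON document
    "timestamp"       TEXT NOT NULL   -- ISO 8601 harvest date
);
"""


def create_database(path=":memory:"):
    """Create the two tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection
    conn.executescript(SCHEMA)
    return conn


conn = create_database()

# The example rows from the tables above.
conn.execute(
    "INSERT INTO sources VALUES (?, ?, ?, ?, ?, ?)",
    ("bonares", "csw", "https://bonares.org/csw", None, "weekly", "2024-06-03"),
)
conn.execute(
    "INSERT INTO harvest_items VALUES (?, ?, ?, ?)",
    ("uA4xf3h7", "bonares", "<gmd:Md_Metadata ....", "2024-06-03"),
)
conn.commit()
```

Adding a new harvester then only means adding a row to `sources`; no new table is needed.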

The benefit of using a hash as the identifier is that it makes it easy to tell whether a record has been harvested before, and it captures history when records are modified.
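The hash idea could look roughly like the sketch below. It assumes the hash is a SHA-256 digest of the raw document (the thread does not say which hash function is used) and again uses `sqlite3` in place of PostgreSQL. The function names `content_hash` and `store_record` and the table name `harvest_items` are hypothetical.

```python
import hashlib
import sqlite3


def content_hash(content: str) -> str:
    """Hash of the raw document; SHA-256 is an assumption."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def store_record(conn, source, content, harvested_on):
    """Insert the record unless identical content is already stored.

    Returns True when a new row was written, False when it was a repeat.
    """
    cur = conn.execute(
        # PostgreSQL would use INSERT ... ON CONFLICT (hash) DO NOTHING instead.
        "INSERT OR IGNORE INTO harvest_items "
        "(hash, source_repository, content, timestamp) VALUES (?, ?, ?, ?)",
        (content_hash(content), source, content, harvested_on),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE harvest_items ("
    "hash TEXT PRIMARY KEY, source_repository TEXT, content TEXT, timestamp TEXT)"
)

first = store_record(conn, "bonares", "<record>v1</record>", "2024-06-03")
repeat = store_record(conn, "bonares", "<record>v1</record>", "2024-06-10")
modified = store_record(conn, "bonares", "<record>v2</record>", "2024-06-10")
stored = conn.execute("SELECT COUNT(*) FROM harvest_items").fetchone()[0]
```

An unchanged record is skipped on re-harvest, while a modified record gets a new hash and is stored next to the old version, which is what preserves the history. Hashing only the content means identical documents from two repositories would collapse into one row; including the source name in the hash input would avoid that if it matters.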

pvgenuchten commented 3 weeks ago

Metadata harvests run as CI/CD tasks using owslib, oaiharvest or SPARQL.

soilwise-he / harvesters

harvester sources database #22