soilwise-he / harvesters


harvester sources database #22

Open pvgenuchten opened 3 weeks ago

pvgenuchten commented 3 weeks ago

Initial work indicates that harvester sources are best stored in a Postgres database before they are loaded into the triple store.

Some of the cleaning (deduplication, translation, augmentation) is also best managed in the database.

This task aims to set up a harvester-sources database, with routines to populate it from the harvesters and to export it to the triple store.

flowchart LR
    A[Cordis] -->|SPARQL| HS(Harvester sources)
    B[OpenAire] -->|DOI| HS
    c[Bonares] -->|CSW| HS
    HS --> MH(Harmonised)
    MA[Augmentation] --> MH
    MH --> MA
    MH --> Catalogue
    MH --> 3S[Triple Store]
    MH --> LLM

DoD

pvgenuchten commented 3 weeks ago

How structured should the harvester sources table be?

When adding a new harvester, there should be no need to create a new table

Another benefit is that harvesting becomes a separate process from the analysis of the records. From this, one could expect a very basic structure:

| hash | source-repository | content (xml or json document) | timestamp |
| --- | --- | --- | --- |
| uA4xf3h7 | bonares | `<gmd:Md_Metadata ....` | 2024-06-03 |

with source-repository linking to a table

| source | type | endpoint | filter | harvest-interval | last-successful |
| --- | --- | --- | --- | --- | --- |
| bonares | csw | https://bonares.org/csw | | weekly | 2024-06-03 |
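
To make the two-table idea concrete, here is a minimal sketch of the structure. It is not a settled design: SQLite stands in for Postgres so that it runs without a server, the table names `sources` and `harvest_items`, the snake_case column names and all column types are assumptions, and the record content is a placeholder rather than a real document.

```python
import sqlite3

# SQLite stands in for Postgres here; names and types are assumptions
# derived from the two tables sketched in this comment.
SCHEMA = """
CREATE TABLE sources (
    name             TEXT PRIMARY KEY,  -- e.g. 'bonares'
    type             TEXT NOT NULL,     -- 'csw', 'sparql', 'doi', ...
    endpoint         TEXT NOT NULL,
    filter           TEXT,              -- optional; empty in the example
    harvest_interval TEXT,              -- e.g. 'weekly'
    last_successful  TEXT               -- ISO date of the last good run
);
CREATE TABLE harvest_items (
    hash              TEXT PRIMARY KEY,
    source_repository TEXT NOT NULL REFERENCES sources(name),
    content           TEXT NOT NULL,    -- raw XML or JSON document
    timestamp         TEXT NOT NULL
);
"""


def create_db(path=":memory:"):
    """Create the two tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection
    conn.executescript(SCHEMA)
    return conn


conn = create_db()
conn.execute(
    "INSERT INTO sources VALUES (?, ?, ?, ?, ?, ?)",
    ("bonares", "csw", "https://bonares.org/csw", None, "weekly", "2024-06-03"),
)
conn.execute(
    "INSERT INTO harvest_items VALUES (?, ?, ?, ?)",
    ("uA4xf3h7", "bonares", "<placeholder/>", "2024-06-03"),
)

# Resolve how a stored record was harvested by joining to its source.
row = conn.execute(
    "SELECT s.type, s.endpoint FROM harvest_items h "
    "JOIN sources s ON s.name = h.source_repository WHERE h.hash = ?",
    ("uA4xf3h7",),
).fetchone()
print(row)
```

Adding a new harvester then means adding one row to `sources`, not a new table.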

The benefit of using a hash as the identifier is that it makes it easy to see whether a record has been harvested before, and it captures history when records are modified.
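
A rough sketch of how the hash could do both jobs, under some assumptions: the hash is a SHA-256 hex digest of the raw content (the short example hash above suggests a different, unspecified scheme), SQLite again stands in for Postgres, and `content_hash` and `store_if_new` are hypothetical helper names.

```python
import hashlib
import sqlite3


def content_hash(content: str) -> str:
    """Assumed scheme: SHA-256 over the UTF-8 bytes of the raw document."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def store_if_new(conn, source: str, content: str, timestamp: str) -> bool:
    """Insert the record unless its hash is already stored.

    Returns True when a new row was written. In Postgres the equivalent
    statement would end in ON CONFLICT DO NOTHING.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO harvest_items "
        "(hash, source_repository, content, timestamp) VALUES (?, ?, ?, ?)",
        (content_hash(content), source, content, timestamp),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE harvest_items ("
    "hash TEXT PRIMARY KEY, source_repository TEXT, content TEXT, timestamp TEXT)"
)

first = store_if_new(conn, "bonares", "<record>v1</record>", "2024-06-03")
again = store_if_new(conn, "bonares", "<record>v1</record>", "2024-06-10")
changed = store_if_new(conn, "bonares", "<record>v2</record>", "2024-06-10")
print(first, again, changed)
```

An unchanged record is skipped on re-harvest, while a modified record gets a new hash and is stored next to the old version, which is what preserves the history. Because this sketch hashes the content alone, identical documents from two repositories would share one row; including the source name in the hash input would avoid that.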

pvgenuchten commented 3 weeks ago

Metadata harvests run as CI/CD tasks using owslib, oaiharvest or SPARQL.
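
For the CSW case, a harvest task might page through an endpoint roughly as follows. This is a sketch, not the project's harvester: `harvest_csw` and its `client_factory` parameter are hypothetical, the ISO 19139 output schema and the page size are assumptions, and it relies on owslib's `CatalogueServiceWeb.getrecords2`, `records` and `results` behaving as documented, with each record exposing its raw document as `.xml`.

```python
from typing import Callable, Iterator, Optional, Tuple

# ISO 19139 (gmd) output schema; assumed to be what the endpoint serves.
GMD = "http://www.isotc211.org/2005/gmd"


def harvest_csw(
    endpoint: str,
    page_size: int = 50,
    client_factory: Optional[Callable] = None,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (identifier, raw XML) for every record behind a CSW endpoint.

    client_factory is a hypothetical hook so the paging logic can be
    exercised without network access; by default owslib is used.
    """
    if client_factory is None:
        from owslib.csw import CatalogueServiceWeb  # third-party dependency
        client_factory = CatalogueServiceWeb

    csw = client_factory(endpoint)
    start = 0  # owslib's default start position
    while True:
        csw.getrecords2(
            esn="full", outputschema=GMD,
            startposition=start, maxrecords=page_size,
        )
        for identifier, record in csw.records.items():
            yield identifier, record.xml  # raw document, stored unparsed

        # owslib reports paging state in csw.results; nextrecord is 0
        # once the last page has been returned.
        next_record = csw.results.get("nextrecord", 0)
        matches = csw.results.get("matches", 0)
        if not csw.records or not next_record or next_record > matches:
            break
        start = next_record
```

In a CI/CD job this could be called as `for identifier, xml in harvest_csw("https://bonares.org/csw"): ...`, handing each raw document to the harvester-sources table without interpreting it.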