netwerk-digitaal-erfgoed / dataset-register

Components (API and crawler) for the NDE Dataset Register
https://datasetregister.netwerkdigitaalerfgoed.nl/api/
European Union Public License 1.2

Mark datasets that must be included in CLARIAH Dataset Registry #483

Open ddeboer opened 2 years ago

ddeboer commented 2 years ago

How can we annotate dataset descriptions in the NDE Registry to make it clear they should be harvested by CLARIAH?

Currently CLARIAH harvests all datasets for a selection of publishers. The publisher selection is configured on the CLARIAH side. Is having all datasets harvested the desired behaviour?

See https://github.com/CLARIAH/clariah-plus/issues/97.
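For illustration, the publisher-based harvesting described above could be sketched roughly as follows. This is a minimal sketch, not CLARIAH's actual code: the publisher IRIs, the dataset records and the `harvestable` helper are all hypothetical, since the real selection lives in configuration on the CLARIAH side.

```python
# Hypothetical publisher IRIs; the real selection is configured by CLARIAH.
SELECTED_PUBLISHERS = {
    "https://example.org/publisher/a",
    "https://example.org/publisher/b",
}


def harvestable(datasets, publishers=SELECTED_PUBLISHERS):
    """Return every dataset whose publisher is in the configured selection."""
    return [d for d in datasets if d["publisher"] in publishers]


# Hypothetical dataset descriptions, reduced to the two fields used here.
datasets = [
    {"id": "https://example.org/dataset/1", "publisher": "https://example.org/publisher/a"},
    {"id": "https://example.org/dataset/2", "publisher": "https://example.org/publisher/c"},
    {"id": "https://example.org/dataset/3", "publisher": "https://example.org/publisher/a"},
]

selected = harvestable(datasets)
print([d["id"] for d in selected])
```

With this approach every dataset of a selected publisher is harvested, which is exactly the behaviour the question above puts up for discussion.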

EnnoMeijers commented 2 years ago

Before that, we have to decide which party determines that a dataset should be marked for harvesting by CLARIAH, or which scenarios we need to support. I see three possibilities:

coret commented 2 years ago

- CLARIAH decides which datasets are of interest to them

And administer these markings in the CLARIAH infrastructure?

- institutes promote their dataset to be of interest for the CLARIAH user group

This would mean the source is changed, as directed by our adjusted requirements (in addition to includeInEuropeana?). Although I like this "at the source" option, I do wonder how many institutions will use an includeInClariah-like predicate. And can a dataset supplier decide that its dataset is relevant to CLARIAH, or will CLARIAH treat the promotion as a suggestion?

- the Datasetregister team decides which group of datasets could be relevant for use within CLARIAH (in close coordination with CLARIAH)

I do not like the extra manual work these annotations would mean for the team. And bear in mind, you'd have to 'judge' every new dataset (or set of datasets). NB: the same is of course true for the first option (CLARIAH).

ddeboer commented 1 year ago

@coret and I will think about a way to describe this in the dataset description RDF and requirements.

ddeboer commented 1 year ago

Proposal:

<dataset> schema:audience <https://www.europeana.eu> , 
  <https://clariah.nl> ,
  <https://www.collectienederland.nl> .

In DCAT/DCT that would be dct:audience.

We should document the enum of audiences in our requirements.
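A minimal sketch of how such an audience enum might be used, assuming the three IRIs from the proposal are the complete list (the requirements may end up defining others). The dataset records and the helper names `unknown_audiences` and `datasets_for` are hypothetical, for illustration only.

```python
# Audience IRIs taken from the proposal; assumed here to be the whole enum.
KNOWN_AUDIENCES = frozenset({
    "https://www.europeana.eu",
    "https://clariah.nl",
    "https://www.collectienederland.nl",
})


def unknown_audiences(dataset):
    """Return the audience values of a dataset that are not in the enum."""
    return sorted(set(dataset.get("audience", [])) - KNOWN_AUDIENCES)


def datasets_for(audience, datasets):
    """Return the datasets that list the given audience IRI."""
    if audience not in KNOWN_AUDIENCES:
        raise ValueError(f"not a documented audience: {audience}")
    return [d for d in datasets if audience in d.get("audience", [])]


# Hypothetical dataset descriptions, reduced to an id and audience list.
datasets = [
    {"id": "https://example.org/dataset/1",
     "audience": ["https://clariah.nl", "https://www.europeana.eu"]},
    {"id": "https://example.org/dataset/2",
     "audience": ["https://www.collectienederland.nl"]},
    {"id": "https://example.org/dataset/3",
     "audience": ["https://clariah.nl", "https://example.org/other"]},
    {"id": "https://example.org/dataset/4"},
]

for_clariah = datasets_for("https://clariah.nl", datasets)
print([d["id"] for d in for_clariah])
print(unknown_audiences(datasets[2]))
```

A harvester would then select on the audience value rather than on a publisher list, and a validator could warn about values outside the documented enum.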

coret commented 1 year ago

Implementation strategy:

ddeboer commented 1 year ago

We have decided for now to publish only KB, B&G and IISG because these are CLARIAH partners.