Mark datasets that must be included in CLARIAH Dataset Registry

netwerk-digitaal-erfgoed / dataset-register

Components (API and crawler) for the NDE Dataset Register

https://datasetregister.netwerkdigitaalerfgoed.nl/api/

European Union Public License 1.2


Mark datasets that must be included in CLARIAH Dataset Registry #483

Open ddeboer opened 2 years ago

ddeboer commented 2 years ago

How can we annotate dataset descriptions in the NDE Registry to make it clear they should be harvested by CLARIAH?

Currently CLARIAH harvests all datasets for a selection of publishers. The publisher selection is configured on the CLARIAH side. Is having all datasets harvested the desired behaviour?

See https://github.com/CLARIAH/clariah-plus/issues/97.

EnnoMeijers commented 2 years ago

Before that, we have to decide which party determines that datasets should be marked for harvesting by CLARIAH, or which scenarios we need to support. I see three possibilities:

- CLARIAH decides which datasets are of interest to them
- institutes promote their datasets as being of interest to the CLARIAH user group
- the Datasetregister team decides which group of datasets could be relevant for use within CLARIAH (in close coordination with CLARIAH)

coret commented 2 years ago

- CLARIAH decides which datasets are of interest to them

And administer these markings in the CLARIAH infrastructure?

- institutes promote their dataset to be of interest for the CLARIAH user group

This would mean the source is changed, directed by our adjusted requirements (besides the includeInEuropeana?). Although I like this "at the source" option, I do wonder how many institutions will use an includeInClariah-like predicate. And, can a dataset supplier decide that it is a relevant CLARIAH dataset, or will the promotion be handled as a suggestion by CLARIAH?

- the Datasetregister team decides which group of datasets could be relevant for use within CLARIAH (in close coordination with CLARIAH)

I do not like the extra manual work these annotations would mean for the team. And bear in mind, you'd have to 'judge' every new dataset (or set of datasets). NB: the same is of course true for the first option (CLARIAH).

ddeboer commented 1 year ago

@coret and I will think about a way to describe this in the dataset description RDF and requirements.

ddeboer commented 1 year ago

Proposal:

<dataset> schema:audience <https://www.europeana.eu> , 
  <https://clariah.nl> ,
  <https://www.collectienederland.nl> .

In DCAT/DCT that would be dct:audience.
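For an aggregator, selecting on this property could look roughly like the sketch below. It only builds a SPARQL query string and does not contact any endpoint. The function name `audience_query` is hypothetical, and the two prefix IRIs are assumptions about how the dataset descriptions are stored.

```python
# Assumed namespace IRIs; the register may store schema.org terms differently.
SCHEMA = "https://schema.org/"
DCT = "http://purl.org/dc/terms/"


def audience_query(audience_iri: str) -> str:
    """Build a SPARQL query for datasets that name audience_iri as an audience.

    Matches either schema:audience or dct:audience via a property path.
    """
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in audience_iri for ch in '<>" {}|\\^`'):
        raise ValueError(f"not a valid IRI: {audience_iri!r}")
    return (
        f"PREFIX schema: <{SCHEMA}>\n"
        f"PREFIX dct: <{DCT}>\n"
        "SELECT DISTINCT ?dataset WHERE {\n"
        f"  ?dataset schema:audience|dct:audience <{audience_iri}> .\n"
        "}"
    )


print(audience_query("https://clariah.nl"))
```

Using a property path keeps one query working for both the schema.org and the DCAT/DCT flavour of the description.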

We should document the enum of audiences in our requirements.
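As an illustration of what such an enum could be used for, here is a minimal sketch that flags audience values not in the documented list. The set `ALLOWED_AUDIENCES` simply repeats the three IRIs from the proposal and is an assumption, as is the helper name `unknown_audiences`; the real list would be whatever the requirements end up documenting.

```python
# Hypothetical enum: the three IRIs from the proposal, nothing more.
ALLOWED_AUDIENCES = frozenset({
    "https://www.europeana.eu",
    "https://clariah.nl",
    "https://www.collectienederland.nl",
})


def unknown_audiences(audiences):
    """Return the audience IRIs that are not in the enum, sorted."""
    return sorted(set(audiences) - ALLOWED_AUDIENCES)


declared = ["https://clariah.nl", "https://clariah.nl/", "https://example.org"]
print(unknown_audiences(declared))
# → ['https://clariah.nl/', 'https://example.org']
```

The comparison here is an exact string match, so a trailing slash already counts as a different audience. That may be a reason to spell out the exact IRIs in the requirements.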

coret commented 1 year ago

Implementation strategy:

- Document in the requirements: audience is set per dataset (not per data catalog or per organisation).
- Create a separate "audience" graph in which the Datasetregister team stores initial audience triples for KB/B&G datasets for CLARIAH.
- Inform audiences (= aggregators) about this selection method for their aggregation purposes.
- Invite dataset providers to use schema:audience and phase out the "audience" graph (= shift of responsibility).
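One way to picture the transition in the steps above is sketched below: audiences declared by the provider are combined with those in the team-maintained "audience" graph. The precedence rule (provider wins, the graph is only a fallback) is an assumption and not something decided in this thread, and the function name and the example.org dataset IRIs are hypothetical.

```python
def effective_audiences(provider, register_graph):
    """Combine provider-declared audiences with the register's audience graph.

    Both arguments map a dataset IRI to a set of audience IRIs.
    Assumption: once a provider declares any audience for a dataset, the
    register's entry for that dataset is ignored (the "phase out").
    """
    merged = {}
    for dataset in provider.keys() | register_graph.keys():
        declared = provider.get(dataset) or register_graph.get(dataset, set())
        merged[dataset] = set(declared)
    return merged


# Hypothetical data for illustration only.
provider = {
    "https://example.org/dataset/a": {"https://www.europeana.eu"},
}
register_graph = {
    "https://example.org/dataset/a": {"https://clariah.nl"},
    "https://example.org/dataset/b": {"https://clariah.nl"},
}

result = effective_audiences(provider, register_graph)
for dataset in sorted(result):
    print(dataset, sorted(result[dataset]))
```

With this rule, dataset `a` follows its provider and dataset `b` still relies on the register's graph. Taking the union of both sources would be the obvious alternative if the team's triples should stay in effect for longer.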

ddeboer commented 1 year ago

We have decided for now to publish only KB, B&G and IISG because these are CLARIAH partners.