Create (filtered) datacatalogs of harvested datasetdescriptions

netwerk-digitaal-erfgoed / dataset-register

Components (API and crawler) for the NDE Dataset Register

https://datasetregister.netwerkdigitaalerfgoed.nl/api/

European Union Public License 1.2


Create (filtered) datacatalogs of harvested datasetdescriptions #858

Open coret opened 5 months ago

coret commented 5 months ago

The set of harvested and converted dataset descriptions could also be published as one or more "NDE Datasetregister Catalogs" (in DCAT, as that is the model we use in the triplestore). One catalog could simply be the set of all datasets (i.e. unfiltered).

This could also benefit aggregators/harvesters like CLARIAH, Europeana and data.overheid.nl, if we introduce filters that limit the datasets in a catalog, e.g. to (a set of) publisher(s). The catalog could be "published" via the Dataset Register API or as static files, where the output, formatted as a DCAT data catalog, is the result of a SPARQL query with some configured filter (maybe via .rq files?).
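To make the "SPARQL query with a configured filter" idea concrete, here is a minimal sketch of how such a query could be generated. It is only an illustration: the function name `build_catalog_query`, the example IRIs and the assumption that datasets are stored with `dct:title` and `dct:publisher` are not taken from the register's code, and the real graph layout may differ.

```python
from string import Template
from typing import Iterable

# Hypothetical query template; the properties selected here are an assumption
# about how datasets are stored in the triplestore.
CATALOG_QUERY = Template("""\
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>

CONSTRUCT {
  <$catalog> a dcat:Catalog ;
    dcat:dataset ?dataset .
  ?dataset a dcat:Dataset ;
    dct:title ?title ;
    dct:publisher ?publisher .
}
WHERE {
  ?dataset a dcat:Dataset ;
    dct:title ?title ;
    dct:publisher ?publisher .
  $publisher_filter
}
""")


def check_iri(iri: str) -> str:
    """Reject characters that would break out of the <...> IRI syntax."""
    if any(ch in iri for ch in ' <>"'):
        raise ValueError(f"not a safe IRI: {iri!r}")
    return iri


def build_catalog_query(catalog_iri: str, publishers: Iterable[str] = ()) -> str:
    """Return a CONSTRUCT query for one catalog; no publishers means unfiltered."""
    iris = [check_iri(p) for p in publishers]
    if iris:
        values = " ".join(f"<{iri}>" for iri in iris)
        publisher_filter = f"VALUES ?publisher {{ {values} }}"
    else:
        publisher_filter = ""
    return CATALOG_QUERY.substitute(
        catalog=check_iri(catalog_iri), publisher_filter=publisher_filter
    )
```

The returned string could be saved as an .rq file per catalog, or run against the triplestore to produce a static DCAT file.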

These data catalogs would make heritage datasets easier to find.

coret commented 5 months ago

Another way to define the filters is to let dataset publishers specify the audience in the dataset description. This way the Dataset Register could generate an audience-specific data catalog (for Europeana, CLARIAH, DONL, ...).
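A rough sketch of the audience idea, assuming each harvested description has already been reduced to a dictionary with a hypothetical `audience` list (how publishers would actually express the audience, e.g. via `schema:audience`, is still an open question here):

```python
from collections import defaultdict


def catalogs_by_audience(datasets):
    """Group dataset IRIs into one catalog per declared audience."""
    catalogs = defaultdict(list)
    for dataset in datasets:
        # A dataset without an audience ends up in no audience-specific catalog.
        for audience in dataset.get("audience", []):
            catalogs[audience].append(dataset["id"])
    return dict(catalogs)


# Illustrative data only; these IRIs and audience labels are made up.
harvested = [
    {"id": "https://example.org/dataset/1", "audience": ["Europeana", "CLARIAH"]},
    {"id": "https://example.org/dataset/2", "audience": ["CLARIAH"]},
    {"id": "https://example.org/dataset/3"},
]
per_audience = catalogs_by_audience(harvested)
```

Note that a dataset with two audiences appears in two catalogs, so this approach also implies a many-to-many relation between datasets and catalogs.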

ddeboer commented 5 months ago

Note that we currently assume a dataset is part of a single catalog. If I understand you correctly, the relation dataset–catalog would become many–many.

coret commented 5 months ago

> Note that we currently assume a dataset is part of a single catalog.

Where does this assumption come from, or where is it coded? A heritage organisation can register one or more data catalogs, and these do not have to be disjoint in terms of datasets. I can imagine that when a dataset from catalog B is processed, and this dataset was also part of catalog A, the link between the dataset and catalog A would be overwritten with catalog B. Unless the dataset has a schema:includedInDataCatalog (which has cardinality 0..n) listing both catalog A and B.
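The difference between the two behaviours can be shown with a small in-memory sketch. This is not the register's actual code; the function names and the dictionary-based store are hypothetical and only illustrate single-valued versus multi-valued links.

```python
def link_overwriting(links, dataset, catalog):
    """Single-valued link: a later catalog replaces the earlier one."""
    links[dataset] = catalog


def link_accumulating(links, dataset, catalog):
    """Multi-valued link (cardinality 0..n): every catalog is kept."""
    links.setdefault(dataset, set()).add(catalog)


single = {}
link_overwriting(single, "dataset-1", "catalog-A")
link_overwriting(single, "dataset-1", "catalog-B")  # the link to catalog-A is lost

multi = {}
link_accumulating(multi, "dataset-1", "catalog-A")
link_accumulating(multi, "dataset-1", "catalog-B")  # both links survive
```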

> If I understand you correctly, the relation dataset–catalog would become many–many.

In theory yes, but I wonder if we should add triples. Why not just make files (easier to process by harvesters/aggregators)?