w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Distinguishing between stand alone and tightly coupled data services #1434

Open matthiaspalmer opened 2 years ago

matthiaspalmer commented 2 years ago

Should there be a way to indicate to data portals that certain data services are tightly coupled to datasets? The value would be to allow data portals to filter out such tightly coupled data services from searches as they provide little value when viewed independently.

I can imagine at least three cases of what the relations to datasets look like:

A. One to one - There is a tight coupling between a dataset and a data service. The dataset has a distribution that points to the data service via dcat:accessService. Inversely, the data service may point back to the dataset via dcat:servesDataset, but not to any other dataset.

B. One to many - One data service can serve data for several datasets, e.g. depending on the parameters provided when accessing the service, different data are returned, and these data belong to different datasets. This is expressed by several datasets having distributions that point to the same data service. Inversely, the data service may point back to all the datasets it serves via dcat:servesDataset.

C. Independent - A data service has no connection to a dataset. There can be several reasons for this, e.g. it is a data transformation service (think of a currency converter).
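
As a rough illustration of the three cases, the sketch below classifies a data service by counting the datasets linked to it, either through dcat:distribution followed by dcat:accessService, or through dcat:servesDataset. Triples are modelled as plain Python tuples of prefixed names. The function names and all `ex:` identifiers are hypothetical and not part of DCAT.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def datasets_for_service(triples: Iterable[Triple], service: str) -> Set[str]:
    """Collect the datasets linked to `service` in either direction."""
    triples = list(triples)
    # Distributions whose dcat:accessService is the service.
    dists = {s for s, p, o in triples if p == "dcat:accessService" and o == service}
    # Datasets that own one of those distributions.
    linked = {s for s, p, o in triples if p == "dcat:distribution" and o in dists}
    # Datasets the service points back to via dcat:servesDataset.
    linked |= {o for s, p, o in triples if p == "dcat:servesDataset" and s == service}
    return linked


def classify_service(triples: Iterable[Triple], service: str) -> str:
    """Map a service to case A, B or C based on how many datasets it is linked to."""
    n = len(datasets_for_service(triples, service))
    if n == 0:
        return "C: independent"
    if n == 1:
        return "A: one to one"
    return "B: one to many"


# Hypothetical example data (assumption, for illustration only).
example = [
    ("ex:ds1", "dcat:distribution", "ex:dist1"),
    ("ex:dist1", "dcat:accessService", "ex:coupledApi"),
    ("ex:ds2", "dcat:distribution", "ex:dist2"),
    ("ex:ds3", "dcat:distribution", "ex:dist3"),
    ("ex:dist2", "dcat:accessService", "ex:sharedApi"),
    ("ex:dist3", "dcat:accessService", "ex:sharedApi"),
]

case_coupled = classify_service(example, "ex:coupledApi")
case_shared = classify_service(example, "ex:sharedApi")
case_converter = classify_service(example, "ex:currencyConverter")
```

Note that this only counts links present in the metadata; a service that serves several datasets but is described with a single link would still be classified as case A.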

Different data portals may and will of course handle this differently. In the Swedish data portal it has been deemed that data services that fall under scenario A provide no extra value when shown in the search. This is due to the fact that they are very similar to the connected dataset and would in many cases look like duplicates. Note that the Swedish data portal has chosen to have a common search against "Data and APIs", as most people won't know or care about the difference from a search perspective; they just want to find the data they are looking for, independently of how it can be accessed. On the other hand, data services corresponding to scenarios B and C are described in more detail and are shown as search results independently.

To accomplish this, Swedish data providers have been encouraged to do one of two things to mark a data service as tightly coupled:

  1. Exclude the dcat:service relation from the catalog, or
  2. Don't provide a dcterms:publisher

Note that a data service that is not pointed to directly from the catalog can still be considered to be part of the catalog based on either being reachable via a distribution or by being part of a certain RDF graph (shipped in the same file).
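
A minimal sketch of how a portal might apply this convention, assuming triples are held as plain Python tuples of prefixed names: a data service is treated as tightly coupled if the catalog has no dcat:service link to it, or if it has no dcterms:publisher. The function names and `ex:` identifiers are hypothetical, and this is not the Swedish data portal's actual implementation.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def is_tightly_coupled(triples: List[Triple], catalog: str, service: str) -> bool:
    """Apply the two markers: missing dcat:service link or missing dcterms:publisher."""
    listed = (catalog, "dcat:service", service) in triples
    has_publisher = any(
        s == service and p == "dcterms:publisher" for s, p, _ in triples
    )
    # Either marker on its own is enough to flag the service.
    return not (listed and has_publisher)


def standalone_services(
    triples: List[Triple], catalog: str, services: List[str]
) -> List[str]:
    """Keep only the services that should appear as independent search results."""
    return [s for s in services if not is_tightly_coupled(triples, catalog, s)]


# Hypothetical example data (assumption, for illustration only).
metadata = [
    ("ex:catalog", "dcat:service", "ex:statsApi"),
    ("ex:statsApi", "dcterms:publisher", "ex:agency"),
    ("ex:catalog", "dcat:service", "ex:noPublisherApi"),
    ("ex:dist1", "dcat:accessService", "ex:unlistedApi"),
    ("ex:unlistedApi", "dcterms:publisher", "ex:agency"),
]

visible = standalone_services(
    metadata, "ex:catalog", ["ex:statsApi", "ex:noPublisherApi", "ex:unlistedApi"]
)
```

In this example only `ex:statsApi` would be shown in search. `ex:unlistedApi` is filtered out even though it is still reachable via a distribution.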

I think it would be useful to have a usage note about this. If that is not deemed suitable for DCAT 3, I think it would at least be good to make sure that there is nothing in the specification that hinders profile developers from expressing this without being in conflict with it (I think it is ok as the specification is written today, but more eyes on this would be appreciated).

smrgeoinfo commented 2 years ago

The only case in which it would make sense to me to have a DataService as a catalog entry would be case C in your examples, with a scenario like 'I'm looking for a service that does function X and uses API Y'.

Otherwise it seems to me that it's only useful to consider the DataService as the target of Dataset-->distribution-->Distribution-->accessService. Don't users come into a data catalog looking for data, and then select a distribution that is useful for their purpose?

matthiaspalmer commented 2 years ago

@smrgeoinfo I thought so at first, but we have a case with Statistics Sweden, who provide over 4000 datasets, all of which are accessible through a single API using various filters. Hence, describing this API in some detail so that it can be found independently as a high-profile API is sensible, at least to me. As a note, we only point from the distributions to the data service; pointing back from the data service to the 4000 datasets won't make anyone happy. In fact, it just makes loading the metadata in a UI more sluggish.
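
To illustrate why the back-pointers are not strictly needed, here is a small sketch that derives the datasets served by each data service from the distribution side alone, in a single pass over the triples. It assumes triples are plain Python tuples of prefixed names; the function name `index_services` and the `ex:` identifiers are hypothetical, and the three datasets stand in for the much larger real catalog.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def index_services(triples: Iterable[Triple]) -> Dict[str, Set[str]]:
    """Build service -> datasets from dcat:distribution and dcat:accessService only."""
    dist_to_datasets: Dict[str, Set[str]] = defaultdict(set)
    dist_to_services: Dict[str, Set[str]] = defaultdict(set)
    for s, p, o in triples:
        if p == "dcat:distribution":
            dist_to_datasets[o].add(s)
        elif p == "dcat:accessService":
            dist_to_services[s].add(o)
    index: Dict[str, Set[str]] = defaultdict(set)
    for dist, services in dist_to_services.items():
        for service in services:
            # Distributions without a known dataset contribute nothing.
            index[service] |= dist_to_datasets.get(dist, set())
    return dict(index)


# Hypothetical example data (assumption): no dcat:servesDataset triples are stored.
catalog_triples = []
for i in range(3):
    catalog_triples.append((f"ex:ds{i}", "dcat:distribution", f"ex:dist{i}"))
    catalog_triples.append((f"ex:dist{i}", "dcat:accessService", "ex:statsApi"))

served = index_services(catalog_triples)
served_count = len(served["ex:statsApi"])
```

With an index like this, a UI could show a count or a paged list of served datasets on demand, without loading thousands of dcat:servesDataset statements together with the service description.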

smrgeoinfo commented 2 years ago

@matthiaspalmer -- that gets tricky, and sounds like there might be some overlap with the #1429 discussion about how to handle series. I still suspect a dataset record about 'Statistics Sweden database' with a distribution//accessService pointing to a DataService object about the API (mostly detailed in the dcat:endpointDescription) would be a workable solution.

matthiaspalmer commented 2 years ago

@smrgeoinfo I would not say the 4000 datasets are in a series. They are of the same character, i.e. datasets of a statistical nature that all have the same kind of data structure, which makes it easy to serve them from the same API, but that does not make them into a series.

I think presenting the whole API as a single dataset is not appropriate; there is a wide range of different data in there. Showing it as a standalone data service that can be found just like datasets works quite well when there is a visual indicator that this is a service rather than a dataset.

smrgeoinfo commented 2 years ago

"...presenting the whole API as a single dataset is not appropriate ... a standalone data service ... like datasets works quite well ..."

Hmm, what does the service serve? The API is one thing, the backend database (your stats data) is another. Since it doesn't appear that the backend is described as multiple datasets, if it's not a single dataset (that is, a big relational database), what is it?

riccardoAlbertoni commented 1 year ago

Marked as future work, as this is one of a bunch of issues pertaining to data services (see https://github.com/w3c/dxwg/projects/12) that we might want to reconsider in a new perspective in a next round of standardization.