w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Service vs endpoint #1242

Open andrea-perego opened 4 years ago

andrea-perego commented 4 years ago

I propose we consider revisiting the service-endpoint relationship, based on existing examples where this relationship is 1-to-many.

DCAT 2 includes two properties - dcat:endpointURL and dcat:endpointDescription - for specifying a service endpoint, plus property dct:conformsTo, which is used to specify its "protocol". As these properties take as subject a dcat:DataService, the service-endpoint relationship is 1-to-1.

There are however cases where services have more than one endpoint.

As an example, see the following record:

https://sdi.eea.europa.eu/catalogue/idp/api/records/29e08b66-e6f6-4b4f-95ad-b582d9fe3df5

The record describes a geospatial "view" service (i.e., a service portraying data on a map) with 2 endpoints, both serving the same dataset, but using different protocols (WMS and ArcGIS REST), and with different endpoint descriptions and URLs. Transformed into DCAT, this record will then be as follows:

:eea_v_4326_250_k_wise-eionet-monitoring-sites_service a dcat:DataService ;
  ...
  dcat:endpointDescription <https://water.discomap.eea.europa.eu/arcgis/rest/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer> ;
  dcat:endpointURL <https://water.discomap.eea.europa.eu/arcgis/rest/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer> ;
  dct:conformsTo <https://developers.arcgis.com/rest/> ;
  dcat:endpointDescription <https://water.discomap.eea.europa.eu/arcgis/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer/WMSServer?request=GetCapabilities&service=WMS> ;
  dcat:endpointURL <https://water.discomap.eea.europa.eu/arcgis/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer/WMSServer> ;
  dct:conformsTo <http://www.opengeospatial.org/standards/wms> ;
  dcat:servesDataset :eea_v_4326_250_k_wise-eionet_p_2001-2020_v01_r04 ;
  ...
.
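
A side effect of the flattened record above may be worth spelling out: an RDF graph is an unordered set of triples, so once both endpoints hang off the same subject, nothing records which URL belongs with which description or protocol. The following is a minimal Python sketch of that ambiguity, using plain tuples rather than an RDF library; the shortened `ex:` identifiers are placeholders standing in for the real EEA URLs.

```python
from itertools import product

# Triples for the single-service record, with shortened placeholder IRIs.
# A set is used on purpose: RDF graphs carry no statement order.
SERVICE = "ex:service"
triples = {
    (SERVICE, "dcat:endpointURL", "ex:rest-url"),
    (SERVICE, "dcat:endpointDescription", "ex:rest-description"),
    (SERVICE, "dct:conformsTo", "ex:arcgis-rest"),
    (SERVICE, "dcat:endpointURL", "ex:wms-url"),
    (SERVICE, "dcat:endpointDescription", "ex:wms-capabilities"),
    (SERVICE, "dct:conformsTo", "ex:wms"),
}


def values(subject: str, predicate: str) -> list[str]:
    """Return all objects for (subject, predicate), sorted for stable output."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


urls = values(SERVICE, "dcat:endpointURL")
protocols = values(SERVICE, "dct:conformsTo")

# Every URL/protocol combination is equally consistent with the graph;
# the two intended pairings cannot be told apart from the two wrong ones.
candidate_pairings = list(product(urls, protocols))
```

A consumer reading only these triples sees two URLs and two protocols, and has to guess how they combine.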

This case is outlined in DCAT 2 in Example 49, where the solution is to duplicate the service record, changing only the endpoint protocol, description, and URL. So, the record above should result in two different ones:

:eea-rest a dcat:DataService ;
  ...
  dcat:endpointDescription <https://water.discomap.eea.europa.eu/arcgis/rest/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer> ;
  dcat:endpointURL <https://water.discomap.eea.europa.eu/arcgis/rest/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer> ;
  dct:conformsTo <https://developers.arcgis.com/rest/> ;
  dcat:servesDataset :eea_v_4326_250_k_wise-eionet_p_2001-2020_v01_r04 ;
  ...
.

:eea-wms a dcat:DataService ;
  ...
  dcat:endpointDescription <https://water.discomap.eea.europa.eu/arcgis/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer/WMSServer?request=GetCapabilities&service=WMS> ;
  dcat:endpointURL <https://water.discomap.eea.europa.eu/arcgis/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer/WMSServer> ;
  dct:conformsTo <http://www.opengeospatial.org/standards/wms> ;
  dcat:servesDataset :eea_v_4326_250_k_wise-eionet_p_2001-2020_v01_r04 ;
  ...
.

I think this approach should be complemented with the possibility of instead keeping a single service with 2 endpoints, as in the original record. It would be up to data providers to decide which option suits them best.

A possible solution is to define a new property (e.g., dcat:endpoint), which specifies the endpoint (possibly typed itself as a dcat:DataService), along with the endpoint description, URL and protocol.

The example above would then be re-written as follows:

:eea_v_4326_250_k_wise-eionet-monitoring-sites_service a dcat:DataService ;
  ...
  dcat:endpoint [ a dcat:DataService ;
    dcat:endpointDescription <https://water.discomap.eea.europa.eu/arcgis/rest/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer> ;
    dcat:endpointURL <https://water.discomap.eea.europa.eu/arcgis/rest/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer> ;
    dct:conformsTo <https://developers.arcgis.com/rest/> ;
  ] ;
  dcat:endpoint [ a dcat:DataService ;
    dcat:endpointDescription <https://water.discomap.eea.europa.eu/arcgis/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer/WMSServer?request=GetCapabilities&service=WMS> ;
    dcat:endpointURL <https://water.discomap.eea.europa.eu/arcgis/services/WISE_SoE/EIONET_MonitoringSite_WM/MapServer/WMSServer> ;
    dct:conformsTo <http://www.opengeospatial.org/standards/wms> ;
  ] ;
  dcat:servesDataset :eea_v_4326_250_k_wise-eionet_p_2001-2020_v01_r04 ;
  ...
.
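
To make the relationship between the two styles concrete, here is a minimal Python sketch, assuming the proposed `dcat:endpoint` property (which is only a suggestion in this issue, not part of DCAT). The class names, the shortened `ex:` identifiers, and the numeric suffix used to name the split records are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Endpoint:
    """One value of the proposed (hypothetical) dcat:endpoint property."""
    url: str            # dcat:endpointURL
    description: str    # dcat:endpointDescription
    protocol: str       # dct:conformsTo


@dataclass
class DataService:
    identifier: str
    serves_dataset: str                       # dcat:servesDataset
    endpoints: list[Endpoint] = field(default_factory=list)


def split_per_endpoint(service: DataService) -> list[DataService]:
    """Expand one multi-endpoint service into Example 49 style records."""
    return [
        DataService(f"{service.identifier}-{i}", service.serves_dataset, [ep])
        for i, ep in enumerate(service.endpoints, start=1)
    ]


combined = DataService(
    "ex:view-service",
    "ex:dataset",
    [
        Endpoint("ex:rest-url", "ex:rest-description", "ex:arcgis-rest"),
        Endpoint("ex:wms-url", "ex:wms-capabilities", "ex:wms"),
    ],
)
per_endpoint = split_per_endpoint(combined)
```

Going from the single-service form to the duplicated form is mechanical under these assumptions. The reverse direction would need some rule for deciding which duplicated records describe the same service.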
jakubklimek commented 4 years ago

@andrea-perego What is the benefit of describing this case (and similar cases) as one service with multiple endpoints as opposed to multiple services?

> geospatial "view" service (i.e., a service portraying data on a map) with 2 endpoints, both serving the same dataset, but using different protocols (WMS and ArcGIS REST)

Why is this one service and not two services? What makes the identity of a service?

Creating two ways of describing this makes it again harder for consumers of such metadata to work with it - more possibilities = harder to work with. There needs to be a good benefit to this new approach to justify the increased complexity.

smrgeoinfo commented 4 years ago

I have been approaching this from the point of view of someone looking first for data, and then for how to access the data. From this point of view, the DataSet is the primary resource of interest, and different services to access the data would be different distributions. This is similar to Example 49, but I would add distribution elements in the DataSet object that point to the DataService objects via the accessService property.

The question of what identifies a service is key. I'd argue that, from the point of view of an application parsing the metadata for a dataset to determine how to get the data in a format it can use, the service is defined by:

1. the protocol for communicating (transport, request syntax and semantics)
2. the information model for the content
3. the serialization scheme(s) for the data in service responses (XML, XML Schema, JSON, JSON Schema, RDF, RDF vocabulary)
4. the operations that the service offers [edit 2020-07-05: I left off this important one]
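
The dataset-first navigation described above could be sketched roughly as follows. This is a minimal Python illustration, not an implementation of any DCAT tooling: the class shapes, the `find_endpoint` helper, and the `ex:` identifiers are assumptions made for the example, while the property names in the comments are the DCAT 2 ones mentioned in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataService:
    endpoint_url: str     # dcat:endpointURL
    protocol: str         # dct:conformsTo


@dataclass(frozen=True)
class Distribution:
    access_service: DataService   # dcat:accessService


@dataclass(frozen=True)
class Dataset:
    title: str
    distributions: tuple[Distribution, ...]   # dcat:distribution


def find_endpoint(dataset: Dataset, protocol: str) -> Optional[str]:
    """Return the endpoint URL of the first service speaking `protocol`."""
    for dist in dataset.distributions:
        if dist.access_service.protocol == protocol:
            return dist.access_service.endpoint_url
    return None


monitoring_sites = Dataset(
    "monitoring sites",
    (
        Distribution(DataService("ex:rest-url", "ex:arcgis-rest")),
        Distribution(DataService("ex:wms-url", "ex:wms")),
    ),
)
wms_url = find_endpoint(monitoring_sites, "ex:wms")
```

Here the client starts from the dataset and picks whichever distribution offers a protocol it can use; each protocol corresponds to its own service record.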

andrea-perego commented 4 years ago

@jakubklimek , @smrgeoinfo , thanks for your feedback, and sorry for my late reply.

This issue is yet to be discussed by DXWG, so no position has been taken for the moment. However, IMO, you rightly pointed out the key issue here, namely "what identifies a service".

I perfectly agree that, from a data-centric perspective, a service may correspond to an endpoint, using a given protocol, etc. This is actually what has been implemented in DCAT 2 for distributions accessible via a service/API. However, in DCAT 2, services/APIs have also been introduced as first-class citizens, and, as such, their existence may not necessarily be bound to the data they serve. It is therefore arguable whether a service actually corresponds to an endpoint.

My example was related to what happens in the geospatial domain, where services are first-class citizens of a catalogue. There, as you know, a service is not identified by the endpoint protocol, but rather by a conceptual definition of its "type" (download, view, transformation, invoke service), which can be implemented by using different protocols.

This may not be considered now the "right" way of doing it - there's a lot of discussion about the disadvantages of the service-centred approach used in the geo domain. But, as a matter of fact, there's a wide community following this approach, and producing metadata which are also made available using DCAT - in addition to their native ISO 19115 / ISO 19119 / ISO 19139 representation.

As you say, @jakubklimek , having two different ways of describing the same thing does not help usability. This is definitely something that the DXWG will take into account about this issue. However, this needs to be carefully weighed against the fact that the way services are currently represented in DCAT may prevent / limit its use in specific communities.

smrgeoinfo commented 4 years ago

I'd argue that the only services that make sense to include as 'first class citizens' (geospatial or not) are processing services. I don't see the logic of cataloging a data service (view, download) separately from the data it serves.
What is an endpoint? Suggestion: an endpoint is a web location (identified by a base URL) at which one can access a particular service. That service might offer various processing and data access options, but in the end there should be a service specification that is implemented at that endpoint that accounts for the 4 aspects mentioned above. An endpoint is a particular implementation of a service, possibly with bindings to particular data. I'd suggest that 'service' be thought of as a specification that might have various software implementations, and each of those implementations might be exposed by various endpoints, and those endpoints might have binding to different data.
Service: protocol, information model, interchange format(s) used, operations offered
Endpoint: URL, service specification, coupled data (optional)
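
The Service/Endpoint split suggested above can be sketched as two small record types. This is only an illustration in Python under stated assumptions: the class and field names are made up for the example, the `example.org` URLs and `ex:` dataset identifiers are placeholders, and the field values for the WMS specification are illustrative rather than a complete description of WMS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceSpec:
    """A service understood as a specification, per the four aspects."""
    protocol: str
    information_model: str
    interchange_formats: frozenset[str]
    operations: frozenset[str]


@dataclass(frozen=True)
class Endpoint:
    """A web location where one implementation of a spec is exposed."""
    url: str
    spec: ServiceSpec
    coupled_data: Optional[str] = None   # optional binding to a dataset


wms = ServiceSpec(
    protocol="WMS",
    information_model="map layers",
    interchange_formats=frozenset({"image/png"}),
    operations=frozenset({"GetCapabilities", "GetMap"}),
)

# Two endpoints at different locations, bound to different data...
a = Endpoint("https://example.org/a/wms", wms, coupled_data="ex:dataset-a")
b = Endpoint("https://example.org/b/wms", wms, coupled_data="ex:dataset-b")

# ...still expose the same service under this definition.
same_service = a.spec == b.spec
```

Under this reading, service identity is carried by the specification alone, and the URL and coupled data only distinguish endpoints.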

dr-shorthair commented 4 years ago

> I'd argue that the only services that make sense to include as 'first class citizens' (geospatial or not) are processing services.

This was discussed at some length during the development of DCAT2. The key points are:

  1. The simplest download service - which just gets a file from a file-system - might not merit a service description much beyond 'conforms to HTTP v1.1'. However, there is a rich spectrum beyond that. At the very least, even if the data is coming from a static datastore, there is usually a query mechanism to select an extract from the whole, both in terms of which records are retrieved, and which properties (columns). And a data service usually provides at least a method to project the result according to some 'schema' described in the request. Then there may be re-sampling, or coordinate transformation, or other processing as well.

The client needs a way to find out about these options - selection/query operations, parameter ranges, response schema and format options - so there needs to be a service description somewhere (it may be implied by the standard that it conforms to).

  2. Some services are tightly bound to a single datastore, some not. But even a processing service that can connect to multiple data sources is also initialized with (i.e. bound to) some 'data' - for example coordinate transformation parameters, or coefficients used in some other numeric.

So I don't think it is so clear where the boundary between 'download' and 'processing' is.

Every service we are interested in in the DCAT context delivers 'data'. Sometimes this is retrieved from a static(-ish) store. Sometimes generated on-the-fly somehow. But the client machine doesn't know and usually doesn't care what happens behind the interface. That's why, after an initial discussion about a small taxonomy of service-types, we decided to just have a single class dcat:DataService.

smrgeoinfo commented 4 years ago

@dr-shorthair I think we're in agreement.

dr-shorthair commented 4 years ago

Also note this W3C note which proposes extensions to schema.org aligned with the DCAT2 model for DataService description - https://webapi-discovery.github.io/rfcs/rfc0001.html

bertvannuffelen commented 3 years ago

In Flanders this discussion has resulted in the following rough guideline:

A Data Service is smart access to one or more Datasets, where the access to the data can be manipulated by the user.
A Distribution of a Dataset is a way to access the data without any user interaction. The user can obtain the complete data as described by the Dataset by a sequence of connected paginated requests.

That means that a download service which allows downloading part of the geographic data based on the user's selection criteria is a data service, while a completely configured call to the same service that downloads all the geographic information described in a dataset is a distribution of that dataset.
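
Read as a decision rule, the guideline above might look like the following minimal Python sketch. The `AccessPoint` fields, the `classify` function, and the `example.org` URLs (including the `all=true` query parameter) are hypothetical names chosen for illustration; the guideline itself is only the rough wording given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPoint:
    url: str
    user_can_shape_request: bool   # e.g. filters or selection criteria
    yields_complete_dataset: bool  # possibly via linked paginated requests


def classify(access: AccessPoint) -> str:
    """Apply the rough guideline; returns a DCAT class name."""
    if access.user_can_shape_request:
        return "dcat:DataService"
    if access.yields_complete_dataset:
        return "dcat:Distribution"
    return "unclassified"  # the guideline does not cover this case


# A download service where the user selects a region of interest.
selective_download = AccessPoint("https://example.org/download", True, False)

# A fully configured call to the same service returning everything.
full_export = AccessPoint("https://example.org/download?all=true", False, True)

kinds = (classify(selective_download), classify(full_export))
```

The same underlying software can therefore appear twice in a catalogue, once as a data service and once as a distribution, depending on how the call is configured.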

smrgeoinfo commented 3 years ago

@bertvannuffelen

> A Data Service is smart access to one or more Datasets, where the access to the data can be manipulated by the user.

'Can be manipulated by the user' implies some kind of API beyond the basic HTTP GET (with no parameters). The current model of Distribution in DCAT provides a downloadURL; this seems to correspond to what you would redefine as a distribution. dcat:Distribution also has an accessURL, for situations where the user has to go to a landing page from which they can actually access the data; under your suggestion this would apparently be a DataService, since it requires user interaction.

To me the key intention of DataService is that it defines an API to automate interaction, beyond the basic HTTP CRUD operations operating on files that a web browser does (accessURL and downloadURL handle these cases). Humans interact with the dataService through some software application that uses functionality of the DataService API. See https://github.com/w3c/dxwg/issues/1242#issuecomment-647195521, above

rob-metalinkage commented 3 years ago

I think there is an inherent problem with scoping a general property like dct:conformsTo to a specific range - the identifiers of protocol APIs.

Taking on board the multiple different pieces of information identified as required for the user story (independent of the priority assumed), it would appear that the requirement is for one of:

1. dct:conformsTo be multivalued, and the objects support a type query to work out which value relates to which aspect
2. sub-properties of dct:conformsTo for specific aspects
3. range of dct:conformsTo to be a qualified association object that describes different aspects
4. range of dct:conformsTo to be a canonical model for a composite specification capable of describing all aspects
5. do nothing and leave users none the wiser about what dct:conformsTo might actually mean
6. continue to restrict semantics of dct:conformsTo to API identifiers and formally publish DCAT as a profile of dcterms to give this restriction and its relationship to dcterms a machine readable home (leaving users none the wiser about the other aspects)
7. something else?

This is orthogonal, AFAICT, to the DataService vs. Distribution matter, except that for a distribution being an HTTP downloadable link a special property with these semantics is deemed necessary, but that won't scale to all possible APIs and data model combinations.
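
To illustrate how options 1 and 2 differ for a consumer, here is a minimal Python sketch. All `ex:` identifiers, type names, and sub-property names are hypothetical placeholders; neither DCAT nor DCMI Terms defines them.

```python
# Option 1: one multi-valued dct:conformsTo; each object carries a type.
# The "ex:" type and property names are hypothetical placeholders.
conforms_to = ["ex:wms-1.3.0", "ex:inspire-data-model", "ex:gml-3.2"]
types = {
    "ex:wms-1.3.0": "ex:Protocol",
    "ex:inspire-data-model": "ex:InformationModel",
    "ex:gml-3.2": "ex:Serialization",
}


def by_aspect(values: list[str], type_of: dict[str, str]) -> dict[str, list[str]]:
    """Group conformsTo values by the type of the object (option 1)."""
    grouped: dict[str, list[str]] = {}
    for value in values:
        grouped.setdefault(type_of.get(value, "untyped"), []).append(value)
    return grouped


# Option 2: the aspect is carried by a sub-property, so no type lookup is needed.
SUBPROPERTY_FOR = {
    "ex:Protocol": "ex:conformsToProtocol",
    "ex:InformationModel": "ex:conformsToInformationModel",
    "ex:Serialization": "ex:conformsToSerialization",
}

option_1 = by_aspect(conforms_to, types)
option_2 = {SUBPROPERTY_FOR[aspect]: vals for aspect, vals in option_1.items()}
```

With option 1 the consumer must dereference or query each object to learn its type; with option 2 the aspect is visible in the predicate itself, at the cost of extra vocabulary.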

bertvannuffelen commented 3 years ago

@smrgeoinfo

> To me the key intention of DataService is that it defines an API to automate interaction, beyond the basic HTTP CRUD operations operating on files that a web browser does (accessURL and downloadURL handle these cases). Humans interact with the dataService through some software application that uses functionality of the DataService API. See #1242 (comment), above

Is then a JSON REST API which just implements HTTP CRUD operations for you a distribution? For me that is actually a prototypical example of a dataservice.

Also, is a web browser a software application for you? Or is it at the level of an operating system?

It is hard for me to draw strict lines; I am searching for good criteria. The ones above were the best guidelines I could define in a few sentences that would lead to a reasonably coherent catalog.

bertvannuffelen commented 3 years ago

@rob-metalinkage

> I think there is an inherent problem with scoping a general property like dct:conformsTo to a specific range - the identifiers of protocol APIs.
>
> Taking on board the multiple different pieces of information identified as required for the user story (independent of the priority assumed), it would appear that the requirement is for one of:
>
> 1. dct:conformsTo be multivalued, and the objects support a type query to work out which value relates to which aspect
> 2. sub-properties of dct:conformsTo for specific aspects
> 3. range of dct:conformsTo to be a qualified association object that describes different aspects
> 4. range of dct:conformsTo to be a canonical model for a composite specification capable of describing all aspects
> 5. do nothing and leave users none the wiser about what dct:conformsTo might actually mean
> 6. continue to restrict semantics of dct:conformsTo to API identifiers and formally publish DCAT as a profile of dcterms to give this restriction and its relationship to dcterms a machine readable home (leaving users none the wiser about the other aspects)
> 7. something else?
>
> This is orthogonal, AFAICT, to the DataService vs. Distribution matter, except that for a distribution being an HTTP downloadable link a special property with these semantics is deemed necessary, but that won't scale to all possible APIs and data model combinations.

It is indeed orthogonal but related, as it is part of the key properties of a data service. Of the options, I would not prefer option 3, as it complicates the usage and the domain model, and people would then reintroduce it as a simple property again. About option 4, I do not fully grasp your intent. Option 6, if I understand correctly, is to make the definition and usage much more restricted.

As a suggestion for 7: leave it to the profile and wait for the practice. This is what we in Flanders have done. We created subproperties of dct:conformsTo to capture specific conformity cases. Any generic DCAT user will of course get into a situation as described in option 1. But that is not a weakness as such. In my opinion, DCAT has taken the route towards covering broad aspects of cataloguing datasets and data services, and therefore it cannot be as precise as a profile.

Nevertheless, this discussion is important because it challenges our expectations for dct:conformsTo. Some people really want to do machine processing on it, others want to restrict it only to "official" standards, others might connect it with local implementation agreements (e.g. following an organisation's REST API guidelines). All are fine with me, and have their place under a generic dct:conformsTo.

riccardoAlbertoni commented 1 year ago

Marked as future work, as this is one of a bunch of issues pertaining to data services (see https://github.com/w3c/dxwg/projects/12) that we might want to reconsider from a new perspective in a future round of standardization.