DataService and DataDistributionService

agreiner commented 6 years ago

We need to clarify the terms DataService, DataDistributionService, and DiscoveryService. In the overview definitions, the parent-child relationships should be clear. We say that a DataDistributionService represents a service that provides access to distributions of datasets and extracts of datasets. If it is providing access to distributions alone, then it seems to me that it is not acting as an API, as we are narrowing the definition of a distribution to not include APIs. Maybe this is just a case where the intention was to allow it to apply to an API that also provides access to bulk downloads. If so, it just needs to be clearer. I also think that DiscoveryService should be a direct child of DataService, as a DiscoveryService does not provide access to distributions; it provides access to DataDistributionServices.
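
For readers following along, the hierarchy in question can be sketched in Turtle. This is an illustration only, not the normative dcat.ttl: the statements reflect the draft hierarchy as it is described later in this thread, and the trailing comment shows how the suggested alternative would differ.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Draft hierarchy as described in this thread
dcat:DataService a owl:Class .

dcat:DataDistributionService a owl:Class ;
  rdfs:subClassOf dcat:DataService .

dcat:DiscoveryService a owl:Class ;
  rdfs:subClassOf dcat:DataDistributionService .

# Suggested alternative (illustrative only): make DiscoveryService a sibling
# of DataDistributionService by replacing the last statement with
#   dcat:DiscoveryService rdfs:subClassOf dcat:DataService .
```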

agreiner commented 6 years ago

I'll also note here that in section 6.10 the usage note still says a distribution can be an API. I think this is just an oversight.

agbeltran commented 6 years ago

Dropping here this link to an example implementing dcat:DataDistributionService provided by @nicholascar:

http://linked.data.gov.au/dataset/gnaf?_format=text/turtle

dr-shorthair commented 5 years ago

@agreiner I have attempted to clean up the description of data services etc in the context of adding some more explicit text around the scope of DCAT (datasets and data services) along with potential for external extensions through further sub-classing of dcat:Resource. I wonder if this also resolves the concerns that you raised in this issue? Please see https://rawgit.com/w3c/dxwg/dcat-issue649-simon/dcat/index.html#dcat-scope and related.

agreiner commented 5 years ago

Thanks, @dr-shorthair for giving that a go. I still think it is difficult to understand the relationship between DataService and DataDistributionService as it is described there in the definitions. Maybe the problem is that we don't come out and say there that a DataDistributionService is a type of DataService, which leaves the reader initially trying to differentiate them since they sound similar. If the reader grasps that DataDistributionService is a type of DataService, then the natural assumption from that point is that a DiscoveryService is the other type of DataService. But, no, it is actually just a specialization of DataDistributionService, though it offers no additional attributes, which makes it seem kind of useless. The description below talks about data transformation services and data processing services, but none of those exists in the vocabulary, so that raises more questions. Why is a discovery service special? After all, it is essentially a DataDistributionService that serves a dataset that is a catalog.

I suspect that we are having trouble describing these things because the relationships are not quite what's needed. It seems odd to me to have an entity that is a specialization of another entity that is itself a specialization of another entity, where neither of the two lower level entities carries more information than its parent entity. The DataDistributionService has the attribute of servesDataset, but a DiscoveryService serves a dataset, too (a catalog). If we think it is useful to distinguish a discovery service from an API, we need to fill out those distinctions in the vocabulary so that they are functionally different. For my money, the real difference is not the type of dataset but the fact that an API is intended for programmatic use and a discovery service is intended for human use through a user agent. Useful general information about an API along those lines might be the type of API or the location of documentation if it is separate from the root endpoint. But I'm having a hard time thinking of general information about a discovery service that isn't covered by being a resource or a DataService already.

kcoyle commented 5 years ago

The definition of DataService says "represents a description of a data service in a catalog. A data service is a collection of operations, accessible through an interface (API)"

Would access through a GUI be included in the use of API here? I don't see a mention of access via browsers but that has been discussed in the past. Perhaps that is dealt with elsewhere?

andrea-perego commented 5 years ago

@agreiner , just to be sure, your proposal would then be to keep just dcat:DataService and drop dcat:DataDistributionService?

On a different note:

"For my money, the real difference is not the type of dataset but the fact that an API is intended for programmatic use and a discovery service is intended for human use through a user agent."

There's also the use case of using a discovery service (as an OAI-PMH or CSW endpoint, or the CKAN API) for metadata harvesting and for other machine-to-machine operations.

About services vs APIs, personally, I find it difficult to tell them apart.

dr-shorthair commented 5 years ago

@agreiner wrote

"Maybe the problem is that we don't come out and say there that a DataDistributionService is a type of DataService, which leaves the reader initially trying to differentiate them since they sound similar."

Right. Figure 1 shows that DiscoveryService is a subclass of DataDistributionService which is a subclass of DataService, but this was not clear in the accompanying text. I've attempted to remedy that now - see https://rawgit.com/w3c/dxwg/dcat-issue432-simon/dcat/index.html#dcat-scope

"DiscoveryService is the other type of DataService. But, no, it is actually just a specialization of DataDistributionService, though it offers no additional attributes, which makes it seem kind of useless"

There is a little more to it. There is an existential restriction (owl:someValuesFrom) at line 302 in the RDF representation that requires dcat:servesDataset to refer to at least one dcat:Catalog. See https://github.com/w3c/dxwg/blob/gh-pages/dcat/rdf/dcat.ttl#L302 :

dcat:DiscoveryService
  rdf:type owl:Class ;
  rdfs:label "Discovery Service" ;
  rdfs:subClassOf dcat:DataDistributionService ;
  rdfs:subClassOf [
      rdf:type owl:Restriction ;
      owl:onProperty dcat:servesDataset ;
      owl:someValuesFrom dcat:Catalog ;
    ] ;
.

I've updated the Usage note to mention this.

The motivation for including DiscoveryService was that it is one of the key services defined by INSPIRE. Nevertheless, this classification can also be achieved using the dct:type property, and the existential restriction mentioned above is a minor detail. So I would not fight too hard if the group preferred to drop the class from the vocabulary, and just explain that a discovery-service is a kind of data-distribution-service which serves one or more catalogs.
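
As a rough sketch of the dct:type route described above, a discovery service could be described without a dedicated class along these lines. The ex: resources are hypothetical, and the INSPIRE code list URI for "discovery" is assumed to follow the same pattern as the INSPIRE URIs used elsewhere in this thread.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

ex:catalog-001 a dcat:Catalog .

# No DiscoveryService class: the service is a DataDistributionService,
# classified through dct:type, that serves a catalog.
ex:discovery-service-001 a dcat:DataDistributionService ;
  dct:type <https://inspire.ec.europa.eu/metadata-codelist/SpatialDataServiceType/discovery> ;
  dcat:endpointURL <http://example.org/api/discovery-001> ;
  dcat:servesDataset ex:catalog-001 .
```

The trade-off is that the "serves at least one catalog" condition is then conveyed by convention and prose rather than by an axiom in the vocabulary.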

However, it would be a mistake to collapse the hierarchy entirely, by dropping DataService, as this provides the extensibility point for other kinds of data services, such as data-processing and transformation services. I've added a usage note to the definition of dcat:DataService pointing this out - https://rawgit.com/w3c/dxwg/dcat-issue432-simon/dcat/index.html#Class:Data_Service .

dr-shorthair commented 5 years ago

Renamed branch to dcat-issue432-simon and edited the links in the previous comment. Please follow revised links from GitHub rather than the email alert.

agreiner commented 5 years ago

@andrea-perego I guess I didn't mean to offer a specific solution, but to point out that what we have now has some problems from my perspective. One option I see is to remove DataDistributionService and DiscoveryService but keep DataService and use dct:type to differentiate the various types of services. Another option is to keep DataDistributionService as a subclass of DataService and add other service types as siblings to DataDistributionService, not children, but only if we can make some useful distinctions in the metadata for the various sibling types.

dr-shorthair commented 5 years ago

DataDistributionService is the rdfs:domain of the property dcat:servesDataset which binds a service to one or more specified datasets. It is also the rdfs:range of the property dcat:accessService. So DataDistributionService does have specific metadata that distinguishes it from the general DataService.
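
The domain and range statements described in this comment can be sketched as follows. Only the two declarations come from the comment; the ex: instance data is hypothetical and is there to show what the rdfs:domain implies under RDFS entailment.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# Declarations as described above (draft at the time of this thread)
dcat:servesDataset a owl:ObjectProperty ;
  rdfs:domain dcat:DataDistributionService .

dcat:accessService a owl:ObjectProperty ;
  rdfs:range dcat:DataDistributionService .

# Hypothetical instance data: typed only as a generic DataService
ex:service-001 a dcat:DataService ;
  dcat:servesDataset ex:dataset-001 .

# Under RDFS entailment, the rdfs:domain above lets a reasoner infer:
#   ex:service-001 a dcat:DataDistributionService .
```

In other words, with these declarations any resource that uses dcat:servesDataset is inferred to be a DataDistributionService, whatever class it was explicitly given.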

agreiner commented 5 years ago

Well, yes, but only because it's been defined that way. Could a DataService of type "dataAPI" or whatnot take those properties?

dr-shorthair commented 5 years ago

@agreiner I don't understand the problem. You asked if there was a useful distinction. I pointed out that a DataDistributionService is distinguished from a generic DataService by being bound to specified Datasets. Every DataDistributionService is a DataService but not vice-versa, so we can identify this subclass on the basis of all its members being tied to one or more Datasets. Since I can explain what distinguishes it from the more general case, the class exists.

I suppose the underlying issue might be whether that particular distinction merits a named class? My feeling was that DataDistributionService is useful because it preserves DataService as an extension point for other kinds of services that are not bound to a specified Dataset.

kcoyle commented 5 years ago

Following up on this I read the sections on DDService and DService. The definition of DDService is confusing to me:

"A site or end-point for discovery, access or processing data or related resources."

Does "processing data" mean "for the processing of data"? If so, it doesn't read that way now ("processing data" could be an adjective/noun pair), so I would suggest changing it to:

A site or end-point for discovery, for access, or for the processing of data or related resources.

[removed section on DDS definition - was looking at old version]

I also agree with @agreiner that it feels odd that there is just this one kind of service that is a subclass of DataService and I wonder if it couldn't somehow be accommodated within DataService. Is there functionality that would be lost (or gained!) if dcat:servesDataset were to be used with DataService?

dr-shorthair commented 5 years ago

How about these definitions -

DS - "A site or end-point providing operations related to the discovery of, access to, or processing functions on data or related resources"

DDS - "A site or end-point that provides access to distributions of datasets"

And we might also add a usage note on DDS - "A Discovery Service is a DDS that provides access to a Catalog" ... and then drop the separate DiscoveryService class.
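
If it helps to see the proposal in context, the definitions and usage note above might be recorded in the vocabulary roughly like this. The use of skos:definition and skos:scopeNote for these annotations is an assumption made for illustration.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

dcat:DataService a owl:Class ;
  skos:definition "A site or end-point providing operations related to the discovery of, access to, or processing functions on data or related resources"@en .

dcat:DataDistributionService a owl:Class ;
  rdfs:subClassOf dcat:DataService ;
  skos:definition "A site or end-point that provides access to distributions of datasets"@en ;
  # The usage note stands in for the dropped DiscoveryService class
  skos:scopeNote "A Discovery Service is a DDS that provides access to a Catalog"@en .
```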

agbeltran commented 5 years ago

Those definitions sound good to me. Should the note say "A Data Discovery Service is a DDS that provides access to a Catalog"?

dr-shorthair commented 5 years ago

I've made some further clarifications in the text of the normative and introductory (scope) sections. Following discussion towards the end of today's DCAT telecon, the named class DiscoveryService has been dropped.

@agreiner and @kcoyle please read these to check if they work for you:

kcoyle commented 5 years ago

Thanks, @dr-shorthair - now I see what the "processing" was about and it's much clearer.

That resolves the issue of the definitions, but I think there hasn't been an answer to @agreiner's question (which I seconded) asking if DataDistributionService couldn't be treated as a DataService without necessitating its own class. Couldn't this be one of the possible values of dcat:endpointDescription?

dr-shorthair commented 5 years ago

@kcoyle I thought that I had addressed @agreiner's concerns here.

The second usage note here points out that dct:type can be used for typing. And IMO DDS is an important enough sub-class that it merits its own name - see also the first usage note here

kcoyle commented 5 years ago

I'll let @agreiner answer whether she thinks her concerns are addressed.

Yes, I saw that dct:type can be used for typing, which is why I thought that DDS could be a type (although I note that there isn't a service type vocabulary that would include it). I also asked if other services cannot be "bound to one or more specified Datasets" within a catalog. To me it sounds logical that they could be, and DDS is described as "A DataDistributionService is usually bound to one or more specified Datasets..." - note "usually", so this isn't a deciding factor for DDS, just a common option. Perhaps I am missing something because I'm not very familiar with the use of catalogs for datasets.

I was confused by the 2nd usage note in DDS because it refers to a Data Discovery service as a class and the change history in E refers to it as a class, but I don't see it in the list of classes. There's an example with DiscoveryService in D.3. Is that something that hasn't been completed in this draft? The change history makes it sound like it has been done so that was puzzling.

dr-shorthair commented 5 years ago

The point about a DDS is that its functionality is limited to enabling download of 'existing' discrete datasets, or extracts/subsets of them. It doesn't usually do any other processing, such as re-sampling or interpolation or combination (though it is easy to imagine specializations that could). This is such a common case of DS that we thought it merited its own class on that ground alone. Would the addition of the word 'existing' in the definition do the trick for you?

In the majority of cases, a DDS is bound to one or more known datasets. However, there are federated services for which the served datasets are only known transitively, though the operation signature is otherwise that of a 'normal' DDS. I suppose we could put a cardinality constraint that there is at least one dcat:servesDataset.
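
For illustration, a cardinality constraint of the kind floated above could be written as an OWL restriction like the one below. This is a sketch of the idea, not something taken from the vocabulary.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Every DataDistributionService serves at least one dataset
dcat:DataDistributionService rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty dcat:servesDataset ;
    owl:minCardinality "1"^^xsd:nonNegativeInteger
  ] .
```

Under OWL's open-world semantics such an axiom asserts that a served dataset exists; it does not by itself reject a description that omits dcat:servesDataset, which is one reason the federated case is awkward to handle this way.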

I've changed 'data discovery service' to lower case, to suppress any implication that it is a named class. The DiscoveryService class was in earlier drafts. In today's DCAT meeting we decided to drop it, but there were still some remnants in the document - I think I've removed them all now.

kcoyle commented 5 years ago

@dr-shorthair Thanks! Adding "existing" would be helpful, yes. It helps explain the difference between the DDS and the DS.

As for adding the cardinality constraint... As a Dublin Core "simple core" person I'm generally reluctant to bake rules into a vocabulary since it's hard to anticipate future needs. This sounds like an instruction or bit of advice that would be perfect for a primer. And if there is interest in doing a primer, I should check to see if it would have to happen on our watch or if a future DCAT community group could be responsible for it, creating it as a community group note. That would take the pressure off of this group. (That's a note to self...)

dr-shorthair commented 5 years ago

Yeah - me too on the cardinality constraints. As I pointed out above, while a reference to an 'existing' dataset might be expected, a federated service might not be able to do that so easily ...

I hope we've bottomed this out now? See #656

kcoyle commented 5 years ago

Can we hear from @agreiner if she's happy with the result? That would make it final, I think.

agreiner commented 5 years ago

I still find it odd to have both DS and DDS. Simon's explanation makes some sense, and I understand that a data distribution service whose job it is solely to deliver datasets and parts thereof can be seen as a subset of all data services, but I'm wondering how that distinction is useful, since we don't offer ways of describing the alternatives. We can say among ourselves that they exist, but that doesn't help someone trying to describe one of these services understand how to use DCAT to do that. If we know they exist, why have we made no attempt to describe them in our vocabulary?

Moreover, I think we are still in need of a clear way to indicate that something is an API. Suppose I have an API that provides data about protein sequences. Users can download data about protein sequences, but they specify what they want by entering some existing sequence and specifying that they want to look for similarity. Is that a DDS? It is doing processing on the back end to determine what to return (much more than just a database search), but it is only returning existing data. If I later switch it to using precomputed data, do I have to change how I describe it? Does it matter if it is reachable only by running a command-line program? Can a web app with a graphical UI be a DDS? What about a web site that has links to datasets for direct download? If someone switches it from html-only to a database-backed service, does it need to be described differently? Suppose I have another API that offers bus route information. Does it change from being a DDS to something else if I add a feature to compute estimated arrival times?

If we want to make DCAT work well for APIs, why don't we have a clear term for that? I would really like to see DCAT offer a way to clearly label something as an API without getting into gray areas about what it's serving.

agbeltran commented 5 years ago

@agreiner would you be able to join us on the DCAT call to discuss this as we need to finalise this part of the spec?

agreiner commented 5 years ago

Okay, I'll be there this afternoon. -Annette

Annette Greiner NERSC Data and Analytics Services Lawrence Berkeley National Laboratory

On Jan 23, 2019, at 6:54 AM, Alejandra Gonzalez-Beltran notifications@github.com wrote:

@agreiner would you be able to join us on the DCAT call to discuss this as we need to finalise this part of the spec?


riccardoAlbertoni commented 5 years ago

I have noticed the following sentence in the DCAT document:

"To extend the scope of service descriptions beyond data distribution services it is recommended to define additional sub-classes of dcat:DataService in a DCAT application profile or other DCAT application."

I wonder if we could be a little less restrictive here. For example, in certain cases, the proposal made by @andrea-perego of referring to an external code list sounds like a very reasonable alternative to subclassing (see last night's DCAT call). Could we give both options as suggestions for extension and let the users decide which is more suitable for them?
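
The two extension options mentioned above might look roughly like this side by side. Everything in the ex: namespace is hypothetical, including the ex:TransformationService class, and the INSPIRE "transformation" code list URI is assumed to follow the pattern of the INSPIRE URIs used elsewhere in this thread.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# Option 1: an application profile defines a named sub-class
ex:TransformationService a owl:Class ;
  rdfs:subClassOf dcat:DataService .

ex:service-007 a ex:TransformationService ;
  dcat:endpointURL <http://example.org/api/transform-007> .

# Option 2: keep the generic class and classify with an external code list
ex:service-008 a dcat:DataService ;
  dct:type <https://inspire.ec.europa.eu/metadata-codelist/SpatialDataServiceType/transformation> ;
  dcat:endpointURL <http://example.org/api/transform-008> .
```

Option 1 gives reasoners a class to work with, while option 2 needs no new vocabulary terms and can reuse code lists that a community already maintains.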

agreiner commented 5 years ago

Now that I've been charged with making a formal proposal, it would help to have a copy of the original UML diagram in some non-bitmapped form. Is that available? I'm not seeing anything in Git.

dr-shorthair commented 5 years ago

I prepared the diagram using Sparx Enterprise Architect. This allows for quite nice customizable diagrams, and was convenient for me as I have used it in the past and already have it on my machine. It uses a proprietary format (in an Access database) and can export bitmapped images only - sorry.

I can share the .EAP file if you want to use that. (It is not in the GitHub repo since it is not diff-able.)

dr-shorthair commented 5 years ago

@riccardoAlbertoni - yes, that is fair enough. The dct:type classifier enables specialization without requiring named sub-classes. Could you make a PR?

agreiner commented 5 years ago

@dr-shorthair thanks, but it sounds like I won't be able to do anything with that proprietary file format. I'll mock up my suggestion in something else.

agreiner commented 5 years ago

Okay, so, I've got something to look at over at https://agreiner.github.io/dxwg/dcat/. This is mocking up how I imagine it would look if we were to make use of the distinction between an API and a web application to define subclasses of data services. I've also uploaded an SVG for the overview UML diagram, so future iterations shouldn't need to be redrawn from scratch. I'm still debating in my mind whether we might want to add a property architecturalStyle that takes literal text. I've been unable to find any attempts to standardize such a thing. Yet, it's probably the first question any programmer would ask when deciding whether they want to dig into a given API (after "Is it an API or just a web site"). One can use conformsTo to indicate a standard, but there is no real standard for the most common type of API on the web (REST). If anyone knows of a reasonable codification of architectural styles, that would be great to check out.

andrea-perego commented 5 years ago

@agreiner said:

I'm still debating in my mind whether we might want to add a property architecturalStyle that takes literal text. I've been unable to find any attempts to standardize such a thing. Yet, it's probably the first question any programmer would ask when deciding whether they want to dig into a given API (after "Is it an API or just a web site").

I wonder whether this is addressed by API descriptions as the ones done with Swagger/OpenAPI, etc. In such a case, this information can be part of the "document" linked to by using dcat:endpointDescription.
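
A minimal sketch of that approach, assuming a hypothetical service whose OpenAPI document lives at an invented example.org URL:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

ex:sequence-api a dcat:DataService ;
  dct:title "Sequence API"@en ;
  dcat:endpointURL <http://example.org/api/sequences> ;
  # Machine-readable description, e.g. an OpenAPI document (hypothetical URL)
  dcat:endpointDescription <http://example.org/api/sequences/openapi.json> .
```

With this pattern the architectural style is not stated in the catalog record itself; a client has to dereference the description to find it.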

agreiner commented 5 years ago

Yes, that is my current design, though what I've been debating in my mind is whether it wouldn't be useful to make it so that a user wouldn't have to follow a link to determine the basic architectural style. BTW, in my design, I'm calling documentation endpointDocumentation rather than endpointDescription. I think it's easier for people to understand the intention, and that it is a URL.
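
To make the idea concrete, a record that states the style inline might look like the sketch below. The property ex:architecturalStyle is hypothetical and deliberately kept out of the dcat: namespace, dcat:endpointDocumentation is the name used in the design described above rather than an agreed DCAT term, and all URLs are invented.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

ex:route-api a dcat:DataService ;
  dcat:endpointURL <http://example.org/api/routes> ;
  # Proposed name for the link to human-readable documentation
  dcat:endpointDocumentation <http://example.org/api/routes/docs> ;
  # Hypothetical free-text property; no standard code list is assumed
  ex:architecturalStyle "REST" .
```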

riccardoAlbertoni commented 5 years ago

dr-shorthair wrote:

@riccardoAlbertoni - yes, that is fair enough. The dct:type classifier enables specialization without requiring named sub-classes. Could you make a PR?

Yes, I will work on it

agbeltran commented 5 years ago

thanks @agreiner for your PR #778 related to this

we discussed in the DCAT group that it would be good to see examples of data services to show how the two proposed representations would deal with them - would you be able to provide an example of your representation @agreiner ? e.g. the equivalent to the examples already available for the current representation (e.g. https://w3c.github.io/dxwg/dcat/#a-dataset-available-from-a-service)

I also noted that a proposal for WebAPI will probably be available in schema.org soon (see https://github.com/schemaorg/schemaorg/issues/1423) and it is based on https://webapi-discovery.github.io/rfcs/rfc0001.html - this is more generic but the DataAPI would fall under this

agreiner commented 5 years ago

This is in the PR already, but it would look something like this.

Example 12

:dataset-004
  rdf:type dcat:Dataset ;
  dcat:distribution :dataset-004-csv ;
  dcat:distribution :dataset-004-png ;
.
:dataset-004-csv
  rdf:type dcat:Distribution ;
  dcat:accessService :table-service-005 ;
  dcat:accessURL <http://example.org/api/table-005> ;
  dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> ;
.
:dataset-004-png
  rdf:type dcat:Distribution ;
  dcat:accessService :figure-service-006 ;
  dcat:accessURL <http://example.org/api/figure-006> ;
  dcat:mediaType <https://www.iana.org/assignments/media-types/image/png> ;
.
:figure-service-006
  rdf:type dcat:DataAPI ;
  dct:conformsTo <http://example.org/apidef/figure/v1.0> ;
  dct:type <https://inspire.ec.europa.eu/metadata-codelist/SpatialDataServiceType/view> ;
  dcat:endpointDocumentation <http://example.org/api/figure-006/params> ;
  dcat:endpointURL <http://example.org/api/figure-006> ;
  dcat:servesDataset :dataset-004 ;
.
:table-service-005
  rdf:type dcat:DataAPI ;
  dct:conformsTo <http://example.org/apidef/table/v2.2> ;
  dct:type <https://inspire.ec.europa.eu/metadata-codelist/SpatialDataServiceType/download> ;
  dcat:endpointDocumentation <http://example.org/api/table-005/capability> ;
  dcat:endpointURL <http://example.org/api/table-005> ;
  dcat:servesDataset :dataset-003, :dataset-004 ;
.

I like the detail in the schema.org proposal, though much of what is there is already covered by the various properties of a resource in DCAT. It's been my understanding that the DCAT subgroup did not want to extend the vocabulary to information already contained in API documentation, since we can link to it, and it expands the scope of DCAT otherwise.

dr-shorthair commented 5 years ago

In the DCAT meeting today we agreed to collapse DDS into DS, and see if we can call it done for the current iteration of DCAT. https://www.w3.org/2019/03/13-dxwgdcat-minutes#x05

There was much discussion of whether (web) data-applications are also services or something else, what the relationship between UIs and APIs is, whether they are the same service, how many entries there should be in a catalogue, and whether a landing page is a UI. We agreed to defer these questions to the next phase of DCAT (i.e. after this revision). This also relates to #778.
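
As a sketch of what the collapse means for a catalog record, the dataset binding would sit directly on the generic class. This assumes that dcat:servesDataset and dcat:accessService simply move from DataDistributionService to DataService; the resources are hypothetical and modelled on the examples earlier in this thread.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

ex:dataset-004 a dcat:Dataset ;
  dcat:distribution ex:dataset-004-csv .

ex:dataset-004-csv a dcat:Distribution ;
  dcat:accessService ex:table-service-005 ;
  dcat:accessURL <http://example.org/api/table-005> .

# After the collapse, the generic class carries the dataset binding
ex:table-service-005 a dcat:DataService ;
  dcat:endpointURL <http://example.org/api/table-005> ;
  dcat:servesDataset ex:dataset-004 .
```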

davebrowning commented 5 years ago

The proposal is in branch annette-dataservices with attendant deferred PR #816, where the details can be reviewed.

davebrowning commented 5 years ago

It was agreed in earlier discussions (e.g. #825, #908) that this dialog was a useful starter for potential future work but that it couldn't be completed within the time available for DCAT2. There is also some work drafted in this branch which might help stimulate a future development.

dr-shorthair commented 4 years ago

@agreiner is this issue still live? Are there things in your old pull-request that need to be considered? If so, maybe we home in on those in another issue, and close this issue since the major issue (DDS vs DS) was resolved long ago now?

agreiner commented 4 years ago

I think we can close this one.


DataService and DataDistributionService #432