w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Distributions, services and implementation-resources #411

Closed dr-shorthair closed 5 years ago

dr-shorthair commented 5 years ago

Continuing the conversation that has broken out on (closed) issue #52

dr-shorthair commented 5 years ago

Content copied over from #52:

agreiner commented 6 hours ago I don't disagree with the text here, but I think it worth pointing out that it is a bit paradoxical with respect to what some of us have been asserting with regard to profile negotiation. "The definition text of dcat:Distribution has been revised to clarify that distributions are primarily representations of datasets. As such, all distributions of a given dataset should be informationally equivalent. " Here, it is assumed that representations of a dataset are informationally equivalent, but profile negotiation would return datasets that are not informationally equivalent, because different profiles may include different subsets of the dataset. My preference is to keep distributions informationally equivalent and ask ourselves if there is a way to make it clear that profile negotiation does not deliver informationally equivalent responses.

@rob-metalinkage Member rob-metalinkage commented 6 hours ago • Narrowing the scope, as proposed, breaks backwards compatibility with existing DCAT implementations.

Services that support queries against a dataset are never "informationally equivalent".

If this restricted view is held, then any distribution that supports accessing a file remotely is, by definition, not a dcat:Distribution but a dcat:DistributionService. Just the basic web architecture of allowing an HTTP HEAD request is sufficient to break information equivalence, and content negotiation over language or MIME type also does. Different formats are not informationally equivalent - for example, a CSV file loses relationships between attributes compared to complex properties:

CSV:

```
id,value1,units1,value2,units2
1,2.3,"m/s",6.7,"kg"
```

vs JSON:

```
{
  "id": 1,
  "value1": { "value": 2.3, "units": "m/s" },
  "value2": { "value": 6.7, "units": "kg" }
}
```

CSV holds less information because value1 and units1 need further out-of-band information to be related to each other.

So - unless you can come up with a robust statement about the testability of information equivalence, it strikes me as a slippery slope with no huge value.

OTOH, making an explicit statement that Distributions may not be informationally equivalent seems quite valuable, and makes treating services as equivalent to distributions logically consistent.

@rob-metalinkage Member rob-metalinkage commented 5 hours ago Further to that - if we know what profiles each distribution and/or service supports, perhaps it's up to the profiles to be described in a way that makes informational equivalence visible - for example, maybe what's really required is an implementation resource to transform one profile into another.

Use cases for reliance on information equivalence would seem to be missing - I think you would really need to find evidence for such a need.

@agreiner Member agreiner commented 5 hours ago You are right that CSV can offer less information than JSON, and is particularly likely to do so if there is relational information to be shared, though I would argue that your CSV example shows the relationship between the two values by including them on the same line. Clearly, one can publish informationally equivalent data in both formats, and one can also make the mistake of dropping information when translating from one to the other. In any guidance document, one might caution publishers to avoid CSV structures that drop relationships. One might also caution them against dropping entire rows from a CSV, but one would not then assume that CSV needs to be treated as a form that is inconsistent in informational content. A little googling shows me two definitions of informational equivalence: (1) information is equivalent if all the information in one representation can be inferred from the other, and (2) information is equivalent if the same tasks can be performed with both. I don't claim to be an expert in information theory (a MIMS degree notwithstanding), but this doesn't seem an intractable problem. (ref: https://books.google.com/books?id=A8TPF_O385AC&pg=PA66&lpg=PA66&dq=%27informationally+equivalent%27&source=bl&ots=fmVHmOjTXb&sig=sCSaAP1nfL8r-TKebXCNnZUvFyU&hl=en&sa=X&ved=2ahUKEwic5bef1tTdAhXzITQIHU_9C-0Q6AEwAnoECAgQAQ#v=onepage&q='informationally%20equivalent'&f=false).

@agreiner Member agreiner commented 5 hours ago I can think of several use cases for equivalence of informational content. If two different users wish to avail themselves of data provided from an API, they may each have existing ingest tools that handle data in different serializations. Neither would want to spend time reworking their tool to handle the other serialization. Another is reproducibility: comparing data from different analyses to determine whether one should expect them to reach similar conclusions.

@rob-metalinkage Member rob-metalinkage commented 4 hours ago Do we have some conflicting perspectives, @makxdekkers? I think somewhere you argued that using DCAT 1.0 to catalogue the DCAT-AP and its distribution resources should be validly backwards compatible, but these resources are not informationally equivalent (if we agree that either of the definitions found by @agreiner is reasonable).

I think we would need to formalise the use case and agree on its requirements, and would need the existing approaches to be populated to show that there are cases where we need to assert information equivalence. I think the general concern raised by @agreiner could be handled better by profile descriptions; particularly given the nuances of transformation that might exist in different contexts, it would be hard to define a specific model and enforce it for all past DCAT usage.

@agreiner Member agreiner commented 4 hours ago Uh oh - thinking this through a bit more, I'm starting to wonder what the difference between a Distribution and a DistributionService would really be. Both deliver a series of TCP packets that become a file when assembled back on the client's system. Both involve downloading something from a URI. One can build a simple REST API by simply posting JSON files under URIs that show the relationships between them. A REST API does in fact deliver representations of datasets that are transported as files. Hm.

@dr-shorthair Member dr-shorthair commented 3 hours ago DataDistributionServices, like instances of OGC's Web Feature Service, do not appear to be the same as a Distribution, at least not to users. A WFS accepts a query and responds with a file. These kinds of services have a long history, predating REST theory.

I see that fully RESTful 'services' resemble distributions, because of the resource-oriented way that you address them. However, there is still a challenge in that the set of resources (distributions) available from many services is combinatorially large. Describing this set as a 'service' is at the very least a pragmatic solution to this - otherwise a catalog would be overwhelmed by the enumerated listing of the resources available from it. The 'service' in this case is the set of potential datasets/distributions that might be constructed through selection of the various query parameters.

davebrowning commented 5 years ago

This is a conversation that's been lingering on the edges for a while, so it's really good that it's surfaced.

If we ignore services for a moment and just consider the DCAT2014 style of use, having distributions that aren't equivalent immediately raises the question of how a consumer chooses between them. It's true that some formats make it easier to express certain characteristics than others, so there is a challenge for the publisher to be very careful here about what datasets (i.e. information) are really being published. In at least one of the uses we have internally (not referenceable at this point, unfortunately), we've been looking at using dataset subset relations as a way through this - I'll see if I can provide a coherent use case for it.

Once you introduce services that go beyond "download the whole thing", then we've looked at that as a dynamic subset - in effect a slice of some underlying distribution which varies according to the needs of the consumer. That allows the publisher to describe the whole dataset, which might be downloadable in one form or shareable via a cloud-based bucket in another (for example), as well as providing access services on the same information via some interface. If the consumer uses a selection interface then they just get a subset, allowing them (ideally) to trade off completeness against ease of use. The consumer can rely on all access paths giving access to the same information - the selection control is in the hands of the consumer.

But to agree with @rob-metalinkage's point, we do need some documented use cases for this. Unfortunately our use of DCAT hasn't filtered through to our available/public services quite yet, but I'll see what I can do.

makxdekkers commented 5 years ago

I understand that it's not that easy to say something about the content of distributions. My initial question was about a real-world case where distributions did not contain the same data, with distributions under one dataset containing data for different individual years. I do understand that 'informationally equivalent' could be taken to mean exactly the same, so maybe it's too strong.

makxdekkers commented 5 years ago

Could we try to use some examples? My understanding of 'informationally equivalent' was that a dataset should not have distributions with different 'coverage' -- so all distributions for a dataset with temporal coverage 2010-2015 should contain data for the whole range of years, and not one for 2010, one for 2011, etc., as allowed for CKAN "resources". Likewise, a dataset of map data of a particular country should not have distributions that are maps of individual provinces. Where I feel uncertain is whether you could have distributions under one dataset that, while covering the same period/area/observation etc., have a different level of detail -- for example a dataset that is a map of a country having distributions in different map scales, e.g. 1:1.000.000, 1:100.000 etc.

dr-shorthair commented 5 years ago

The matter of 'information equivalence' of 'Distributions' has come up in a couple of conversations I'm having:

In DCAT I believe we encourage the view that the description of the Dataset captures all the semantics, and the description of the Distribution is merely serialization mechanics. All of these cases could be accommodated by that view, with (for example) each band of an image conceived as a distinct Dataset (as long as we provide a robust mechanism for relating datasets to each other). But this extra level of indirection complicates the mapping to actual running dataset catalogues, and would probably not be acceptable to the communities that run those.

I wonder if we need to take another look at this 'information equivalence' argument with these use-cases in mind.

It might be accommodated by introducing an alternative predicate to relate a Distribution to its Dataset - e.g. alongside

maybe also have

The latter would also require some semantic information on the distribution to describe which aspect (e.g. spectral-band, time-slice, spatial-tile, dimension) of the full Dataset is included in the particular Distribution.

makxdekkers commented 5 years ago

@dr-shorthair I really like the approach you're proposing. I can see it solving a lot of the problems I've seen -- including people attaching per-year data to a multi-annual dataset.

makxdekkers commented 5 years ago

@dr-shorthair Could it also work for granularities? E.g. the map of a region in different scales?

smrgeoinfo commented 5 years ago

Maybe dcat:derivedDistribution for distributions that serialize (for spatial data) different scales or map projections, upsampling or downsampling. dcat:serviceDistribution would be an endpoint that supports parameterized requests for filtering, subsetting, and maybe things like dynamic visualization. ERDDAP and THREDDS servers, and OGC servers, would support this kind of access.

I like the idea of an operational test for 'information equivalence' between representations A and B: that one can transform A to B and the resulting B back to A, and get (with acceptable computational approximations for numeric data) the same A (all data elements are present and have the same values).

One suggestion from a related conversation is that a distribution should include a property specifying the nature of the relationship of the offered serialization to a 'canonical representation', e.g. resampled, anonymized, reprojected.
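As a rough sketch of that round-trip test (everything here is illustrative - the column layout and the pairing of value/units fields are assumptions for the example, not anything defined in the DCAT draft):

```python
import csv
import io

def csv_to_json(text):
    # The pairing of value1/units1 etc. is out-of-band knowledge that the
    # transformation itself has to supply (Rob's point above).
    rows = list(csv.DictReader(io.StringIO(text)))
    return [{"id": r["id"],
             "value1": {"value": r["value1"], "units": r["units1"]},
             "value2": {"value": r["value2"], "units": r["units2"]}}
            for r in rows]

def json_to_csv(records):
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["id", "value1", "units1", "value2", "units2"])
    for r in records:
        writer.writerow([r["id"],
                         r["value1"]["value"], r["value1"]["units"],
                         r["value2"]["value"], r["value2"]["units"]])
    return out.getvalue()

# A -> B -> A yields the same A, so by this operational test the two
# representations carry the same information.
src = "id,value1,units1,value2,units2\r\n1,2.3,m/s,6.7,kg\r\n"
assert json_to_csv(csv_to_json(src)) == src
```

Note that the round trip only succeeds because the transformation itself encodes the column pairing; a consumer given the bare CSV has no in-band way to recover it, which is exactly the equivalence gap being debated.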

dr-shorthair commented 5 years ago

@smrgeoinfo

dcat:serviceDistribution would be an endpoint that supports parameterized requests for filtering, subsetting, and maybe things like dynamic visualization. ERDDAP and THREDDS servers, and OGC servers, would support this kind of access.

Is that the inverse of dcat:servesDataset, which links from a dcat:DataDistributionService to a dcat:Dataset? See the topology in Figure 1, and the examples at https://w3c.github.io/dxwg/dcat/#a-dataset-available-from%20a%20service and https://w3c.github.io/dxwg/dcat/#data-service-examples

smrgeoinfo commented 5 years ago

Yes, I think it would be.

dr-shorthair commented 5 years ago

@smrgeoinfo The DataService class, and its sub-classes, is perhaps the key innovation in DCAT-rev. It seemed to most of us that an API is sufficiently different from a representation (serialization) that it was worth treating them separately.

Note that we have had comment elsewhere (from Clemens on the comments email list here) pushing back on this, primarily on the grounds that this implies an enlargement of scope of dcat:Catalog beyond just listing datasets.

(Link to email added by @davebrowning - issues #530 and #531 tracking further discussion/resolution)

smrgeoinfo commented 5 years ago

Isn't a dcat catalog just one more variety of registry? The important thing is the information model for the descriptions of items in the catalog (i.e. the subtypes of resource). The interesting thing to me about DCAT is not Catalog but Dataset, and adding DataService is an excellent improvement. Maybe there is an argument that the vocabularies for describing different kinds of resources should be in different namespaces, but that's a stewardship question, and given the small footprint of the DataService extension (5 predicates and 3 objects), a whole new vocabulary would just be a lot of extra work.

I'd just like to see that the scope of DataDistributionService is not just APIs but also includes web applications for slicing and visualizing a dataset, like ERDDAP and THREDDS offer.

pwin commented 5 years ago

@smrgeoinfo - aren't these DAP/2 and similar sophisticated APIs for slicing and dicing? I don't think that the web application, the front end bit, should be included, whereas a specialised API should be in scope.

smrgeoinfo commented 5 years ago

It's just that those front-end apps are built into the servers that implement the API, and I suspect (unfortunately I don't have hard data...) that a lot of users use those web apps to get the data they need, so it's really useful to be able to provide links to them in metadata; currently in the ISO world the distribution is most commonly used for that. Any ERDDAP, THREDDS, or OPeNDAP users out there want to comment?

agreiner commented 5 years ago

We use PyDAP as a handy way of making custom slices of HDF5 files shareable. It gets used particularly heavily by climate researchers. I agree that the collections we make available with that are well described as DataServices. The difference in my mind is that they are meant for human consumption rather than programmatic consumption.

smrgeoinfo commented 5 years ago

Reviewing the current draft, I see that this issue is linked to a comment in the 6.7 Class:Distribution section that says

The intention of the phrase "informationally equivalent" needs to be clarified,
 in particular as different serializations may have different expressivity.

Looking back over the discussion (and assuming that we accept DataService as a valid resource type for a dcat:Catalog), Simon's comment (above) starts in a good direction, I think. There are several relationships between the dataset as a work and the various ways data providers provide access:

  1. Information equivalent representations of a dataset as files, where equivalence is based on something like this transformation test: can Representation A be transformed to B, and that result transformed back to A to obtain a representation that functions identically for all applications of representation A. The simple Distribution class with a DownloadURL accounts for this.
  2. Component distribution: a dataset is represented in a set of files partitioned based on spatial, temporal, or thematic extent, e.g. tiles to cover a large geographic region, files containing annual data for a long-term time series, or different wavelength bands for a remote sensing image. In this case there should be some way to indicate the granularity (extent) of each component part and how to access each individually; this could be a distribution for each part, or a URL template. It's not unusual for the metadata distribution link to point to an ftp directory, leaving it up to the user to figure out the file names to get what they need.
  3. Web services with query parameters that enable filtering, subsetting the data, or transformation dynamically.
  4. Derivative distribution: the source dataset is transformed in some documented process to produce a different packaged representation, e.g. temporal down sampling, different spatial scales or spatial reference systems, anonymization.
  5. End points that provide applications for human users to visualize/browse the data; could be an accessService, or perhaps better represented as a related resource (dct:relation) instead of a distribution?

If approaches 2, 3, 4, and 5 are treated as dcat:DataDistributionServices, then the dct:conformsTo, dct:relation, and dct:type properties could be used to provide the necessary information to distinguish these various kinds of distribution/access.
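A toy sketch of that idea - the kind labels, the ex: names, and the plain-tuple triple encoding are all hypothetical illustrations, not part of the draft:

```python
# Hypothetical: distributions/services annotated with dct:type to
# distinguish the five access patterns enumerated above.
graph = {
    ("ex:distCSV",     "dct:type", "ex:EquivalentRepresentation"),  # kind 1
    ("ex:distTiles",   "dct:type", "ex:ComponentDistribution"),     # kind 2
    ("ex:wfsEndpoint", "dct:type", "ex:QueryService"),              # kind 3
    ("ex:dist10k",     "dct:type", "ex:DerivativeDistribution"),    # kind 4
    ("ex:viewerApp",   "dct:type", "ex:BrowseApplication"),         # kind 5
}

def access_kind(resource, graph):
    """Return the declared access kind of a distribution/service, if any."""
    for s, p, o in graph:
        if s == resource and p == "dct:type":
            return o
    return None

assert access_kind("ex:distTiles", graph) == "ex:ComponentDistribution"
```

A consumer could then branch on the declared kind - e.g. only offering "download" buttons for kind 1, and a query form for kind 3 - without the catalog having to enumerate every derivable resource.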

Note this issue is closely related to #145

dr-shorthair commented 5 years ago

@smrgeoinfo asked:

Isn't a dcat catalog just one more variety of registry?

Not quite. The way it is modelled, the dcat:Catalog is essentially just the contents - a list of catalogued resources. It is a subclass of dcat:Dataset. Some governance and lifecycle factors are supported by the dcat:Records that are optionally included, but I would argue that a lot more is needed to make a registry.

Note also in Figure 1 that dcat:DiscoveryService is related to contents that it exposes by the value of its dcat:servesDataset property - this was one of the reasons for the subclass relationship between dcat:Catalog and dcat:Dataset.

andrea-perego commented 5 years ago

@makxdekkers said:

@dr-shorthair I really like the approach you're proposing. I can see it solving a lot of the problems I've seen -- including people attaching per-year data to a multi-annual dataset.

I think we must be very careful not to overcomplicate the use of DCAT. Making explicit the different levels of granularity, accuracy, and spatio-temporal coverage of distributions may be relevant in some specific use cases, but it is more common that metadata maintainers don't have this information.

I would be more in favour of extending the existing approach: keep using dcat:Distribution (as everybody is doing now), and additionally specify information concerning granularity, etc. (which is now done only at the level of dcat:Dataset). This would be more backward compatible, and it would make it possible to extend existing records accordingly.

makxdekkers commented 5 years ago

@andrea-perego That is a sensible alternative. There could indeed be an option to indicate spatial and temporal coverage for each Distribution if different from the coverage of the Dataset as a whole. For other types of granularity, accuracy, map scales, zoom levels, wavelengths, I don't know if there are any existing properties that we could include -- and maybe that goes too much into domain-specific aspects, and could be left to profiles.

andrea-perego commented 5 years ago

@makxdekkers , we are actually missing a way to indicate a number of aspects of data quality (here intended in its general sense), and indeed some of them can be considered as domain-specific, so it may be better in scope of profiles of DCAT. One of them is the Coordinate Reference System (CRS), which is one of the key pieces of information for geospatial data. This is supported in GeoDCAT-AP, where this information is normally specified at the level of dataset, but can be associated also with the distribution (in this case, the dataset is made available in different distributions, each using a different CRS).

For more generic aspects (precision / accuracy / level of spatial/temporal resolution), what we have at the moment is documented in some of the examples of the DQV spec (see https://www.w3.org/TR/vocab-dqv/#ExpressDatasetAccuracyPrecision), others could be possibly taken from RDF Data Cube.

dr-shorthair commented 5 years ago

@smrgeoinfo As a consequence of the discussion in #432, and some other editorial work, the definition of dcat:Distribution has been tightened up.

Please look at https://rawgit.com/w3c/dxwg/dcat-issue432-simon/dcat/index.html#dcat-scope (the dot-points above Figure 1, and also the paragraphs below the figure) and also https://rawgit.com/w3c/dxwg/dcat-issue411/dcat/#Class:Dataset and https://rawgit.com/w3c/dxwg/dcat-issue411/dcat/index.html#Class:Distribution.

Does this address the concerns you raised in https://lists.w3.org/Archives/Public/public-dxwg-comments/2018Nov/0003.html ?

makxdekkers commented 5 years ago

Maybe I missed the discussion about this, but the new definitions in the bullet points above the diagram in https://rawgit.com/w3c/dxwg/dcat-issue432-simon/dcat/index.html#dcat-scope now say "dcat:X represents a description of a X" while 2PWD had "dcat:X represents a X". I think the 'a description of' is not correct. The RDF statements about X are a description in which dcat:X represents X.

smrgeoinfo commented 5 years ago

Yes the issue432 edits address the concerns from the list posting, thanks!

@makxdekkers, I guess the question is are the items in a catalog datasets or descriptions of datasets?

In the discussion below Figure 1 in the issue432 branch, the paragraph after 'A data service...' begins thus: Datasets and data services can be included in a catalog. A catalog is a kind of dataset whose member items are themselves the descriptions of datasets.

for consistency, that should read either "Datasets and data services can be included in a catalog. A catalog is a kind of dataset whose member items are datasets or data services. "

or "Descriptions of datasets and data services can be included in a catalog. A catalog is a kind of dataset whose member items are descriptions of datasets or data services. "

Either perspective can be made to work; it should just be consistent.

dr-shorthair commented 5 years ago

I agree with @makxdekkers that in the first mention (above the diagram) I went too far - 'represents' and 'description' essentially say the same thing, so doubling up makes it all a bit too 'meta' ...

@smrgeoinfo I'm inclined to go with the second formulation - the catalog is not a repository so it is not the things themselves that are found there, but descriptions of them. The various xxxURL properties take you to the things themselves.

smrgeoinfo commented 5 years ago

@dr-shorthair +1 on your suggestions

makxdekkers commented 5 years ago

Same here, +1 to @dr-shorthair

dr-shorthair commented 5 years ago

Following https://github.com/w3c/dxwg/pull/735, https://github.com/w3c/dxwg/pull/299, https://github.com/w3c/dxwg/pull/241, etc., I think this issue is now dealt with.

agreiner commented 5 years ago

So, where are we with regard to enabling a user to determine whether data they are sourcing online is the same data available elsewhere or not? Now that we seem to have an agreement that distributions are not necessarily informationally equivalent, how can a user determine whether one distribution is equivalent to another? This thread exposed at least three use cases relating to this question.

  1. A user finds a catalog entry with multiple distributions. How do they decide which to download?
  2. A user finds out about a colleague's use of data in one serialization and wants to obtain the same data in a different serialization. How do they know that they are getting equivalent data?
  3. A scientist wants to reproduce work by another scientist but doesn't have access to the same data source. They find something that seems to be the same in a data catalog. How do they ensure that they are in fact using the same data?

makxdekkers commented 5 years ago

@agreiner My take is that, once we decided that distributions are not necessarily informationally equivalent, there is no way you can make sure that it is exactly the same data. The default position would be that it is not the exact same data. If you want to be sure to use the exact same data, then you need to use the same file. Would you suggest creating a property to link two distributions (under the same dataset or under different datasets) that says "this is exactly the same data"?

andrea-perego commented 5 years ago

@agreiner, I have the impression that the use cases you outlined are (at least partially) related to what was discussed in https://github.com/w3c/dxwg/issues/433, so my guess is that they can be addressed accordingly - i.e., if a dataset is published for re-use and/or for reproducing an experiment, the contained distribution(s) are those needed to do the job, and what they should be used for is indicated in the description of the distribution and, in addition, by some properties (possibly providing machine-actionable information - e.g., the specification of the spatial/temporal resolution level, the reference system used, etc.).

That said, the possible use cases are so heterogeneous that, IMO, this problem cannot be solved with metadata only. In many cases, the most effective and straightforward option is to get in touch with the data provider and ask for support. This is of course not in scope of the work we are doing (although there's a BP in DWBP we could refer to), but it is one of the underestimated/neglected aspects of data publication (e.g., as far as I know, the FAIR principles do not mention it explicitly), and one of the main reasons why data are not re-used.

davebrowning commented 5 years ago

@agreiner, I think the 3 use cases you spell out are good tests of whether the WG is comfortable with the level of support we now have in the core vocabulary. We're trying to balance the variation we see across domains/participants in what are seen as distributions of the same dataset (specifically - a very wide variation) and the need in some domains for much more precision. The now rather long note on the vocabulary definition of dcat:Distribution does try to acknowledge that some publishers might commit to certain guaranteed behaviour (that all the distributions are informationally equivalent), but that in general that isn't the case.

I suspect any commitment that two distributions are the same data has to be clear about who is making the commitment. In the first, and perhaps the second, of your use cases it's likely that it would be whoever is the publisher of the dataset. In the third case, a suspicious user might want a decent provenance chain (on both the metadata and the data itself) so that they can assess whether they are happy using it. Although it's not explicit in the Provenance section of the document, the provisional alignment (turtle here) does have dcat:Distribution as a subclass of prov:Entity.

Does that go far enough?

agreiner commented 5 years ago

Since distributions can now differ in pretty significant ways, and since we are suggesting that publishers can use them for informationally equivalent resources (or not), it seems like we should be giving them a way to express that. Maybe something along the lines of @makxdekkers's suggestion of an indication that one distribution is exactly the same data as another. Or it could be as simple as a boolean property that says "this is the complete dataset, not a subset."

rob-metalinkage commented 5 years ago

Using dct:conformsTo could specify the profile - and hence information equivalence to that degree.

(Or define a subproperty of it.)

Maybe we could use as a convention that a dataset is by default a trivial profile of itself, so that S hasDistribution D, D dct:conformsTo S indicates the distribution is the full dataset?

(There is still the issue of the information profile of the distribution and the distribution mechanism conforming to a profile of a service API, for example. What is the actual domain of dct:conformsTo? It's the "distribution", but that involves both access method and payload.)
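The convention Rob sketches could be checked like this (hypothetical ex: identifiers, triples as plain tuples; an illustration of the proposal, not anything in DCAT or DCMI):

```python
# Under the proposed convention, a distribution that dct:conformsTo its
# own dataset is the complete dataset; conformance to any other profile
# signals a subset or alternate view.
graph = {
    ("ex:datasetS", "dcat:distribution", "ex:distFull"),
    ("ex:datasetS", "dcat:distribution", "ex:distSlice"),
    ("ex:distFull", "dct:conformsTo", "ex:datasetS"),   # full dataset
    ("ex:distSlice", "dct:conformsTo", "ex:profileP"),  # profile-defined subset
}

def is_complete_distribution(dist, dataset, graph):
    """Rob's convention: D dct:conformsTo S  =>  D is the full dataset S."""
    return (dist, "dct:conformsTo", dataset) in graph

assert is_complete_distribution("ex:distFull", "ex:datasetS", graph)
assert not is_complete_distribution("ex:distSlice", "ex:datasetS", graph)
```

The appeal of the convention is that it needs no new vocabulary terms; the open question raised above (whether dct:conformsTo describes the payload or the access mechanism) is untouched by the sketch.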

dr-shorthair commented 5 years ago

It is always the prerogative of the publisher or catalog provider to decide how much detail to provide, and where to attach it.

davebrowning commented 5 years ago

As agreed/minuted in https://www.w3.org/2019/02/27-dxwgdcat-minutes#x08, the base issue here is to be closed as having been addressed in the various pull requests listed here.

If there is a strong use case for support of complete/partial distribution semantics within the core vocabulary then either we can re-open this or (preferably) open a new issue. As it stands this appears to be an issue best addressed using profiles.