Model Series of Data as Distributions of a single Dataset

w3c / dxwg

Data Catalog Vocabulary (DCAT)

https://w3c.github.io/dxwg/dcat/

Model Series of Data as Distributions of a single Dataset #1429

Closed sabinem closed 2 years ago

sabinem commented 2 years ago

In the Swiss DCAT profile we model data series, such as yearly elections, as a single dataset in which each distribution contains the election data of one year. I know that DCAT considers this an antipattern. I also know that all properties of Distributions are designed on the assumption that all Distributions of a Dataset have comparable content.

But there is also a sentence at https://www.w3.org/TR/vocab-dcat-3/#Class:Distribution:

> Nevertheless, the question of whether different representations can be understood to be distributions of the same dataset, or distributions of different datasets, is application specific. Judgement about how to describe them is the responsibility of the provider, taking into account their understanding of the expectations of users, and practices in the relevant community.

I was recently surprised to learn that ours is not the only profile that models a series of data with Distributions, and that there is a certain resistance to giving up this pattern. The overall impression is that it would lead to too many datasets and that the data portals would become harder to manage.

On the other hand, if DCAT approved of that pattern (or antipattern), then properties would be needed to describe the content of each Distribution. In the Swiss profile we added the attribute dct:coverage to Distributions, with values that are either spatial or temporal.
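
A minimal Turtle sketch of the pattern described above, for illustration only. The resource names in the ex: namespace, the titles, and the file URLs are made up. Encoding the temporal value as a dcterms:PeriodOfTime with dcat:startDate and dcat:endDate is an assumption; the Swiss profile may encode dct:coverage values differently. The dcterms: prefix denotes the same namespace as dct: in the text.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .   # hypothetical example namespace

# One dataset stands for the whole series of yearly elections.
ex:elections a dcat:Dataset ;
    dcterms:title "Election results" ;
    dcat:distribution ex:elections-2019, ex:elections-2020 .

# Each distribution holds the data of one year, so the contents differ.
ex:elections-2019 a dcat:Distribution ;
    dcterms:title "Election results 2019" ;
    dcat:downloadURL <http://example.org/files/elections-2019.csv> ;
    # Assumed encoding of the temporal coverage value.
    dcterms:coverage [ a dcterms:PeriodOfTime ;
        dcat:startDate "2019-01-01"^^xsd:date ;
        dcat:endDate   "2019-12-31"^^xsd:date ] .

ex:elections-2020 a dcat:Distribution ;
    dcterms:title "Election results 2020" ;
    dcat:downloadURL <http://example.org/files/elections-2020.csv> ;
    dcterms:coverage [ a dcterms:PeriodOfTime ;
        dcat:startDate "2020-01-01"^^xsd:date ;
        dcat:endDate   "2020-12-31"^^xsd:date ] .
```

Here the coverage on each distribution is the only thing that tells a consumer that the two files are not alternative formats of the same data.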

I am curious what DCAT's opinion on this topic is, and I want to launch a discussion about it: can't it be that Distributions sometimes differ in their content, and shouldn't DCAT also support these use cases with appropriate properties?

makxdekkers commented 2 years ago

Thanks @sabinem , very good points.

I have seen at least three national implementations of data catalogues that take this approach with different files in temporal and/or spatial series as distributions of one dataset.

What does worry me is that it makes it hard for reusers -- who harvest or otherwise receive DCAT descriptions -- to understand what's going on if there are some dataset/distribution combinations that follow the recommended pattern of 'distributions contain the same data' and some that follow what is now the anti-pattern.

Would it be sensible to distinguish the pattern by using different classes? E.g.:

for the pattern, use dcat:Dataset and dcat:Distribution
for the alternative use new classes dcat:Series and dcat:SeriesItem, each with their own sets of properties

That way, it would be immediately obvious what pattern is being used.

smrgeoinfo commented 2 years ago

I think @makxdekkers's suggestion is on the right track, recognizing that there are (in this case) two kinds of datasets:

1. Static datasets that do not change after publication, e.g. results of chemical analyses, a set of observation data at a particular place and time.
2. 'Series': dynamic datasets that grow incrementally in space or time, e.g. a series of geologic maps that adds new map sheets periodically, a time series of groundwater levels for a well, statistics for a survey that is repeated monthly...

Kinds of distributions:

1. Download a file. Data content has a fixed spatial, temporal, thematic extent and schema scope. These might be static datasets (sense 1 above) or snapshots of particular extents in a 'series' (sense 2 above).
2. Download a subset of a file or series: data is large, or updated at some interval. The distribution interface provides options to subset data thematically, spatially, temporally, or schematically (choose fields). The source might be a single large static file, or a dynamically updated series.

What is a 'seriesItem'? I'd propose that it is a static snapshot file from a dynamic series (case 1 + 1 above). The analog for a static dataset would be a 'filtered subset'.

sabinem commented 2 years ago

@makxdekkers I think your suggestion is very good. Data publishers are usually very aware of their use case and of whether it is a Series with Series Items or a Dataset with same-content Distributions, just as @smrgeoinfo describes in the use cases above. Having a vocabulary in place that lets publishers translate that awareness into the appropriate terms seems like a good choice, and it will also help users to quickly understand the structure of the data.

riccardoAlbertoni commented 2 years ago

It might be worth noting that the current DCAT Editor's Draft and the second DCAT working draft acknowledge some flexibility in what to consider as items of a dcat:DatasetSeries, which already includes the use of Distributions in place of Datasets.

Indeed, the property dcat:inSeries, which links the items to the dataset series, has no domain specified, and its usage note says:

> Normally, child datasets in dataset series are represented as dcat:Dataset. The use of dcat:Distribution for typing child datasets is however recognized as a possible alternative, whenever it addresses more effectively the requirements of a given application scenario.

I think we can distinguish between informatively equivalent and informatively non-equivalent distributions using the properties dcat:distribution and dcat:inSeries. I would expect informative equivalence to hold between distributions of the same dataset, i.e. distributions linked to the same dataset via dcat:distribution, but not between distributions used as items in a dataset series, i.e. distributions linked via dcat:inSeries to a dcat:DatasetSeries.
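
The distinction drawn above can be sketched in Turtle as follows. This is only an illustration under stated assumptions: all resources in the ex: namespace and all titles are invented, and typing a series member as dcat:Distribution follows the usage note quoted above rather than the normal dataset-based pattern.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .   # hypothetical example namespace

# Case 1: distributions linked via dcat:distribution.
# They are expected to be informatively equivalent (same data, two formats).
ex:budget-2020 a dcat:Dataset ;
    dcterms:title "Budget 2020" ;
    dcat:distribution ex:budget-2020-csv, ex:budget-2020-json .

ex:budget-2020-csv  a dcat:Distribution ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> .
ex:budget-2020-json a dcat:Distribution ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/json> .

# Case 2: distributions linked via dcat:inSeries.
# They are items of a series, so no informative equivalence is expected.
ex:budget-series a dcat:DatasetSeries ;
    dcterms:title "Yearly budgets" .

ex:budget-file-2018 a dcat:Distribution ;
    dcterms:title "Budget 2018" ;
    dcat:inSeries ex:budget-series .
ex:budget-file-2019 a dcat:Distribution ;
    dcterms:title "Budget 2019" ;
    dcat:inSeries ex:budget-series .
```

Under this reading, a consumer would decide how to treat a distribution by the property that links it, not by its class.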

That said, the current dataset series section does not mention the cases in which distributions are used in place of datasets. I guess a couple more examples might help to understand to what extent the current design meets the emerging use cases.

makxdekkers commented 2 years ago

@riccardoAlbertoni This is indeed useful information. If this pattern is already foreseen, the specification should provide more information showing how to do this. I do see that the pattern is mentioned in the information about the property dcat:inSeries, but it should really also be described with the class dcat:DatasetSeries.

For one thing, the definition of the class says that the dataset series is a "collection of datasets ..." which should then say "collection of datasets or distributions ..." And the definitions of the property dcat:inSeries should be changed from "dataset series of which the dataset is part" to "dataset series of which the dataset or distribution is part", or even "dataset series of which the resource is part", given that the domain is left open -- so anything at all can be a member of a dataset series.

As far as I have understood, the pattern with dcat:DatasetSeries --> dcat:Distribution may also require that additional properties should be mentioned in the information about the distribution, for example dct:temporal and dct:spatial. In a way, such distributions become very similar to datasets which means that you may need any of the 'specific properties' for dataset to be used on the distributions, or even all the properties mentioned for dcat:Resource?

Another issue for me is that I think that, in general, having an 'or' in the definition might pose problems for processing. If both patterns are available, i.e. dataset series of datasets and dataset series of distributions, an application that receives such information will have to look for both datasets and distributions that link to it, and might need to take different actions in either case -- and what happens if there are both datasets and distributions linking to it?

It would indeed be good to develop some examples for relevant cases.

smrgeoinfo commented 2 years ago

Here are a couple of sketches trying to elucidate some of these relations:

[Figure: DatasetSeriesSubset diagram]

[Figure: Packaging diagram]

DatasetSeriesSubset Diagram:

Dataset: A collection of data, published or curated by a single agent, and available for access or download in one or more representations, and containing information conforming to some schema. NOTE: identity of dataset is based on the underlying schema, and other variable criteria like authorship, coverage extent, update version.

DataSeries: a collection of datasets sharing the same schema, but differentiated based on some extent criteria like temporal or spatial coverage. The 'member/inSeries' link from a Series to a Dataset is an association class that specifies parameters determining the extent of the series member.

ContentModel: a schema (conceptual, logical, or physical) that characterizes a dataset; defines entities, properties, domains, ranges, and other constraints for elements in the dataset.

Distribution: A specific representation of a dataset. A distribution has a Serialization based on some electronic format and profile that determines how that format is used. The serialization for a distribution must implement the schema for the dataset that is represented by the distribution.

PackagedSubset: a dataset that is subset from a sourceDataset based on some query and parameter values for that query; its content is fixed, and can be assigned an identifier.

FilteredDistribution: a representation of a subset of a Dataset based on some query and parameter values for the query, determined dynamically by a user requesting the data through some interface. Can be assigned an identifier to duplicate the query, but if the source dataset is updated, the actual content might vary over time.

Serialization: a scheme for representing information electronically; based on some format (specified by a MIME type), with optional additional constraints on the format for greater specificity in content, e.g. XML schema, RDF vocabulary used, CSV profile.

parameters: values that specify criteria in query to define a dataset subset, or that define the extent (temporal, spatial, other...) of a particular DataSeries member. The associated downloadURL could be a URITemplate in which the parameters would be substituted.

Packaging Diagram:

Bundle: a collection of files that are associated with a 'Dataset' search result, e.g. DataOne, CKAN, MGDS. Includes at least one Dataset distribution and one other file that is related, but not a distribution.

Document: a file that is related to some dataset and included in a Bundle.

Package: a file that contains all the items in a bundle, e.g. a BagIT or ORE archive file.

jakubklimek commented 2 years ago

I have to say I am very sceptical about allowing Distributions of a Dataset to be informatively non-equivalent, and on top of that, members of DatasetSeries.

In my experience, the prevailing argument for allowing informatively non-equivalent distributions of a dataset, also mentioned in this thread, is "there would be too many datasets". However, to be able to properly describe the distributions, which would now be informatively non-equivalent, one would have to use many of the properties now used for describing datasets also for distributions, as @makxdekkers also points out. This would again result in having many objects, only now it would not be Datasets in a Dataset series, but Distributions of datasets, described as datasets, in a dataset series. And this time, "there would be too many distributions". I do not see the advantage in that.

I think the potential number of datasets is actually not a problem. It is simply a manifestation of the state of things, and it is up to the user interfaces presenting the data to people to handle this, e.g. by grouping by topic, publisher, time, space, etc.

I may be wrong, but it always seemed to me that informatively non-equivalent distributions were an artifact of 1) the missing specification for a dataset series, so that publishers used distributions to "group" related files, or 2) publishers trying to avoid properly describing each file separately.

But now that we have the dataset series, this situation should be modelled as a dataset series made of individual datasets, properly described, served in informatively equivalent distributions. Otherwise, there will be too many options to do the same thing, resulting in interoperability issues.

I may be wrong here, but I think I have not seen a case for informatively non-equivalent distributions that could not be solved by using informatively equivalent distributions of a dataset in a dataset series.
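
A minimal sketch of the modelling argued for above, applied to the yearly elections case from the start of the thread. The ex: resources, titles, and file URLs are invented for illustration; only the dcat: and dcterms: terms are taken from DCAT 3 and DCMI Terms.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .   # hypothetical example namespace

# The series groups the yearly datasets.
ex:elections a dcat:DatasetSeries ;
    dcterms:title "Election results" .

# Each year is a dataset in its own right, with its own description.
ex:elections-2019 a dcat:Dataset ;
    dcterms:title "Election results 2019" ;
    dcat:inSeries ex:elections ;
    dcterms:temporal [ a dcterms:PeriodOfTime ;
        dcat:startDate "2019-01-01"^^xsd:date ;
        dcat:endDate   "2019-12-31"^^xsd:date ] ;
    dcat:distribution ex:elections-2019-csv, ex:elections-2019-json .

# The distributions of one dataset carry the same data in different formats.
ex:elections-2019-csv a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/elections-2019.csv> ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> .

ex:elections-2019-json a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/elections-2019.json> ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/json> .
```

The temporal extent sits on the dataset, where DCAT already provides dcterms:temporal, so no dataset-level properties need to be copied onto distributions.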

matthiaspalmer commented 2 years ago

We have identified a similar problem in Sweden to what @sabinem described. In short, we need to make it easy for people to add more data into an existing dataset.

But we have solved the problem in another way in the Swedish profile. We have allowed the dcat:downloadURL to be repeated. Like this:

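@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

ex:dataset1 a dcat:Dataset ;
     dcat:distribution ex:distribution1, ex:distribution2 .

# One distribution lists several files by repeating dcat:downloadURL.
ex:distribution1 a dcat:Distribution ;
     dcterms:title  "Access via CSV files" ;
     dcat:downloadURL  ex:file1, ex:file2 .

ex:file1 dcterms:title "Budget 2019" .
ex:file2 dcterms:title "Budget 2020" .

ex:distribution2 a dcat:Distribution ;
     dcterms:title "Access via a JSON based API" ;
     dcat:accessURL ex:API .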

This approach has the following merits:

The distributions are comparable, they contain the same data.
The amount of duplication of metadata is minimal
It is relatively easy to explain to data providers what to do
Providing multiple distributions (e.g. API and file based access) is obvious
More information can be provided on each file if there is a need.
There won't be any unnecessary pollution of data portals with many datasets

It could be argued that repeating the dcat:downloadURL is bad, that it is not intended to be used that way. The specification says "the downloadable file" which indeed seems to indicate there should be a cardinality of one.

However, I think it should be investigated if it can easily be tweaked to be compliant. For instance, allowing a dcat:downloadPartURL as an alternative to dcat:downloadURL, maybe introducing a class like dcat:File and just suggesting that it is allowed to provide a dcterms:title on it.
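
To make the suggested tweak concrete, here is a hedged sketch of what it might look like. Note that dcat:downloadPartURL and dcat:File do not exist in DCAT; they are only the names floated in the paragraph above, and the ex: resources are invented.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .   # hypothetical example namespace

ex:dataset1 a dcat:Dataset ;
    dcat:distribution ex:distribution1 .

ex:distribution1 a dcat:Distribution ;
    dcterms:title "Access via CSV files" ;
    # HYPOTHETICAL property, not part of DCAT: one value per file part.
    dcat:downloadPartURL ex:file1, ex:file2 .

# HYPOTHETICAL class, not part of DCAT: a file that only carries a title.
ex:file1 a dcat:File ;
    dcterms:title "Budget 2019" .
ex:file2 a dcat:File ;
    dcterms:title "Budget 2020" .
```

Compared with repeating dcat:downloadURL, a separate property would leave the "one downloadable file" reading of dcat:downloadURL untouched.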

I think the approach above should be considered as a more lightweight alternative to the dataset series approach. It is clear that in some situations people really have data that they want to highlight as independent Datasets and still indicate that they are in a series. Hence, the dataset series is needed, but from what I have seen in Sweden, the lightweight approach would be used much more often (and already is).

makxdekkers commented 2 years ago

@matthiaspalmer Thanks for the detailed information of your solution.

Not questioning at all that this approach fits your needs and the needs of your data providers, I am still a bit uneasy about all the different variants. Section 12.3 in DCAT3 mentions two 'legacy' approaches:

1. The dataset series is typed as a dcat:Dataset, whereas its child datasets are typed as dcat:Distribution's.
2. Both the dataset series and its child datasets are typed as a dcat:Dataset's, and the two are usually linked by using the [DCTERMS] properties dcterms:hasPart / dcterms:isPartOf.

You now outline yet another approach.

One of the main problems I see with all these different solutions is that, while they obviously make absolute sense for data providers in a particular environment, they make it very hard for data consumers to understand what is happening. It seems to me that a data harvester needs to program quite a bit of logic to process these various approaches, and then still needs to do something smart to present data from various sources in a coherent way.

As far as I see it, the approach with dcat:DatasetSeries tries to create a more coherent and widely interoperable approach so that life becomes a lot easier for data consumers.

matthiaspalmer commented 2 years ago

I understand your concern @makxdekkers, but at the same time I am just reporting what kind of needs I have observed. I am also of the opinion that it is better to adapt the model to the world rather than trying to fit the world into the model.

I think the key would be to provide good guidance on when to use the Dataset series and when to use dcat:downloadPartURL. For instance, you need to use a dataset series if you need more metadata than a title for each file, or when the files do not follow the same structure, e.g. when tabular data does not have the same columns in every file.

To be frank, if the Dataset series approach is the only way forward (together with the legacy dcterms:hasPart option), I am confident that the following will happen as soon as the model is accepted (at least in Sweden):

1. We will have to hide the complexity of the model from data providers in tools that strive to be compliant (e.g. EntryScape).
2. We will have to hide datasets in a dataset series from the direct search results in data portals.

If we do not do 1, people will just add one distribution per file, just like the antipattern @sabinem described. We have held half-day workshops for nearly all new customers and tried to instruct them NOT to do this. Still, it is happening all the time; it is an uphill battle. I fear that with the Dataset series being the only alternative the battle will be even harder (unless we provide a simplified "hidden" solution for the multiple file case).

Another option is to diverge from the model, keep the existing approach and do a transform when exposing to the European data portal, but this seems suboptimal.

matthiaspalmer commented 2 years ago

I would also like to point to the discussion we already had in September 2019 about this approach, although at that point the discussion was postponed to DCAT3: https://github.com/w3c/dxwg/issues/868#issuecomment-532989569

matthiaspalmer commented 2 years ago

I am curious, how would you handle a dataset that can be either accessed via 10 files or via a single API (containing all the data) as a dataset series?

Would it be one Dataset series with a single distribution corresponding to the API and then have 10 datasets that all point to the dataset series via dcat:inSeries?

If this is the case, should it then not be stated somewhere that a Dataset series distribution should have the same content as the sum of all the datasets it contains?
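
The reading asked about above could be sketched as follows, with two member datasets standing in for the ten. This is an illustration, not an answer from the specification: it assumes that a dcat:DatasetSeries may carry its own dcat:distribution (DCAT 3 defines it as a subclass of dcat:Dataset), and all ex: resources and file URLs are invented.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .   # hypothetical example namespace

# The series itself exposes the API that serves all the data.
ex:budget a dcat:DatasetSeries ;
    dcterms:title "Budget" ;
    dcat:distribution ex:budget-api .

ex:budget-api a dcat:Distribution ;
    dcterms:title "Access via a JSON based API" ;
    dcat:accessURL ex:API .

# Each file becomes a member dataset with a single file distribution.
ex:budget-2019 a dcat:Dataset ;
    dcterms:title "Budget 2019" ;
    dcat:inSeries ex:budget ;
    dcat:distribution ex:budget-2019-csv .
ex:budget-2019-csv a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/budget-2019.csv> .

ex:budget-2020 a dcat:Dataset ;
    dcterms:title "Budget 2020" ;
    dcat:inSeries ex:budget ;
    dcat:distribution ex:budget-2020-csv .
ex:budget-2020-csv a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/budget-2020.csv> .
```

Nothing in this graph states that the API covers exactly the union of the member datasets, which is the gap the question points at.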

matthiaspalmer commented 2 years ago

I am curious about the 'legacy' option number 1 in 12.3 you pointed to, @makxdekkers. It seems to me that it is the same as the antipattern @sabinem described and @jakubklimek argued against.

The final statement in that section, "These options are not formally incompatible with DCAT", somehow legitimizes the idea that distributions need not contain the same data. I am surprised by this statement, and I think it is in direct conflict with the definition of distributions in 6.8.

makxdekkers commented 2 years ago

> I am curious about the 'legacy' option number 1 in 12.3 you pointed to, @makxdekkers. It seems to me that it is the same as the antipattern @sabinem described and @jakubklimek argued against.

Yes it is.

> The final statement in that section, "These options are not formally incompatible with DCAT", somehow legitimizes the idea that distributions need not contain the same data. I am surprised by this statement, and I think it is in direct conflict with the definition of distributions in 6.8.

But the sentence continues: "so they can coexist with dcat:DatasetSeries during the upgrade to DCAT 3", which seems to imply that applications that do it this way are expected to upgrade to DCAT3 and then move to the approach with dataset series.

But I agree you could read this in several ways, i.e. "if they upgrade" or "when they upgrade".

makxdekkers commented 2 years ago

> Another option is to diverge from the model, keep the existing approach and do a transform when exposing to the European data portal, but this seems suboptimal.

The way I see it, there are two sides to this:

1. if data providers all do their own thing, i.e. create metadata according to their own interpretation of a standard, the burden of converting the various approaches to something coherent lies with the data consumer
2. if data providers, doing their own thing locally, map their interpretation to a common interoperable approach, the data consumer can rely on receiving consistent metadata from anywhere.

It might be that option 2 is the most efficient as the data provider has all the information about both the internal approach and the common interoperable approach and therefore can make the best mapping.

In option 1, the data consumer might wonder what the purpose of using a standard is, if all data providers do things their own way in any case. The data consumer would need to keep knowledge of all existing variants to be able to process the information.

makxdekkers commented 2 years ago

> I understand your concern @makxdekkers, but at the same time I am just reporting what kind of needs I have observed. I am also of the opinion that it is better to adapt the model to the world rather than trying to fit the world into the model.

That is a good point, but in this case, it might mean that if the world is a mess, the standard model should reproduce the mess. Or do you mean that DCAT should implement your model as the only one?

This is a fundamental question with standardisation. Either the "world" aligns with a standard so that everybody knows what to provide and what to consume, i.e. interoperability, or the standard aligns with the world, in such a way that everybody can continue to do what they like, and there is basically no benefit of using a standard.