w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

DCAT: Proposal for an updated definition for the concept “dataset” #1195

Open aidig opened 4 years ago

aidig commented 4 years ago

DCAT: Proposal for an updated definition for the concept “dataset”

Background and problem statement

This is a joint proposal for an updated definition for the concept “dataset”, made by the Danish Agency for Digitisation [1] (member of W3C) and the Danish Agency for Data Supply and Efficiency (member of OGC and, through Danish Standards, member of ISO/TC 211).

The Danish government wants to describe data consistently across authorities, processes and IT systems [3]. In order to achieve this, definitions of the same concepts must be aligned where possible, not only within domains but also across domains.

“Dataset” is used in many different domains and is a highly relevant concept these days, e.g. in the context of the European INSPIRE Directive [4] and the PSI Directive [5]. The Agency for Digitisation and the Agency for Data Supply and Efficiency therefore ask W3C, OGC and ISO/TC 211 to agree on using the same definition and notes for “dataset”, and hereby submit a first draft for discussion.

Proposal for updated definition and related notes

dataset collection of data that is regarded as a unit

Data catalog specific notes:
Note 10 to entry: In the context of DCAT, a dataset is published or curated by a single agent.
Note 11 to entry: In the context of DCAT, a dataset is available for access or download in one or more representations.

Geographic information specific notes:
Note 10 to entry: In the context of geographic information, a dataset can be a smaller grouping of data which, though limited by some constraint such as spatial extent or feature type, is located physically within a larger dataset. Theoretically, a dataset can be as small as a single feature or feature attribute contained within a larger dataset.
Note 11 to entry: In the context of geographic information, a hardcopy map or chart can be considered a dataset.

Examples of datasets that are not “typical” according to the notes for “dataset”:

In order to highlight the need for all the notes to the definition of “dataset”, here is a list of examples that are not “typical” according to those notes. The examples are not intended to be included in any standard, but are meant as a basis for the discussion. The examples follow the same numbering as the notes above.

Example 1: Dataset that is too large or complex to analyze with the current technologies, but that might be useful when technology evolves.
Example 2: A temporary dataset, such as the result of an SQL query.
Example 3: A planned dataset, that is not yet available or collected.
Example 4: OpenStreetMap data; in general: data collected via crowdsourcing.
Example 5: A use case where data from many different domains (different subjects) is combined to solve a particular problem or need (one purpose).
Example 6: GeoPackage containing vector, raster and styling; EU-dataset where some of the data are INSPIRE-harmonized and some are not.
Example 7: Data collected at the local level in a country, and then aggregated to one dataset containing data for the whole country; data sent in to EEA by the EU member states, and which is then aggregated to one dataset.
Example 8: Dataset containing data having different licences; dataset created by taking subsets from different other datasets.

References

makxdekkers commented 4 years ago

@aidig I really like your attempt to firm up the definition of "dataset". The current definition "A collection of data, published or curated by a single agent, and available for access or download in one or more representations" can indeed be too encompassing for many practical applications, although it already includes your notes 3 and 4. However, in my opinion your notes 1 and 5 through 8 are too restrictive -- even if in many cases they would hold, I don't feel we should make them part of the definition. I do fully agree with your note 9 -- it basically says that it is the publisher who decides what a dataset is. One of the problems that we faced in the development of DCAT (version 2014) was that as soon as we tried to make the definition more specific, there was always someone who brought up an example of something that did not fall under the definition but could still usefully be described as a dataset. In a way, your notes and your examples of what is not a dataset may be particularly relevant for your domain; could they be part of a domain-specific application profile or of domain-specific guidance?

aidig commented 4 years ago

The notes are meant to illustrate typical - but not defining - characteristics of datasets, and the related examples are thus presented as argumentation for why these typical characteristics should not form an integral part of the short formal definition as it quite rightly would be too restrictive. (Like the current definition could be perceived as stating excluding conditions) All the notes begin with "Typically..." and are to be considered supplementary comments to the definition.

aidig commented 4 years ago

And for clarification, the examples are not of "what is not a dataset", but rather atypical datasets that can be discussed. Perhaps we should have highlighted this important point in the text. (Headline changed to "Examples of datasets that are not “typical” according to the notes for “dataset”".)

rob-metalinkage commented 4 years ago

As the OGC is a member-driven organisation, I cannot make a definitive statement at this stage about what process or steps the OGC can and should take. However, as the OGC staff member managing the OGC Definitions Server (Registry Infrastructure) [1], I can offer to participate in these discussions and seek to implement appropriate standards.

The Definitions Server delegates governance to artefacts managed by OGC Working Groups and uses a consistent view of definitions based on SKOS. At this level the "Dataset" is a SKOS ConceptScheme, although single standards may include models and codelists, so the concept of relationships between Datasets is a significant one. I plan to implement DCAT metadata views using the Content-negotiation-by-profile mechanism for these datasets.

The trigger to implement DCAT views is a DCAT profile that specifies a canonical set of relationships between Datasets - I could implement this now in OGC controlled namespace using the PROF vocabulary and SHACL - but such a profile has a wider scope than the OGC, so we're at the "wait and see" stage.

The second part of the puzzle is how OGC specifications refer to the concept of Dataset. The opportunity here is to publish mappings between terms used in OGC standards and other environments, such as the DCAT profile postulated.

In 2020 I will be actively seeking to instigate a community of practice across standards organisations to establish interoperability for definitions publication infrastructures. I believe your concerns could be met through profiling DCAT for distinctly different Use Cases, and using hierarchies in the profile models to identify what is common and what is distinct. In this context the loosening or clarification of the underlying definition makes sense, and can be informed best by establishing the set of relatively simplified profiles needed to describe the cases of interest. OGC can implement these profiles simply by adding them to the knowledge graph in the definitions server and making appropriate assertions about how OGC defined definitions relate to them.

[1] https://www.opengeospatial.org/def-server

makxdekkers commented 4 years ago

@aidig It seems to me that the new definition you propose, "collection of data that is regarded as a unit", is even vaguer than the current one -- as you have moved the part of the current definition "published or curated by a single agent, and available for access or download in one or more representations" to the notes (3 and 4), changing these aspects from applying to all datasets described using DCAT to being only typical for datasets. Furthermore, the fact that the collection of data is "regarded as a unit" is already implicit in the fact that the collection is described as a dcat:Dataset, which is a unit. As I wrote before, your observations about what is and what isn't typical for datasets may apply to datasets in your domain or application, and as such could very well be part of domain-specific guidance.

aidig commented 4 years ago

Yes, the proposed definition is less restrictive as it can be argued that the current definition does not capture all instances of possible datasets (see the atypical examples).

Aklakan commented 4 years ago

For strict semantics, the definition "a dataset is an instance of a data model" is the best I could find to date. If the data model is formally specified, one can verify whether a dataset conforms to it. For example, with RDF we are even in the fortunate position that equivalence of datasets is defined, so for two conforming datasets there is a well-defined procedure to determine equivalence.

The problem right now - at least to my understanding - is that dcat:Datasets cannot be linked by owl:sameAs because the identity of DCAT datasets includes the authority that publishes them. So even if the exact same sequence of bytes was published by different authorities, they can never be the same dcat:Dataset.

A possibly better approach in the future would be to detach the content from the record that describes a dataset as published by some authority. For example, if publisher A distributes some content (let's assume a set of triples) as a single download URL, and publisher B re-publishes the same content partitioned by the RDF predicate, this could be expressed as:

#x: = Namespace for future extensions

datasetXPublishedByA a dcat:Dataset ; dct:publisher A ;
  x:content contentXByA .

datasetXPublishedByB a dcat:Dataset ; dct:publisher B ;
  x:content contentXByB .

contentXByA a x:Content ;
  dcat:distribution [ dcat:downloadURL <everything.ttl> ] .

contentXByB a x:Content ;
  dcat:distribution [
    # Merge all content of the union members according to the data model,
    # and one obtains the distribution as a single downloadURL 
    a x:UnionDistribution ;
    x:partitionPredicate rdf:type ;
    x:qualifiedMember [
      x:partitionValue foaf:Person;
      x:dataset [ a dcat:Dataset ; x:content [ dcat:distribution [ dcat:downloadURL <foaf-person.ttl> ] ] ]
    ] ;
    x:qualifiedMember [ x:partitionValue dbpedia:Place ... ]
  ] .

This makes it safe, for example, to state: contentXByA owl:sameAs contentXByB, as the content denotes a specific instance of the data model, which is a specific set of triples.

So thinking about this, I suppose I am actually proposing to factor out a dcat:Content from dcat:Dataset in the future; whereas a dcat:Dataset as it stands combines distributions of roughly similar content with a publisher.
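The idea that a partitioned re-publication carries the same content as the single-file original can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: triples are plain tuples of strings, there are no blank nodes, and every triple uses the partition predicate. The names `partition_by_predicate` and `union` and the `ex:` identifiers are invented for the example and are not part of DCAT.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def partition_by_predicate(triples, predicate):
    """Group triples that use `predicate` by their object value.

    Triples with a different predicate are collected under the key None.
    """
    parts = defaultdict(set)
    for s, p, o in triples:
        key = o if p == predicate else None
        parts[key].add((s, p, o))
    return dict(parts)

def union(parts):
    """Merge all partitions back into one set of triples."""
    merged = set()
    for members in parts.values():
        merged |= members
    return merged

# Content as publisher A distributes it: one set of triples.
content_by_a = {
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:copenhagen", RDF_TYPE, "dbpedia:Place"),
}
# Content as publisher B distributes it: one partition per rdf:type value.
content_by_b = partition_by_predicate(content_by_a, RDF_TYPE)

# The union of B's partitions is the same set of triples that A published.
print(union(content_by_b) == content_by_a)
```

Set equality is only sufficient here because the sketch excludes blank nodes; with blank nodes, comparing two graphs would need an isomorphism check instead.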

rob-metalinkage commented 4 years ago

@Aklakan there is an alternative approach: don't use owl:sameAs for datasets, which are records about actual data from the point of view of a catalog, so the same data would have different records in two different catalogs. What might be the sameAs are distributions, if they are both cataloguing the same access point for the data.

Aklakan commented 4 years ago

@rob-metalinkage Yes, we have an agreement that owl:sameAs does not work. However, I am not sure if your statement 'datasets are records about actual data, from the point of view of a catalog' is really correct. DCAT distinguishes the concepts of dcat:Dataset and dcat:CatalogRecord - and this distinction makes sense.

So as I see it, a dcat:Dataset relates more to the concept of '(a unit of) content that was published by a (single) authority'. The nature of the content may be as abstract as 'the sequence of images that makes up the Lord of the Rings movie'. There is freedom here, but when formal data models are involved, this can be made much more concrete. So if this is what the dataset is about, then different distributions should be descriptions of concrete technical aspects of this idea, most prominently structure and access mechanisms, such as files with varying image quality. The CatalogRecord then has information about when a dataset was made available in a catalog.

I considered that owl:sameAs could be applied on the distribution level, but I tend to think that the identity of distributions is tied to technical aspects and structuring of the content. For example, I would not consider distribution of a file via a torrent to be equivalent to distribution via an HTTP URL or via a Git URL. I'd rather say that the content (in the abstract sense - not in the sense of syntactic representation or access mechanism) in such cases was equal. (So maybe a generalization of DCAT would be C(ontent)CAT.)

As for the examples of 'what is not a dataset', I also tend to disagree - every electronic resource is eventually a sequence of bytes and thus data. That's why HTTP has the content type which tells a client how the bytes are to be interpreted - in the worst case this really is application/octet-stream.

makxdekkers commented 4 years ago

Just a word of warning: we seem to get into discussions about how to better define "dataset" every once in a while -- we did have a long discussion during development of DCAT2014 -- and we have never found a better one. When making it more general, allowing it to be anything at all, i.e. indistinguishable from rdfs:Resource, we lose the idea that someone should have responsibility for it, which I think is not a good idea. When trying to make it more specific, we usually end up with someone mentioning a kind of data that falls outside the proposed definition but could still be described as a dataset. In this particular case, suggesting that there is a data model that underlies the data, what about if there is no model, e.g. for unstructured data or raw data? We run the risk of then having to agree what we mean by data model. And the sentence "the content (in the abstract sense - not in the sense of syntactic representation or access mechanism) in such cases was equal" actually describes the consensus about the relationships between the distributions of a dataset, although we relaxed it a bit by not requiring full information equivalence in all cases.

agreiner commented 4 years ago

I think the tricky thing about this question is that there are many ways in which datasets can be equivalent to each other that may be trivial for one use case and crucial for others. Being able to say that two datasets are the same dataset could mean that their data files have the same checksums, or that they differ only in format, or that they differ only in metadata, or that they differ only in translating from one unit of measure to another, etc. I think it could be useful to define terms that describe the different types of equivalence, if there were a tractable list of useful equivalencies.
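To make two of these notions concrete, here is a small Python sketch that contrasts byte-level equivalence (same checksum) with record-level equivalence (same data, different format). The sample data, the function names and the normalization used (all values compared as strings) are assumptions chosen for illustration; a real comparison would need rules agreed for the data in question.

```python
import csv
import hashlib
import io
import json

def checksum(data: bytes) -> str:
    """Byte-level identity: SHA-256 digest of the serialized file."""
    return hashlib.sha256(data).hexdigest()

def records_from_csv(data: bytes):
    """Parse CSV bytes into a sorted list of (column, value) row tuples."""
    rows = csv.DictReader(io.StringIO(data.decode("utf-8")))
    return sorted(tuple(sorted(row.items())) for row in rows)

def records_from_json(data: bytes):
    """Parse a JSON array of flat objects into the same normalized form."""
    rows = json.loads(data.decode("utf-8"))
    return sorted(
        tuple(sorted((key, str(value)) for key, value in row.items()))
        for row in rows
    )

# The same two observations, once as CSV and once as JSON (hypothetical data).
csv_file = b"station,temp_c\nA,12\nB,15\n"
json_file = b'[{"station": "B", "temp_c": 15}, {"station": "A", "temp_c": 12}]'

same_bytes = checksum(csv_file) == checksum(json_file)
same_records = records_from_csv(csv_file) == records_from_json(json_file)
print(same_bytes, same_records)
```

The two files fail the checksum comparison but pass the record comparison, which is the "differ only in format" case from the list above.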

Aklakan commented 4 years ago

> what about if there is no model, e.g. for unstructured data or raw data?

Then it's still text/plain or application/octet-stream - no? So data without any reference to a model does not exist imho. There always has to be some language in which the data is represented - even if this 'language' just happens to be sequences of bytes. If an instance of a text or binary document is equal to the one described by a distribution of a dcat:Dataset, then it is highly likely -- but not mandatory -- that we are talking about the same dataset. A distribution is technical - in the simple case it points to a document with a concrete syntax.

But if you consider an RDFa file, although it is XML, it can be interpreted in many ways: is it text? XML? (X)HTML? RDF? The meaning has to be specified on the dataset level. If the dataset is about text, then mixing distributions with content types application/pdf, text/plain, application/msword and application/xhtml+xml is a reasonable choice. If the dataset is about triples, then mixing distributions of application/xhtml+xml with text/turtle is certainly valid, as the former should then be assumed to contain the same RDFa annotations as the Turtle document. However, application/pdf would not make sense (unless there was a standard to encode triples in PDF).

So the specification of a dataset also constrains the possible interpretations of distributions/concrete syntaxes, because the interpretation has to conform to the specified model.
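As a rough sketch of this constraint, the following Python snippet checks a dataset's distributions against the model the dataset is declared to be about. The model labels `"text"` and `"rdf-triples"`, the lookup table and the function name are hypothetical; the media types listed are just the ones named in the examples above, not an authoritative registry.

```python
# Hypothetical mapping from the model a dataset is "about" to the media types
# that can carry that model. The entries follow the examples given above.
ALLOWED_MEDIA_TYPES = {
    "text": {
        "application/pdf",
        "text/plain",
        "application/msword",
        "application/xhtml+xml",
    },
    "rdf-triples": {
        "application/xhtml+xml",  # assumed to carry the triples as RDFa
        "text/turtle",
    },
}

def inconsistent_distributions(model, media_types):
    """Return the media types that cannot represent the dataset's model."""
    allowed = ALLOWED_MEDIA_TYPES[model]
    return sorted(set(media_types) - allowed)

# A dataset about triples that also offers a PDF distribution.
print(inconsistent_distributions("rdf-triples", ["text/turtle", "application/pdf"]))
```

The same PDF distribution would pass for a dataset declared to be about text, which is the point: the check depends on what the dataset says it is, not on the file alone.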

makxdekkers commented 4 years ago

@Aklakan As I wrote, we would be getting into a discussion of what is a model. You call text/plain and application/octet-stream 'models' which I find a stretch.

In any case, the subject of the issue was discussing an update to the definition of "dataset" but it seems to have morphed into a discussion about 'equivalence' of distributions. See also https://github.com/w3c/dxwg/issues/52.

dr-shorthair commented 4 years ago

My hunch is that it will be difficult, if not impossible, to write a water-tight rule for this. In which case, I think the best we can do is (a) offer some guidance, probably in the form of examples, and (b) trust the provider to make a sensible judgement about what is a 'dataset' to their potential audience.

Remember, DCAT is aimed at a very general set of use-cases. The catalogue-dataset-distribution backbone is an important and useful pattern but might not map perfectly onto every useful application.

aidig commented 4 years ago

In the following document, we provide further elaboration on this issue. It is currently in the form of a draft OGC Discussion Paper, see https://github.com/heidivanparys/discussion_paper_dataset/releases/tag/v20200306

rob-metalinkage commented 4 years ago

A few comments

From the DCAT perspective, the comments about identifiers are most relevant, in that DCAT relates to making statements about datasets - without an identifier this doesn't make sense.

The key issue here is: "Note 2 to entry: Typically, a dataset is described using metadata elements including an identifier and a title." but that is mixing up the metadata, identifier and title concerns. IMHO these should be separated, and identity treated with sufficient detail to make it clear what the expectation is about endurance/perdurance: is it a community-of-practice decision how identity evolves for a dataset? That is, tie this into versioning, but perhaps do not try to over-define it.

heidivanparys commented 3 years ago

FYI: there is a workshop this week on how to achieve a shared understanding of concepts across domains, see https://sharedconcepts.github.io/. This issue is a good use case: it is not easy to change established definitions, for several reasons. However, a possible way forward would be to instead create and publish mappings between concepts, such as the different notions of "dataset" in different organisations.

agbeltran commented 3 years ago

> FYI: there is a workshop this week on how to achieve a shared understanding of concepts across domains, see https://sharedconcepts.github.io/. This issue is a good use case: it is not easy to change established definitions, for several reasons. However, a possible way forward would be to instead create and publish mappings between concepts, such as the different notions of "dataset" in different organisations.

@heidivanparys will the dataset definition be discussed in the workshop? I see a session on dataset quality, but not sure if the dataset definition will also be discussed in detail. Thanks

agbeltran commented 3 years ago

Looking to address the original issue by @aidig (and considering the subsequent discussion):

dataset collection of data that is regarded as a unit

For DCAT, the unit is the class dcat:Dataset itself. Perhaps the definition could be changed to:

"A collection of data regarded as a unit, published or curated by a single agent, and available for access or download in one or more representations".

  • Note 1 to entry: Typically, a dataset is collected for a certain purpose.

  • Note 2 to entry: Typically, a dataset is described using metadata elements including an identifier and a title.

IMO, this is a given by the properties of the class, so probably doesn't need any further clarifications.

  • Note 3 to entry: Typically, a dataset is available for use in one or more representations.

This is already mentioned in the first usage note.

  • Note 4 to entry: Typically, a dataset is published or curated by a single agent.

This is already in DCAT definition.

  • Note 5 to entry: Typically, the data in a dataset are related through a common topic.

  • Note 6 to entry: Typically, the data in a dataset have the same syntactic structure.

  • Note 7 to entry: Typically, the data in a dataset are managed using the same governance processes.

  • Note 8 to entry: Typically, the data in a dataset have a shared data provenance.

@makxdekkers mentioned earlier, and I agree, that these seem too restrictive to include in the definition. Those interpretations are possible with the current DCAT vocabulary (considering datasets and distributions).

  • Note 9 to entry: The arrangement of data in one or more datasets is a decision, based on formal requirements or informal considerations.

I suggested adding another usage note highlighting this point.

Data catalog specific notes:
Note 10 to entry: In the context of DCAT, a dataset is published or curated by a single agent.

Geographic information specific notes:
Note 10 to entry: In the context of geographic information, a dataset can be a smaller grouping of data which, though limited by some constraint such as spatial extent or feature type, is located physically within a larger dataset. Theoretically, a dataset can be as small as a single feature or feature attribute contained within a larger dataset.

I don't think we need to add anything to the definition or usage notes to address this specific case. Examples indeed could be added for this and other points.

Note 11 to entry: In the context of DCAT, a dataset is available for access or download in one or more representations.
Note 11 to entry: In the context of geographic information, a hardcopy map or chart can be considered a dataset.

I think that the specific use case can be addressed by accessing the dataset, so I don't think a clarification is needed.

What do people think about this proposal?

heidivanparys commented 3 years ago

> FYI: there is a workshop this week on how to achieve a shared understanding of concepts across domains, see https://sharedconcepts.github.io/. This issue is a good use case: it is not easy to change established definitions, for several reasons. However, a possible way forward would be to instead create and publish mappings between concepts, such as the different notions of "dataset" in different organisations.
>
> @heidivanparys will the dataset definition be discussed in the workshop? I see a session on dataset quality, but not sure if the dataset definition will also be discussed in detail. Thanks

@agbeltran The dataset definition will be used as an example of a widely used concept that has different definitions in different organisations in the presentation "Concepts - How to describe and harmonize them (Danish experiences)", it won't be discussed in detail. The topic is broader, on general approaches of mapping concepts.

heidivanparys commented 3 years ago

>   • Note 5 to entry: Typically, the data in a dataset are related through a common topic.
>
>   • Note 6 to entry: Typically, the data in a dataset have the same syntactic structure.
>
>   • Note 7 to entry: Typically, the data in a dataset are managed using the same governance processes.
>
>   • Note 8 to entry: Typically, the data in a dataset have a shared data provenance.
>
> @makxdekkers mentioned earlier, and I agree, that these seem too restrictive to include in the definition. Those interpretations are possible with the current DCAT vocabulary (considering datasets and distributions).

There seems to be a misunderstanding about the use of notes. The notes are not considered to be a part of the definition. As @aidig already wrote in https://github.com/w3c/dxwg/issues/1195#issuecomment-566719086:

> The notes are meant to illustrate typical - but not defining - characteristics of datasets, and the related examples are thus presented as argumentation for why these typical characteristics should not form an integral part of the short formal definition as it quite rightly would be too restrictive. (Like the current definition could be perceived as stating excluding conditions) All the notes begin with "Typically..." and are to be considered supplementary comments to the definition.

See also e.g. section 2.3 in http://www.nordterm.net/filer/publikationer/guider/Guide_to_Terminology.pdf

> Definitions shall be as brief as possible. Carefully written definitions should contain only information required to place the concept correctly in the concept system. Any additional information or examples should be placed in a note. Such additional information could be, for example, the most important inessential characteristics or a list of typical objects included in the extension of the concept.

The proposal basically follows the formatting of ISO 10241-1, see also https://www.iso.org/sites/directives/current/part2/index.xhtml#_idTextAnchor216. The notes would be "usage notes" using SKOS terms, as used in the DCAT documentation.