w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

question > is a software solution a dcat:Dataset? #1221

Closed bertvannuffelen closed 4 years ago

bertvannuffelen commented 4 years ago

Dear community,

I would like your advice on the following topic:

Can a software solution be considered as a dcat:Dataset?

RubenVerborgh commented 4 years ago

A collection of data, published or curated by a single agent, and available for access or download in one or more representations. —https://w3c.github.io/dxwg/dcat/#Class:Dataset

A software solution fails to meet the "collection of data" for me.

makxdekkers commented 4 years ago

In my mind, we should not try to put limits on what can be a dcat:Dataset. Any digital object that is published or curated by a single agent, and available for access or download in one or more representations, qualifies. All discussions that I remember from the last decade ended up with that conclusion. If people want to describe software as a dcat:Dataset, no-one can stop them. In fact, if I remember correctly, the development of ADMS-AP in Europe explicitly included software packages as one of the Asset types (with Assets modelled as instances of dcat:Dataset).

RubenVerborgh commented 4 years ago

In my mind, we should not try to put limits on what can be a dcat:Dataset.

But then we need to change the definition.

If people want to describe software as a dcat:Dataset, no-one can stop them.

Yeah, but no one can stop them from describing software as ex:Mammal either. The question is what the DCAT spec says about applicability.

makxdekkers commented 4 years ago

In my mind, we should not try to put limits on what can be a dcat:Dataset.

But then we need to change the definition.

I don't see why. The definition is sufficiently broad to also encompass software.

If people want to describe software as a dcat:Dataset, no-one can stop them.

Yeah, but no one can stop them from describing software as ex:Mammal either. The question is what the DCAT spec says about applicability.

Well, it would depend on the definition of ex:Mammal whether that is 'wrong' or not. The point is that it is very well possible -- and it has been done in ADMS-AP -- to consider software a dcat:Dataset. As far as I see it, software is a type of digital resource, and I would argue it is a collection of instructions or procedures. As long as it is published or curated by a single agent, and available for access or download in one or more representations, I don't see anything in the DCAT spec that would lead to a conclusion that software would not qualify.

RubenVerborgh commented 4 years ago

The definition is sufficiently broad to also encompass software.

The official definition starts with

A collection of data, published or curated…

Your take above started with

Any digital object that is published or curated…

Different IMHO.

As far as I see it, software is a type of digital resource, and I would argue it is a collection of instructions or procedures.

But then I'd argue that a JPEG file is a collection of pixels, etc. I don't mind that, it's just that in that case "dataset" becomes "every digital object", which does not seem like the intention.

published or curated by a single agent

So collaboratively edited open-source software doesn't count?

Another question is what we gain by including software, JPEG, etc. in the definition of dataset. It then essentially becomes equivalent to "information resource", including all RDF resources except those that are non-information resources, like real-world entities. So I am not sure the concept of dataset is then very meaningful any longer.

makxdekkers commented 4 years ago

@RubenVerborgh Yes, every discussion that I've seen over the years always ended up with the conclusion that, yes, any digital resource qualifies as long as there is someone (person, group, organisation) that publishes and curates it.

Information published as HTML files or PDFs, images represented as JPEGs or PNGs, music encoded in MP3, tables and spreadsheets published in Excel or CSV, basically anything goes. The conclusion has always been that we either leave it very broad, or we need to draw the boundaries in such a way that it is completely clear what is in and what is out. I remember spending a lot of time on that discussion several times and we have never been able to agree on those boundaries. Now, the situation is that DCAT has been around for six years with the 'vague' definition, and narrowing down the definition could potentially break implementations, in case people have interpreted the definition in a liberal way.

So rather than posing the question "what [do] we gain by including software, JPEG, etc. into the definition of dataset?", what we should be asking is "what do we gain by retrospectively narrowing the definition of dataset to exclude certain types of digital objects?".

I have said it before and I'll say it again: let's not go there -- we can argue for a long time, like we've done in the past, and in the end, in my humble opinion, it's not going to make things any better. Good is good enough. Let's work with what we have.

dr-shorthair commented 4 years ago

DCAT defines two sub-classes of dcat:Resource - dcat:Dataset and dcat:DataService. Other kinds of resource can be defined as additional sub-classes of dcat:Resource to support other applications.

We were very clear about this in https://www.w3.org/TR/vocab-dcat-2/#dcat-scope and https://www.w3.org/TR/vocab-dcat-2/#Class:Resource

@bertvannuffelen My hunch is that software and code is a stretch within 'dataset' but it would not be hard to specify a new class for your application.

aidig commented 4 years ago

DCAT does indeed define two sub-classes of catalogued resources, dcat:Dataset and dcat:DataService, and suggests defining additional sub-classes for other applications. However, it also suggests using dcat:Resource and classifying it (via dct:type). If it is necessary to coin new sub-classes, one would hope it is possible to draw on the existing well-governed and broadly recognised sets of resource types (controlled vocabularies)...

The class of all cataloged resources, the super-class of dcat:Dataset, dcat:DataService, dcat:Catalog and any other member of a dcat:Catalog. This class carries properties common to all cataloged resources, including datasets and data services. It is strongly recommended to use a more specific sub-class. When describing a resource which is not a dcat:Dataset or dcat:DataService, it is recommended to create a suitable sub-class of dcat:Resource, or use dcat:Resource with the dct:type property to indicate the specific type. (https://www.w3.org/TR/vocab-dcat-2/#classifying-dataset-types)

The application of the recommended controlled vocabularies for the classification of datasets/resources is, however, somewhat unclear, as described in this issue: cf. Unclear classification of dataset/resources [ID1187]. Note, by the way, that the type 'Software' is a type separate from 'Dataset' in the DCMI Type vocabulary, ISO 19115 MD_ScopeCode and DataCite ResourceType.
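
As a rough illustration of the two options in the quoted passage, the sketch below builds the triples for one software entry each way. The `ex:` names (`ex:Software`, `ex:my-tool`) are hypothetical and only there for illustration; `dctype:Software` stands for the Software class of the DCMI Type vocabulary. This is one possible reading of the recommendation, not text taken from the DCAT specification.

```python
# Triples are (subject, predicate, object) tuples written with prefixed names.
# ex: is a hypothetical namespace; the other prefixes are the usual ones.

def as_subclass(item):
    """Option 1: coin a sub-class of dcat:Resource and type the item with it."""
    return {
        ("ex:Software", "rdfs:subClassOf", "dcat:Resource"),
        (item, "rdf:type", "ex:Software"),
    }


def as_typed_resource(item):
    """Option 2: use dcat:Resource directly and classify it via dct:type."""
    return {
        (item, "rdf:type", "dcat:Resource"),
        (item, "dct:type", "dctype:Software"),
    }


# Print the second option in a Turtle-like form, one triple per line.
for triple in sorted(as_typed_resource("ex:my-tool")):
    print(" ".join(triple), ".")
```

In neither variant is the item typed as dcat:Dataset, which matches the observation that 'Software' and 'Dataset' are separate types in the controlled vocabularies listed above.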

makxdekkers commented 4 years ago

@bertvannuffelen My hunch is that software and code is a stretch within 'dataset' but it would not be hard to specify a new class for your application.

So I guess we can't avoid discussing this yet again...

@dr-shorthair When you say it "is a stretch" you are basically talking about boundaries -- you apparently have an idea of what a 'collection of data' is and what is not. As far as I am concerned, if we want to make this operational, we would need to have a clearly and explicitly written definition of Dataset that would make it abundantly clear to anyone that software can't possibly be in that same class. The case is that we don't have such a definition.

I am wondering whether your opinion is that software is not a 'collection of data'? Look at https://en.wikipedia.org/wiki/Software: "...software, is a collection of data or computer instructions that tell the computer how to work". Not saying that I think Wikipedia is the best place to get good definitions, but at least it shows that there are people who think software is a collection of data.

Of course, this group could decide to narrow down the definition or issue additional guidance to say that only data of a certain, clearly defined or enumerated, kind can be considered in-scope -- but then we get into trouble with backward compatibility, given that I know for sure that people have described software with ADMS which is a W3C-recognised profile of DCAT.

So, I'd ask again "what do we gain by retrospectively narrowing the definition of dataset to exclude certain types of digital objects?" and also, "what problems does narrowing the definition create for existing applications?"

dr-shorthair commented 4 years ago

I would go the other way round and ask 'what are the key descriptors or metadata required to index software in a catalog?'.

If they are the same as dcat:Dataset then it is fine.

If they use the properties of dcat:Dataset but need more, then sub-class the Dataset class.

If there are properties of dataset descriptions that do not apply to software, then another sub-class of dcat:Resource (i.e. a sibling class to dcat:Dataset) might be a better option.
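
One way to read these three rules is as a small decision procedure. The sketch below is only an illustration under stated assumptions: `DATASET_PROPS` is a small, hand-picked subset of the properties usable on dcat:Dataset, and `ex:programmingLanguage` is a hypothetical extra property.

```python
# Illustrative subset of properties usable on dcat:Dataset (not the full list).
DATASET_PROPS = {
    "dct:title",
    "dct:description",
    "dcat:distribution",
    "dct:accrualPeriodicity",
    "dct:spatial",
    "dct:temporal",
}


def suggest_class(needed, not_applicable=frozenset()):
    """Suggest a class for a kind of resource.

    needed:         properties the catalogue wants to record
    not_applicable: dataset properties that make no sense for this kind of resource
    """
    if set(not_applicable) & DATASET_PROPS:
        # Some dataset properties do not apply: sibling class of dcat:Dataset.
        return "new sub-class of dcat:Resource"
    if set(needed) - DATASET_PROPS:
        # Dataset properties fit, but more are needed: extend dcat:Dataset.
        return "new sub-class of dcat:Dataset"
    # Dataset properties are sufficient as they are.
    return "dcat:Dataset"


print(suggest_class({"dct:title", "dcat:distribution"}))
print(suggest_class({"dct:title", "ex:programmingLanguage"}))
print(suggest_class({"dct:title"}, not_applicable={"dct:accrualPeriodicity"}))
```

Which properties count as "not applicable" is a judgement made by the cataloguer; the function only records the consequence of that judgement.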

makxdekkers commented 4 years ago

@dr-shorthair I understand your point. But I am just wondering if it is worth the time and effort in this group to try and analyse the requirements of various types of collections of data. Your list of criteria is interesting, but I am not quite sure I agree that these are general rules to apply. For example, I don't see the need to create a sibling class if you don't need all the properties of a class. You are allowed to use only the things you need; there is no requirement to use all. I think that's the approach at schema.org. What I am worried about is that we start to take on Dataset policing activities and every time someone comes with a catalogue with 'collections of data', we will go into a discussion to determine if we're OK to let that into the definition of Dataset or not. I don't think that is useful.

dr-shorthair commented 4 years ago

We definitely do not need to police this. We should just explain

'if either of the classes dcat:Dataset or dcat:DataService meets your catalog needs, then use them. If your cataloguing needs require a different combination or additional properties, then consider extending one of the existing classes, or defining another sub-class of dcat:Resource'.

Then it is up to the user.

makxdekkers commented 4 years ago

That's a good proposal. I agree.

agreiner commented 4 years ago

With that proposal, it becomes impossible to correct deficiencies in how DCAT defines a dataset. It assumes a dataset is all it should ever be. I would like to hear what @bertvannuffelen saw as a concern, whether it was just a matter of whether it was permissible to use the term dataset for software, or if he found issues in attempting to use it.

bertvannuffelen commented 4 years ago

@all, thanks for the feedback.

The reason why I posted this question is exactly to understand what the intended meaning of 'a collection of data' is, and what the expected usage of DCAT would be.

When I design models, I personally like my intuition of the human-readable definition to match the actual shared machine-readable information. Personally, 'a collection of data' is not what I would use to define 'software'. At most it is a part of what I would consider software, as the Wikipedia definition also indicates.

Secondly, I raise this question because the introduction/motivation speaks only about "Open Data Portals", not about alternative catalogues such as source code catalogues like github.com. This feels a bit ambivalent to me. As indicated by @makxdekkers, I can follow the reasoning that we cannot prohibit/police what should be in. But the introduction describes a mindset about the usage context, and that mindset does not refer to catalogues of source code, documents, pictures, catalogues, ... Because of that it is natural to conclude that DCAT (dcat:Dataset) is not intended for those things.

Moreover, personally, the fact that DCAT can help me create a catalogue of things does not mean I should call these things a dcat:Dataset. I think that is also partly the reason why dcat:Resource has emerged. So personally, I am inclined to create a new subclass of dcat:Resource for cataloguing software.

Out of the discussion, I additionally learned the following: I want to do some cool stuff with the described items in the catalogue, such as visualise, analyse, aggregate, ... What I like most about DCAT is that it allows me to aggregate catalogues at zero cost, and that the same queries executed on the individual catalogues and on the aggregated catalogue return the same answer. It turns out that aggregating catalogues is only meaningful if the items in the DCAT catalogues are of the same nature. From the discussion above, it is clear that DCAT will not facilitate that, but that the specific usage contexts should make that clear. Fine with me.
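
The aggregation point can be made concrete with a minimal sketch. It assumes catalogues are plain sets of triples, that all identifiers (`ex:census`, `ex:my-tool`) are hypothetical, and that the query is a single type pattern; for queries that join several patterns the equality below is not guaranteed in general.

```python
def datasets(triples):
    """Query: every subject typed as dcat:Dataset."""
    return {s for (s, p, o) in triples if p == "rdf:type" and o == "dcat:Dataset"}


def datasets_of_type(triples, dc_type):
    """Narrower query: dcat:Dataset subjects that also carry the given dct:type."""
    typed = {s for (s, p, o) in triples if p == "dct:type" and o == dc_type}
    return datasets(triples) & typed


# Two hypothetical catalogues, both using dcat:Dataset for their entries.
portal = {
    ("ex:census", "rdf:type", "dcat:Dataset"),
    ("ex:census", "dct:type", "dctype:Dataset"),
}
code_repo = {
    ("ex:my-tool", "rdf:type", "dcat:Dataset"),
    ("ex:my-tool", "dct:type", "dctype:Software"),
}

merged = portal | code_repo  # aggregation is a plain set union

# For this single-pattern query, querying the merged catalogue gives the same
# result as merging the answers from the individual catalogues.
assert datasets(merged) == datasets(portal) | datasets(code_repo)

# The merged result mixes items of a different nature unless a type filter is used.
print(sorted(datasets(merged)))
print(sorted(datasets_of_type(merged, "dctype:Software")))
```

The unfiltered query returns the census data and the software side by side, which is the "same nature" concern; a soft type such as dct:type is one way for a usage context to separate them again.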

bertvannuffelen commented 4 years ago

@agreiner, your comment highlights a point of attention for the future:

With that proposal, it becomes impossible to correct deficiencies in how DCAT defines a dataset. It assumes a dataset is all it should ever be.

During this discussion the difference in how people interpret 'a collection of data' has been highlighted. That indeed means that semantic (definitional) issues might occur. This is normal, but for that reason the expected usage context of DCAT should be clear. There is always vagueness in human language, but for machines there is no vagueness: a statement is true or it is false. A vocabulary should bind these two worlds as well as possible; otherwise the outcome of machine reasoning will not match the intuition of human reasoning. Let's walk the tightrope, but at least be aware of the difference.

makxdekkers commented 4 years ago

@bertvannuffelen

I have no problem if we want to make it explicit that specific types of data should be modelled as sub-classes of dcat:Resource.

What I think we can't do is formalise what you call your 'mind set' for dcat:Dataset by narrowing down the definition, as this could potentially break existing implementations. Those implementations were developed before the mechanism of subclassing dcat:Resource was available so we can't penalise them for interpreting the scope of dcat:Dataset liberally.

What I remember is that we found out in the development of DCAT 2014 that it was quite hard to define the types of collections of data that should be in scope -- for example, is it limited to numerical data in an n-dimension grid (i.e. the spreadsheet paradigm) or can you have other types of observations/data points, what about data underlying maps, what about sound snippets used in language research, image collections etc. etc. There is a large grey area in the mind set of many people, and these mind sets may not always be well aligned. So the best we could do at the time was to leave it open.

bertvannuffelen commented 4 years ago

What I think we can't do is formalise what you call your 'mind set' for dcat:Dataset by narrowing down the definition, as this could potentially break existing implementations. Those implementations were developed before the mechanism of subclassing dcat:Resource was available so we can't penalise them for interpreting the scope of dcat:Dataset liberally.

It is not about penalizing existing implementations. But at the same time, existing implementations could now reconsider that choice. Conversely, just because somebody somewhere created a catalogue of, e.g., vehicles using dcat:Dataset does not mean that DCAT must accept this as a good and desired practice. I hope that a vocabulary community has the freedom to state about implementations that the application of the vocabulary is not as intended. Unfortunately, in this case, we hit the issue that the definition of dcat:Dataset can be interpreted to be anything.
So even my last example of vehicles cannot be excluded. ;-)

What I remember is that we found out in the development of DCAT 2014 that it was quite hard to define the types of collections of data that should be in scope -- for example, is it limited to numerical data in an n-dimension grid (i.e. the spreadsheet paradigm) or can you have other types of observations/data points, what about data underlying maps, what about sound snippets used in language research, image collections etc. etc. There is a large grey area in the mind set of many people, and these mind sets may not always be well aligned. So the best we could do at the time was to leave it open.

I agree it is grey area.

makxdekkers commented 4 years ago

@bertvannuffelen

... somewhere somebody created a catalogue of e.g. vehicles using dcat:Dataset, that DCAT must accept this as a good and desired practice. I hope that a vocabulary community has the freedom to state about implementations that the application of the vocabulary is not as intended.

But now you are arguing for this working group to take on the role of Dataset Police! I don't think that we as a group are in the business of declaring good or bad practices -- we're in the business of defining a vocabulary that we think is useful for the description of all kinds of data catalogues.

I quite like @dr-shorthair's proposal which leaves it really to the user to decide. If they think DCAT meets their needs, please let them use it -- if not, they can define something else, either as an extension of DCAT by creating a local subclass of dcat:Resource, or by creating their own vocabulary outside of it.

kcoyle commented 4 years ago

While there is no possibility nor desire to do policing of other people's metadata, interoperability is facilitated through clear definitions and good examples. If some users wish to take a "whatever" view, that's their prerogative, but a common understanding is helpful for those who wish to exchange data.

I often feel that the short definitions that we adopt, while good "sound bites," should be augmented with more extensive explanations of intention. Intention does not mean enforcement, but it would be informative for potential downstream users.

agreiner commented 4 years ago

I agree that it should be up to the user to determine whether a resource is in fact a dataset, but I think we could provide some guidance that would help people understand what scope is intended. Call me an optimist, but I don't think this is an insurmountable issue. To address Makx's most relevant question, I think we stand to gain clarity in decision making if we can avoid having to generalize the vocabulary to cover every bit of web content out there. We can also avoid calling for unrealistic levels of cooperation (e.g., asking all publishers of content to do something that really only applies to all publishers of data). If our scope includes the entire web, we are doomed to try to boil the ocean. But I don't think this means we can't acknowledge the multiplicity of types of data out there.

I think it's clear that any web resource (not any thing, so not vehicles) can be treated as data, especially in the age of machine learning. There is plenty of precedent for running text being used as data, so why not software code? In the Scope section of DCAT 2, we mention several types of media and "potentially other types" of data. In my opinion, any given content type can be considered data, but that doesn't mean that the vocabulary needs to be generalized to cover all possible instances of the various things on the web, whether they are intended to be used as data or not.

The difference, in my mind, is intent. If a thing is published with the intention of making it available for mathematical analysis, then it is data, and collections that include things of its type should be describable with DCAT. If a thing is published online without that intention, then there is no need for it to be describable with DCAT. Webster's 3rd provides a useful definition: "A magnitude, figure, or relation supposed to be given, drawn, or known in a mathematical investigation from which other magnitudes, figures, or relations are to be deduced." That is all about intent.

andrea-perego commented 4 years ago

We stumbled upon this issue when defining mappings from ISO 19115 and DataCite to DCAT-AP, since both ISO 19115 and DataCite support different resource types. The adopted solution is described in UC20:

https://www.w3.org/TR/dcat-ucr/#ID20

Basically, the approach was to use dcat:Dataset whenever possible, in its broader sense, but also to use soft typing to specify the "type" of dataset, by reusing the relevant classes in DCMI Terms.

This approach is elaborated in Section 6.1 of the specification documenting the mappings from DataCite to DCAT-AP:

https://ec-jrc.github.io/datacite-to-dcat-ap/#alignment-issues-resource-types

bertvannuffelen commented 4 years ago

@makxdekkers

... somewhere somebody created a catalogue of e.g. vehicles using dcat:Dataset, that DCAT must accept this as a good and desired practice. I hope that a vocabulary community has the freedom to state about implementations that the application of the vocabulary is not as intended.

But now you are arguing for this working group to take on the role of Dataset Police!

I think you misread my comment. It was not about policing; it is about the DCAT vocabulary group expressing the intent of DCAT as clearly as possible. In that exercise, past/existing implementations can be an inspiration but, in my personal view, not every implementation has to fit the intent at all costs. It is really fine for me that an existing implementation might fall (partially) outside the intended scope definition, if that leads to a more coherent story for DCAT.

I don't think that we as a group are in the business of declaring good or bad practices -- we're in the business of defining a vocabulary that we think is useful for the description of all kinds of data catalogues.

If that is the intent, namely to cover all data catalogues, please express it in an introduction that states so explicitly. Open Data Portals are a specific kind of data catalogue in that sense, so it is better not to use that usage context as the sole motivation. Maybe for you 'a collection of data' intuitively means any 'digital object', but that is not the case for everyone.

I hope I made it clear that it is for me not about policing, but about ensuring that the intuitive reading of the specification leads to an intuitive usage.

makxdekkers commented 4 years ago

@bertvannuffelen I understand what you're proposing. It's just that I don't think we should be in the business of telling people what to do.

My main worry is that if we describe what this group -- consisting of a very small set of stakeholders -- sees as the intention of the vocabulary, there might be people that have data collections that could benefit from DCAT, who might read that intention and decide to develop their own vocabulary.

So it depends on perspective:

  1. If you don't want people to use DCAT for things that you think are not in its scope, you try to define the scope more precisely; in doing so, you encourage those people to go away and develop something else.
  2. If, on the other hand, you don't want people who could use DCAT for their data collection to be put off by a narrow scope, you define the scope liberally; that way you encourage as many people as possible to use DCAT and avoid the proliferation of vocabularies.

I am definitely in the second camp. In my opinion, that creates more interoperability, not less, because all the various types of data collections would use the same vocabulary -- probably with extensions and profiling -- and might benefit from the same catalogue management software and processes.

If I understand correctly, this was the approach mentioned by @andrea-perego. To me that makes more sense than narrowing the scope.

dr-shorthair commented 4 years ago

I think we can lean towards functional definitions. If the properties satisfy your requirements, then the class will do, regardless of what it is called.

bertvannuffelen commented 4 years ago

@makxdekkers

I do not think you really understand me:

My stand is that if DCAT is intended to be broadly applicable, fine with me.

My suggestion is then to loosen the connection between DCAT and Open Data portals. Open Data portals come with their own view on what a dataset and a catalogue are, which, according to you, should then be one specific application of DCAT.

In the introduction, the first paragraph ends with:

which was originally developed in the context of government data catalogs such as data.gov and data.gov.uk, but it is also applicable and has been used in other contexts.

We could add here an example of e.g. a picture catalogue, or a software repository.

For the definition of dataset: a usage note could make clear that a 'collection of data' is more than a dataset in an Open Data portal. And add examples such as pictures, software etc.

makxdekkers commented 4 years ago

@bertvannuffelen OK, now I understand. I have no problem adding examples as you suggest.

dr-shorthair commented 4 years ago

Good discussion folks. Though it seems we went the long way round to get to a simple solution i.e. more examples, including software and image catalogs!

kcoyle commented 4 years ago

If you do add examples I would hope that they would be "real" examples. If there is a picture catalog using DCAT then it could be listed, but not if none exist. I say this because I am skeptical about that particular use as I know several such catalogs and they would not easily make use of DCAT. So as long as the examples are of uses that have been proven then of course they should be added. Speculating about possible uses, unless fully researched, could be misleading.

heidivanparys commented 4 years ago

Can a software solution be considered as a dcat:Dataset?

I guess it depends on the definition of "data" used.

As mentioned earlier by @aidig, according to DCMI Metadata Terms and DataCite Metadata Schema 4.3, software ("A computer program in source or compiled form.") is not considered to be a dataset ("Data encoded in a defined structure.").

I would like to add that according to the Information Artifact Ontology, software is not considered to be a dataset either.

Some definitions are listed below.

In short, the interesting point here is that data items are defined as "intended to be truthful statements about something", so the notion is narrower than just any resource.

If DCAT is supposed to be used for describing "anything", then the meaning of dcat:Dataset seems to become as broad as the meaning of what others would call an "information resource", defined as:

[CBED] Cambridge University Press: information resource. Cambridge Business English Dictionary (2011), https://dictionary.cambridge.org/dictionary/english/information-resource
[Hay] Hay, David C.: Chapter 10: Documents and other Information Resources. In: Enterprise Model Patterns: Describing the World (UML version) (2010)
[ISO 5127:2017] ISO/TC 46: ISO 5127:2017 Information and documentation — Foundation and vocabulary. International Standard (2017), https://www.iso.org/obp/ui/#iso:std:iso:5127:ed-2:v1:en

information content entity
A generically dependent continuant that is about some thing.
data item
a data item is an information content entity that is intended to be a truthful statement about something (modulo, e.g., measurement precision or other systematic errors) and is constructed/acquired by a method which reliably tends to produce (approximately) truthful statements.
data set
A data item that is an aggregate of other data items of the same type that have something in common. Averages and distributions can be determined for data sets.
directive information entity
An information content entity whose concretizations indicate to their bearer how to realize them in a process.
plan specification
A directive information entity with action specifications and objective specifications as parts that, when concretized, is realized in a process in which the bearer tries to achieve the objectives by taking the actions specified.
software
Software is a plan specification composed of a series of instructions that can be interpreted by or directly executed by a processing unit.

heidivanparys commented 4 years ago

An addition to my comment above: according to schema.org, a software application is not a dataset (but both are "creative works").

andrea-perego commented 4 years ago

Thanks for this comparison with related specifications, @heidivanparys .

I can assure you they have all been taken into account by the DXWG while working on DCAT2. However, the point is that the notion of dcat:Dataset was indeed very broad since DCAT was first published, and it may well correspond to the one of "information resource". This is how it has been implemented since 2014, which is one of the reasons why the decision of the DXWG was not to narrow down its scope.

However, the general issue is - as @makxdekkers said earlier in this thread - that there is no agreed definition of "dataset" across communities and domains. Since DCAT is meant to be domain-independent in order to support metadata interoperability, the notion of "dataset" should necessarily be broad.

Note that this does not mean that you MUST use dcat:Dataset for anything. If for your purposes, community, catalogue, etc. it is important (or you just prefer) to use a different class for software, images, etc., there's nothing preventing you from doing that.

This is also what was done in the work I mentioned earlier about the mappings of ISO 19115 and DataCite to DCAT-AP - see https://github.com/w3c/dxwg/issues/1221#issuecomment-596153171 . In those cases, however, resources were specified ALSO as dcat:Datasets (when possible), in view of the sharing and re-use of metadata records across catalogues and domains.

riccardoAlbertoni commented 4 years ago

@bertvannuffelen wrote:

For the definition of dataset: a usage note could make clear that a 'collection of data' is more than a dataset in an Open Data portal. And add examples such as pictures, software etc.

I have noticed that section 5.1 already gives quite a wide range of examples, including pictures as collections of pixels. So in the spirit of @bertvannuffelen's suggestion, I would reinforce the message by adding the sentence in bold in section 5.1 and adding a usage note along the same lines:

In section 5.1:

dcat:Dataset represents a dataset. A dataset is a collection of data, published or curated by a single agent. The notion of dataset in DCAT is broad and inclusive, as DCAT aims at accommodating all the definitions arising from domain-specific communities. Data comes in many forms including numbers, words, pixels, imagery, sound and other multi-media, and potentially other types, any of which might be collected into a dataset.

As usage note for dcat:Dataset: (I have dropped "A dataset is a collection of data, published or curated by a single agent. " as it is already included in the dcat:Dataset definition.)

The notion of dataset in DCAT is broad and inclusive, as DCAT aims at accommodating all the definitions arising from domain-specific communities. Data comes in many forms including numbers, words, pixels, imagery, sound and other multi-media, and potentially other types, any of which might be collected into a dataset.

@dr-shorthair, @andrea-perego, @kcoyle, @makxdekkers and all, would this work for you? Feel free to re-edit the proposals as you wish.

makxdekkers commented 4 years ago

Looks good to me, with one minor suggestion: replace "words" by "text" -- not all languages have a notion of "word".

dr-shorthair commented 4 years ago

In section 5.1:

dcat:Dataset represents a dataset. A dataset is a collection of data, published or curated by a single agent or identifiable community. The notion of 'dataset' in DCAT is broad and inclusive, with the intention of accommodating resource types arising from all communities. Data comes in many forms including numbers, text, pixels, imagery, sound and other multi-media, code, software, and potentially other types, any of which might be collected into a dataset.

??

riccardoAlbertoni commented 4 years ago

I agree that code and software might be modeled as datasets in some contexts. However, I would not encourage explicitly mentioning them in the above sentence, as it might look confusing and inconsistent with the resource subclassing (explained in the previous item and at the end of section 5.1).

So, in the third item of section 5.1, I would prefer:

dcat:Dataset represents a dataset. A dataset is a collection of data, published or curated by a single agent or identifiable community. The notion of 'dataset' in DCAT is broad and inclusive, with the intention of accommodating resource types arising from all communities. Data comes in many forms including numbers, text, pixels, imagery, sound and other multi-media, and potentially other types, any of which might be collected into a dataset.

If we can't avoid mentioning code and software, we might consider rephrasing some of the discussion provided by @andrea-perego in a previous post. For example, adding code and software as extreme examples in the discussion after the first note in section 5.1.:

A dataset in DCAT is defined as a "collection of data, published or curated by a single agent or identifiable community, and available for access or download in one or more serializations or formats". Since DCAT is meant to be domain-independent to support metadata interoperability, the notion of "dataset" should necessarily be broad. In specific contexts, catalogs can treat as datasets resources that are not usually considered so. For example, code and software might be regarded as datasets in code and software catalogs. Note that this does not mean that dcat:Dataset MUST be used for anything. Adopters, communities and catalogs can borrow third-party classes or define distinct classes to model specific types of entities.

agreiner commented 4 years ago

I think it's okay to mention code, but I would not use the term software in addition. To the extent that they are different things, code is the data-like form of software. The code is what one would use for analysis.

akuckartz commented 4 years ago

@agreiner wrote

... code is the data-like form of software. The code is what one would use for analysis.

Why restrict the purpose to "analysis" ?

agreiner commented 4 years ago

Because that is the actual goal of data. I mean analysis in the broad sense of attempting to gain understanding from it. That would include things like looking at a visualization, directly inspecting a datum in csv, or using a computational system to perform a search. There are plenty of things that happen in the lifecycle of data, but they all have the goal of some sort of analysis in the end. That need drives the transitions that happen in the life cycle that make it suitable for use as data rather than, say, software or music. The nearest thing to being an exception that I can think of is data art, but even then, what makes it data art rather than some other kind of digital art is that it uses ones and zeroes that have been organized as they would be for analysis, and the result is something that can provide understanding in more ways than one.

kcoyle commented 4 years ago

Again, I would discourage using as examples any data type for which there is no existing or proven DCAT use. Until someone finds it actually useful in practice you only have speculation, and that speculation could misinform someone coming to DCAT for the first time. People can decide for themselves if DCAT meets their needs without misdirection. When there are examples of a variety of uses, those can become part of the DCAT documentation.

I support @agreiner 's approach, as someone who has about 40 years experience with catalogs, none of which I would consider to be catalogs of datasets even though there are many digital materials in those catalogs. Dataset is really distinct from other digital resources in my world, and we do manage digital and physical resources, including datasets, digital art, electronic texts, etc., in our catalogs. I don't actually know of any community that lumps together all digital resources as "datasets" and I wouldn't consider it useful to imply such a broad definition.

riccardoAlbertoni commented 4 years ago

@kcoyle : I concur with your idea that we should not be too speculative here. In the spirit of accommodating the different legitimate positions, and provided that we do not mention code and software explicitly, would the following sentence (the same one proposed in the previous post https://github.com/w3c/dxwg/issues/1221#issuecomment-600023334) work for you? Note that the part not in bold was already included in DCAT 2 sec 5.1 third item.

dcat:Dataset represents a dataset. A dataset is a collection of data, published or curated by a single agent or identifiable community. The notion of 'dataset' in DCAT is broad and inclusive, with the intention of accommodating resource types arising from all communities. Data comes in many forms including numbers, text, pixels, imagery, sound and other multi-media, and potentially other types, any of which might be collected into a dataset.

dr-shorthair commented 4 years ago

Yes, I agree. My proposal above overreached by adding the code and software examples. Particularly as this definition - from Section 5.1 - is immediately preceded by the definition of dcat:Resource which is the general case.

I'm wondering if/where we should elaborate the explanation that if you want to build a catalogue of something that is not a Dataset or a DataService, then you could create a sibling sub-class of dcat:Resource - e.g. software, samples, museum specimens ... Or should we just leave the specific examples implied by what is said in https://www.w3.org/TR/vocab-dcat-2/#Class:Resource and https://www.w3.org/TR/vocab-dcat-2/#dcat-scope

kcoyle commented 4 years ago

@riccardoAlbertoni I like your definition although the last sentence still slips over into some data types that I think are a bit suspect. Can we find a way to say it without naming types that we aren't sure about? Datasets surely must be digital in nature (ones and zeroes). They also are "sets" ("collections of data"). Saying that may be sufficient.

@dr-shorthair That also relates to your suggestion of "e.g. software, samples, museum specimens" - the latter would have to be in a digital form, such as a scanned image. But I am unsure if you mean that you see DCAT being used to catalog individual images as datasets. Is that what you (and others) are implying?

dr-shorthair commented 4 years ago

@kcoyle not sure I agree. A sub-class of dcat:Resource for real-world things (for example, ex:Specimen) might add additional descriptors related to things - e.g. physical dimensions. In fact the individual record of type ex:Specimen might serve as the landing page for something that has no other digital or web presence. In this way, a museum catalog (for example) could be modeled on the DCAT catalog - not a dataset-catalog of course, but a specimen-catalog.
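The sub-classing idea can be sketched in a few lines of Python. This is only an illustration, with RDF statements written as plain tuples instead of using an RDF library. Apart from the class name ex:Specimen, everything in the `ex:` namespace (the record identifier, the catalog identifier and the `ex:heightInCm` property) is a hypothetical name invented for the example; `dcterms:hasPart` is used as the generic "listed in the catalog" link.

```python
# Statements are (subject, predicate, object) tuples; no RDF library is needed.
TRIPLES = {
    # Existing DCAT classes and a hypothetical sibling class for physical specimens.
    ("dcat:Dataset", "rdfs:subClassOf", "dcat:Resource"),
    ("dcat:DataService", "rdfs:subClassOf", "dcat:Resource"),
    ("ex:Specimen", "rdfs:subClassOf", "dcat:Resource"),
    # One hypothetical specimen record with a physical-dimension descriptor.
    ("ex:specimen-001", "rdf:type", "ex:Specimen"),
    ("ex:specimen-001", "ex:heightInCm", "12.5"),
    # A hypothetical specimen catalog that lists the record.
    ("ex:specimen-catalog", "rdf:type", "dcat:Catalog"),
    ("ex:specimen-catalog", "dcterms:hasPart", "ex:specimen-001"),
}


def superclasses(cls, triples):
    """Return cls together with every class reachable via rdfs:subClassOf."""
    seen, todo = set(), [cls]
    while todo:
        current = todo.pop()
        if current in seen:
            continue
        seen.add(current)
        todo.extend(
            o for s, p, o in triples
            if s == current and p == "rdfs:subClassOf"
        )
    return seen


def is_instance_of(item, cls, triples):
    """True if item has an rdf:type that is cls or a sub-class of cls."""
    direct_types = [o for s, p, o in triples if s == item and p == "rdf:type"]
    return any(cls in superclasses(t, triples) for t in direct_types)


print(is_instance_of("ex:specimen-001", "dcat:Resource", TRIPLES))
print(is_instance_of("ex:specimen-001", "dcat:Dataset", TRIPLES))
```

Under these assumptions the specimen record counts as a dcat:Resource, so it can sit in a catalog, without being typed as a dcat:Dataset.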

pwin commented 4 years ago

I agree with Simon (@dr-shorthair). I see DCAT as more about a catalog than about the nature of what is being catalogued.

There's discovery metadata and conformity constraints (again, to help with subsetting or accurate placement of the catalog contents in mind) in the model, and this is all about cataloguing.

kcoyle commented 4 years ago

So here's what I think needs to be made clear:

There are catalogs of metadata for things in the real world - a library catalog is an example of this. A catalog of machine parts or an office supply inventory is also such an example. In this case there is no dcat:Distribution because there is no digital file to point to. (Note that a library catalog is often today a mix of descriptions of non-digital and digital objects.)

There are catalogs of digital "things", like a catalog of scanned images. In this case there is a dcat:Distribution: the scanned thing. There may also be a metadata "record" describing the scanned thing.

I would like to see those modeled in DCAT because I think that getting this discussion more "real" matters. Let's show our work. I can provide examples. For the first case:

https://catalog.loc.gov/vwebv/holdingsInfo?searchId=27861&recCount=25&recPointer=2&bibId=4749563

And for the second case: https://www.loc.gov/item/afc1941005_ms028/

Don't worry about coding these precisely, just mock up what data would be where in the DCAT model using any pseudo-code or diagram of your choosing.

I am wondering if one would use the class dcat:Dataset for these resources, and if so then I would say that the definition could be confusing:

"A collection of data, published or curated by a single agent, and available for access or download in one or more representations."

since most people are not going to consider a single file a "collection of data". A single file is not a dataset by this definition. Now you might say that a catalog represents a collection of data, which works, IMO. But dcat:Dataset does not represent the catalog, it is at the logical level of the single "thing" - the distribution. Which is where we started with all of this - Is "dataset" anything digital? If so, the definition of dataset needs to change, to remove "collection", if nothing else.
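One very rough way to mock up the two cases is shown below in Python, using plain dictionaries in place of real RDF. This is a sketch under stated assumptions, not a proposed modeling: the class `ex:PhysicalItem` and the download URL are hypothetical placeholders, and typing the digital item as dcat:Dataset is exactly the point under discussion, not a settled answer. Only the two loc.gov record URLs come from the examples above.

```python
# Case 1: a catalog record for a physical item. There is no digital file,
# so the list of distributions is empty.
physical_item = {
    "@id": "https://catalog.loc.gov/vwebv/holdingsInfo?searchId=27861"
           "&recCount=25&recPointer=2&bibId=4749563",
    "rdf:type": "ex:PhysicalItem",  # hypothetical class, not part of DCAT
    "dcat:distribution": [],
}

# Case 2: a catalog record for a digital item. The scanned file is described
# by a dcat:Distribution.
digital_item = {
    "@id": "https://www.loc.gov/item/afc1941005_ms028/",
    "rdf:type": "dcat:Dataset",  # assumption: whether this fits is the open question
    "dcat:distribution": [
        {
            "rdf:type": "dcat:Distribution",
            # Placeholder URL; the real file location is not given in the thread.
            "dcat:downloadURL": "https://example.org/placeholder-scan",
        }
    ],
}


def has_distribution(record):
    """True if the record points to at least one dcat:Distribution."""
    return len(record.get("dcat:distribution", [])) > 0


for record in (physical_item, digital_item):
    print(record["rdf:type"], has_distribution(record))
```

The only structural difference between the two records in this sketch is whether a distribution is present, which is the distinction drawn between the two kinds of catalog.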

makxdekkers commented 4 years ago

To be honest, I find a discussion on how you could model catalogues of things with DCAT a bit theoretical. Do we really expect people to start repurposing the Data Catalog vocabulary for such catalogues? Maybe, but is that something that is urgent to consider at this point in time?

I do not understand @kcoyle when she writes that most people are not going to consider a single file a "collection of data". As far as I understand, the whole concept of a dcat:Dataset is that it, in its most simple form, is associated with a single file, described by dcat:Distribution. The file contains pieces of data (e.g. observations) and is accessible at dcat:accessURL or dcat:downloadURL. How is such a file not a collection of data?

kcoyle commented 4 years ago

Sorry, @makxdekkers, I didn't say that right. What I mean is a single digital resource - a digital photograph, an mp3 music file - each being the digital representation of a single "thing".

And I agree with your first paragraph - what is the need to attempt to redefine DCAT for other types of catalogs? If people find it useful then alternative uses will arise and can be evaluated (or admired) at that moment in time.

makxdekkers commented 4 years ago

@kcoyle I understand your opinion about a file with a single 'thing', but I think it is really hard to declare something a single thing -- it really depends on your perspective. On the extreme end of things, I guess we can agree that a single number (like "26") is not a dataset, but beyond that things get more difficult.

The example of an MP3 file, in my mind, is the wrong perspective. What is important to me is what the MP3 file is a distribution of. For example, there could be a set of oral histories or a podcast with several items that are distributed as an MP3 file. And an image could contain a visualization (for example this one) that could be a graphic representation of a 'collection of data', which is distributed as a PNG file. From one perspective, you could say it is a 'single thing', but someone else could have a completely different perspective in which it is really a 'collection of data'.

So, again, I think that trying to pin down what is a dataset and what is not, is something that we could argue about ad infinitum, and I think we should not try. I keep coming back to @dr-shorthair's "'if either of the classes dcat:Dataset or dcat:DataService meet your catalog needs, then use them. If your cataloguing needs requires a different combination or additional properties, then consider extending one of the existing classes, or defining another sub-class of dcat:Resource'."

heidivanparys commented 4 years ago

Datasets surely must be digital in nature (ones and zeroes)

I don't agree with that. See e.g. some historical examples described at https://en.wikipedia.org/wiki/Census and https://en.wikipedia.org/wiki/Census_in_Egypt :

So the result of a census is a dataset, isn't it? The term wasn't used back then, but in today's language it would be called a dataset. It contains data on the number of men and/or number of inhabitants, possibly registered per administrative unit, and the data allow certain types of analysis. Taxes can be calculated etc.

Another example: I could write down in my analogue bullet journal or diary on what days I went for a 5 km run and also keep track of the amount of time in which I completed that run. To me, that is a dataset as well. I could also take a scan of that sheet of paper (or sheets of paper) and send that to my running coach as a PDF-document. And I could, at some time, enter those data in a spreadsheet and import into some smart application that can create nice diagrams etc. So, one dataset, different representations of it.
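The "one dataset, different representations" idea can be illustrated with a small Python sketch. All identifiers in the `ex:` namespace are hypothetical, and the choice of CSV for the spreadsheet is an assumption made for the example. Media types are written as plain strings for brevity. The paper journal itself is left out of the distribution list here, since how to describe an analogue original is not settled in this discussion.

```python
# One dataset (the running log) with two digital representations of it.
running_log = {
    "@id": "ex:running-log",  # hypothetical identifier
    "rdf:type": "dcat:Dataset",
    "dcat:distribution": [
        # Scan of the handwritten pages, sent as a PDF document.
        {"@id": "ex:running-log-scan", "dcat:mediaType": "application/pdf"},
        # The same data typed into a spreadsheet (CSV is an assumption).
        {"@id": "ex:running-log-sheet", "dcat:mediaType": "text/csv"},
    ],
}


def media_types(dataset):
    """Return the sorted media types of all distributions of a dataset."""
    return sorted(d["dcat:mediaType"] for d in dataset["dcat:distribution"])


print(media_types(running_log))
```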

So narrowing the dataset definition down to "digital" would be undesirable IMO (just as "published or curated by a single agent" is narrowing both resource and dataset down in an undesirable way).