I should have written
<resource1> , <resource2> , <resource3> , <resource4> , <resource5> , <resource6> , <resource7>
though in practice they usually are files in repositories.
In strict DCAT terms:
- <file1> , <file2> are probably better modelled as other individual dcat:Datasets, so their descriptions should have URIs in the context of a catalog
- <file3> , <file4> are probably dcat:Distributions, so the descriptions would often be blank nodes, with a downloadURL or accessURL to the actual file
- <file5> is probably another dcat:Dataset, though preferably to an online resource (standard schema!)
- <file6> should probably be another dcat:Dataset
- <file7> is a document stored as part of the package. Again, strictly another dcat:Dataset somewhere.
But this is all idealized. The point is that most repositories do not require the depositor to make such distinctions, and as long as manually-completed forms are involved there will be resistance or non-compliance from the kind of data depositors that I have in mind (researchers). There might be some heuristics that could be applied, and future automation will help. But my proposal is that with the addition of just one axiom we might accommodate the present reality in a way that improves on current habits - where in the absence of something better, everything in a bag of files is often linked to the dataset using dcat:distribution - which I think we all agree is wrong.
As this discussion moved from the mailing list to this issue, for completeness I'm adding the other messages from the mailing list in this thread.
@makxdekkers said:
Simon,
This is indeed an issue that came up in the development of DCAT-AP. In particular, CKAN is quite liberal in what it accepts as "Resource" related to a Dataset. The discussion was whether you could map CKAN Resource to DCAT Distribution, and it was clear that such mapping would have unwanted effects. This is also related to my earlier question on how "similar" distributions need to be, which led to a statement that they need to be "informationally equivalent" (https://github.com/w3c/dxwg/issues/52).
I support your proposed solution to use dct:relation as a catch-all and to allow for further specialisation whenever necessary and possible.
Makx.
and @andrea-perego said:
Makx, Simon,
In the extension of DCAT-AP we use in the JRC Data Catalogue, besides distributions we typically have (a) related publications and (b) "other resources" (a catch-all category including everything that is not a distribution or a publication). As I said elsewhere [1,2], related publications are specified via dct:isReferencedBy, whereas "other resources" are specified with dct:relation (used as a generic relationship to link a dataset with any kind of related resource). So, this use case may support the idea of making dcat:distribution a subproperty of dct:relation.
BTW, this pattern is reflected in our CKAN extension – see, e.g.:
http://data.jrc.ec.europa.eu/dataset/jrc-predict-predict2017-core
About the fact that the majority of data catalogues use a simple metadata pattern, this is also my experience. Hierarchical "is part of" relationships are far from being common. There may be a number of reasons. For instance, if metadata are manually created (as is still usually the case) this would require a high maintenance effort. Also in the geospatial domain, where there's explicitly this notion ("dataset series"), what is documented is just the "root" dataset, and the children are not even linked to. Another issue may be related to limitations of catalogue platforms – which typically do not support this feature – or to the usability issues resulting from giving users the burden of choosing among a long list of datasets which are almost identical but for some variables (e.g., spatial and/or temporal coverage).
It is also worth noting that the approach used for specifying hierarchical relationships depends very much on the domain and on specific characteristics of a dataset. We have to deal quite often with this issue in the JRC Data Catalogue, and the approaches used are very different – e.g.: 1 dataset with a distribution for each of its children; 1 dataset for each child dataset, and no record for the parent.
So, probably, we should take into account this situation when providing recommendations on how to model hierarchical/subsetting relationships, and propose alternative options, depending on the specific use case.
Cheers,
Andrea
[1] https://www.w3.org/TR/dcat-ucr/#ID9 [2] https://github.com/w3c/dxwg/issues/63#issuecomment-362108520
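The suggestion in this message, making dcat:distribution a subproperty of dct:relation, amounts to a single axiom. The following is only a sketch of that idea, not text from the DCAT specification; the prefix bindings are the conventional ones and are assumed here.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .

# Proposed axiom (a sketch of the idea in this thread, not normative DCAT):
# every dcat:distribution link is also a dcterms:relation link.
dcat:distribution rdfs:subPropertyOf dcterms:relation .
```

Under RDFS entailment, a consumer that only looks for dcterms:relation would then also see resources that were linked with dcat:distribution.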
@dr-shorthair I see. After giving it some thought, I also quite like the idea of a dcat:distribution being just one of the possible dct:relations.
Still, my main concern is that accommodating this kind of loose description adds complexity for consumers of such data (both people and applications such as data catalogs), in the sense that some DCAT records will be described only by a dcat:Dataset with a bunch of dct:relation resources, others will have proper dcat:distributions, and the consumers will have to account for all these possibilities and maybe more. The benefit is that some publishers who currently use dcat:distribution incorrectly may use dct:relation instead.
In the end, it all comes down to whether we should accommodate existing behavior where datasets are clearly not described well enough (for various reasons), or encourage describing them properly. Maybe this could be done by at least strongly recommending to stick to the Dataset -> Distribution -> File or Dataset -> Data Distribution Service pattern.
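For reference, the Dataset -> Distribution -> File pattern mentioned above might look like the following minimal sketch. All IRIs under example.org and both titles are hypothetical and serve only as illustration.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

# Hypothetical IRIs under example.org, for illustration only.
ex:dataset-1
    a dcat:Dataset ;
    dcterms:title "Example dataset" ;
    dcat:distribution ex:dataset-1-csv .

# The distribution describes one available form of the dataset
# and points to the actual file.
ex:dataset-1-csv
    a dcat:Distribution ;
    dcterms:title "CSV distribution of the example dataset" ;
    dcat:downloadURL <http://example.org/files/dataset-1.csv> ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> .
```

Giving the distribution its own IRI, rather than a blank node, is a choice made for this sketch; the catalogs discussed in this thread often use blank nodes instead.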
This discussion relates to proposed use-case ID53 - #256
Following up on ACTION assigned in this week's DCAT meeting.
Example 1 - undifferentiated set of files, each of which is linked to the dcat:Dataset using dcterms:relation whose object is a blank node:
dap:d33937
rdf:type dcat:Dataset ;
dcterms:accessRights [
rdf:type dcterms:RightsStatement ;
rdfs:comment "The metadata and files (if any) are available to the public." ;
] ;
dcterms:bibliographicCitation "Cox, Simon (2018): RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale). v1. CSIRO. Data Collection. https://doi.org/10.25919/5b42a082052fa" ;
dcterms:description "A set of RDF graphs representing the International [Chrono]stratigraphic Chart, ..." ;
dcterms:identifier "https://doi.org/10.25919/5b4d2b83cbf2d"^^xsd:anyURI ;
dcterms:issued "2018-07-17"^^xsd:date ;
dcterms:language [
skos:notation "en" ;
] ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcterms:relation <http://resource.geosciml.org/classifier/ics/ischart/> ;
dcterms:relation <http://resource.geosciml.org/ontology/timescale/gts> ;
dcterms:relation <http://stratigraphy.org/> ;
dcterms:relation <https://vocabs.ands.org.au/viewById/196> ;
dcterms:relation [ dcterms:identifier "ChronostratChart2017-02.pdf" ; ] ;
dcterms:relation [ dcterms:identifier "ChronostratChart2017-02.jpg" ; ] ;
dcterms:relation [ dcterms:identifier "isc2017.jsonld" ; ] ;
dcterms:relation [ dcterms:identifier "isc2017.nt" ; ] ;
dcterms:relation [ dcterms:identifier "isc2017.rdf" ; ] ;
dcterms:relation [ dcterms:identifier "isc2017.ttl" ; ] ;
dcterms:relation [ dcterms:identifier "timescale.zip" ; ] ;
dcterms:rights [
rdf:type dcterms:RightsStatement ;
rdfs:comment "All Rights (including copyright) CSIRO 2018." ;
] ;
dcterms:title "RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
dcat:contactPoint <https://people.csiro.au/C/S/Simon-Cox> ;
dcat:keyword "GeoSPARQL" , "OWL" , "OWL-Time" , "RDF" , "SOSA" , "SSN" , "geologic timescale" , "reference system" , "stratigraphy" , "vocabulary" ;
dcat:landingPage <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/information-engineering-and-theory> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/interorganisational-information-systems-and-web-services> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/stratigraphy> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/web-technologies> ;
.
Example 2 - The same dataset with the 'files' linked using more precise semantics - four of the files are representations of the data, one is a copy of the source data, one is a zip archive containing the schema/ontology definitions:
dap:d33937
rdf:type dcat:Dataset ;
dcterms:accessRights [
rdf:type dcterms:RightsStatement ;
rdfs:comment "The metadata and files (if any) are available to the public." ;
] ;
dcterms:bibliographicCitation "Cox, Simon (2018): RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale). v1. CSIRO. Data Collection. https://doi.org/10.25919/5b42a082052fa" ;
dcterms:description "A set of RDF graphs representing the International [Chrono]stratigraphic Chart, ..." ;
dcterms:identifier "https://doi.org/10.25919/5b4d2b83cbf2d"^^xsd:anyURI ;
dcterms:issued "2018-07-17"^^xsd:date ;
dcterms:language [
skos:notation "en" ;
] ;
dcterms:relation <http://resource.geosciml.org/classifier/ics/ischart/> ;
dcterms:relation <http://resource.geosciml.org/ontology/timescale/gts> ;
dcterms:relation <http://stratigraphy.org/> ;
dcterms:relation <https://vocabs.ands.org.au/viewById/196> ;
dcterms:isFormatOf [
rdf:type dcat:Dataset ;
dcterms:source <http://stratigraphy.org/index.php/ics-chart-timescale> ;
dcterms:title "Graphical representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
dcterms:type dctype:Image ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "ChronostratChart2017-02.jpg" ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:byteSize "1629104"^^xsd:decimal ;
dcat:mediaType <https://www.iana.org/assignments/media-types/image/jpeg> ;
] ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "ChronostratChart2017-02.pdf" ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:byteSize "296233"^^xsd:decimal ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/pdf> ;
] ;
] ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "isc2017.jsonld" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:byteSize "698039"^^xsd:decimal ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/ld+json> ;
] ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "isc2017.nt" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:byteSize "2047874"^^xsd:decimal ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/n-triples> ;
] ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "isc2017.rdf" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:byteSize "1600569"^^xsd:decimal ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/rdf+xml> ;
] ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "isc2017.ttl" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:byteSize "531703"^^xsd:decimal ;
dcat:mediaType <https://www.iana.org/assignments/media-types/text/turtle> ;
] ;
dcterms:references [
rdf:type dcat:Dataset ;
dcterms:title "Geological timescale ontology" ;
dcterms:type owl:Ontology ;
dcat:distribution [
dcterms:identifier "timescale.zip" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip> ;
] ;
] ;
dcterms:rights [
rdf:type dcterms:RightsStatement ;
rdfs:comment "All Rights (including copyright) CSIRO 2018." ;
] ;
dcterms:title "RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
dcat:contactPoint <https://people.csiro.au/C/S/Simon-Cox> ;
dcat:keyword "GeoSPARQL" , "OWL" , "OWL-Time" , "RDF" , "SOSA" , "SSN" , "geologic timescale" , "reference system" , "stratigraphy" , "vocabulary" ;
dcat:landingPage <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/information-engineering-and-theory> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/interorganisational-information-systems-and-web-services> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/stratigraphy> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/web-technologies> ;
.
... and this example is where the set of files are actually representations of parts of the dataset:
First, just using dct:relation
dap:atnf-P366-2003SEPT
rdf:type dcat:Dataset ;
dcterms:accessRights [
rdf:type dcterms:RightsStatement ;
rdfs:comment "The metadata and files (if any) are available to the public." ;
] ;
dcterms:bibliographicCitation "Burgay, M; McLaughlin, M; Kramer, M; Lyne, A; Joshi, B; Pearce, G; D'Amico, N; Possenti, A; Manchester, R; Camilo, F (2017): Parkes observations for project P366 semester 2003SEPT. v1. CSIRO. Data Collection. https://doi.org/10.4225/08/598dc08d07bb7" ;
dcterms:description "Parkes multibeam high-latitude pulsar survey" ;
dcterms:identifier "https://doi.org/10.4225/08/598dc08d07bb7"^^xsd:anyURI ;
dcterms:identifier "ivo://au.csiro.atnf/P366-2003SEPT"^^xsd:anyURI ;
dcterms:identifier [
rdf:type adms:Identifier ;
dcterms:creator <https://www.doi.org/> ;
skos:notation "10.4225/08/598dc08d07bb7" ;
adms:schemeAgency "International DOI Foundation" ;
] ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcterms:modified "2017-07-30T08:55:55Z"^^xsd:dateTime ;
dcterms:relation [ dcterms:identifier "PH0090_0011.sf" ; ] ;
dcterms:relation [ dcterms:identifier "PH0090_0021.sf" ; ] ;
dcterms:relation [ dcterms:identifier "PH0090_0031.sf" ; ] ;
dcterms:rights [
rdf:type dcterms:RightsStatement ;
rdfs:comment "All Rights (including copyright) CSIRO 2017." ;
] ;
dcterms:temporal [
rdf:type dcterms:PeriodOfTime ;
rdf:type time:ProperInterval ;
time:hasBeginning [
rdf:type time:Instant ;
time:inXSDDate "2003-09-01"^^xsd:date ;
] ;
time:hasEnd [
rdf:type time:Instant ;
time:inXSDDate "2003-12-31"^^xsd:date ;
] ;
] ;
dcterms:title "Parkes observations for project P366 semester 2003SEPT" ;
dcat:contactPoint [
rdf:type v:Individual ;
v:fn "Marta Burgay" ;
v:hasEmail <mailto:burgay@oa-cagliari.inaf.it> ;
] ;
dcat:keyword "pulsar" ;
dcat:landingPage <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/astronomical-and-space-sciences-not-elsewhere-classified> ;
.
And using more precise semantics. Since the files are each a representation of part of the dataset, they are described as distributions of (anonymous) datasets which are linked using the dct:hasPart relationship:
dap:atnf-P366-2003SEPT_1
rdf:type dcat:Dataset ;
dcterms:accessRights [
rdf:type dcterms:RightsStatement ;
rdfs:comment "The metadata and files (if any) are available to the public." ;
] ;
dcterms:bibliographicCitation "Burgay, M; McLaughlin, M; Kramer, M; Lyne, A; Joshi, B; Pearce, G; D'Amico, N; Possenti, A; Manchester, R; Camilo, F (2017): Parkes observations for project P366 semester 2003SEPT. v1. CSIRO. Data Collection. https://doi.org/10.4225/08/598dc08d07bb7" ;
dcterms:description "Parkes multibeam high-latitude pulsar survey" ;
dcterms:hasPart [
rdf:type dcat:Dataset ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "PH0090_0011.sf" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
dcat:byteSize "1000000000"^^xsd:decimal ;
] ;
] ;
dcterms:hasPart [
rdf:type dcat:Dataset ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "PH0090_0021.sf" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
dcat:byteSize "402000000"^^xsd:decimal ;
] ;
] ;
dcterms:hasPart [
rdf:type dcat:Dataset ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "PH0090_0031.sf" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
dcat:byteSize "82000000"^^xsd:decimal ;
] ;
] ;
dcterms:identifier "https://doi.org/10.4225/08/598dc08d07bb7"^^xsd:anyURI ;
dcterms:identifier "ivo://au.csiro.atnf/P366-2003SEPT"^^xsd:anyURI ;
dcterms:identifier [
rdf:type adms:Identifier ;
dcterms:creator <https://www.doi.org/> ;
skos:notation "10.4225/08/598dc08d07bb7" ;
adms:schemeAgency "International DOI Foundation" ;
] ;
dcterms:modified "2017-07-30T08:55:55Z"^^xsd:dateTime ;
dcterms:rights [
rdf:type dcterms:RightsStatement ;
rdfs:comment "All Rights (including copyright) CSIRO 2017." ;
] ;
dcterms:temporal [
rdf:type dcterms:PeriodOfTime ;
rdf:type time:ProperInterval ;
time:hasBeginning [
rdf:type time:Instant ;
time:inXSDDate "2003-09-01"^^xsd:date ;
] ;
time:hasEnd [
rdf:type time:Instant ;
time:inXSDDate "2003-12-31"^^xsd:date ;
] ;
] ;
dcterms:title "Parkes observations for project P366 semester 2003SEPT" ;
dcat:contactPoint [
rdf:type v:Individual ;
v:fn "Marta Burgay" ;
v:hasEmail <mailto:burgay@oa-cagliari.inaf.it> ;
] ;
dcat:keyword "pulsar" ;
dcat:landingPage <https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT> ;
dcat:theme <http://registry.it.csiro.au/def/keyword/anzsrc/astronomical-and-space-sciences-not-elsewhere-classified> ;
.
@dr-shorthair, some questions:
dcterms:relation [ dcterms:identifier "isc2017.ttl" ; ] ;
I see this is a blank node, but I can't see what the referenced resource is or where it can be accessed. Should there not be a link to the file, rather than just the identifier?
@makxdekkers, some responses:
The repository that these examples come from does not assign external identifiers to the files/elements. Download access is mediated by a form. So this identification method was the best I could come up with.
Correct. As explained in the (revised) commentary above, these files are representations of parts of the dataset. Representations are usually modeled as dcat:Distribution. My sense is that a dct:hasPart relationship should be between dcat:Datasets. So I tried to respect these various issues using blank nodes for the notional (undescribed) datasets which have distributions that are the actual files.
These are real examples from CSIRO's Data Access Portal (DAP). The DCAT descriptions are, however, manually constructed by me. In the first description for each one I have not used any information that is not already in the metadata in the DAP. It is not perfectly aligned with DCAT, but is a real repository. The goal of this issue is to propose that we develop guidelines for such imperfect 'legacy' repositories.
The landing page URLs do work, so you can inspect the sources for yourself. https://data.csiro.au/dap/landingpage?pid=csiro:33937 https://data.csiro.au/dap/landingpage?pid=csiro:P366-2003SEPT
@makxdekkers re the
CKAN extension – see, e.g.: http://data.jrc.ec.europa.eu/dataset/jrc-predict-predict2017-core
Could we get this example in DCAT? I can't find the API specification to pull it down.
@dr-shorthair We'll need to ask @andrea-perego. I have no access to the back-end of the JRC catalogue.
Thanks @dr-shorthair. Here is an example where distributions were used for a case of multiple files, as there was no other way of representing this.
The example, as provided by the catalogue, is actually in schema.org, but pretty much there is a 1-to-1 mapping.
[] a schema:Dataset ;
schema:creator [ a schema:Organization ;
schema:name "Ofsted" ] ;
schema:dateModified "2016-12-12T14:16:44.522Z"^^schema:Date ;
schema:description "The outstanding providers list includes early years registered providers, maintained schools, independent schools, colleges and providers of work-based learning, adult education and children's social care. Two datasets are included: the first lists of all those providers who met the outstanding provider criteria in the most recent year for which data is available; the second is a list of all providers who have met the applicable criteria in any year since 1993. In the second list the year(s) in which that provider was included are also shown." ;
schema:distribution [ a schema:DataDownload ;
schema:contentUrl <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/481154/Outstanding_Providers_List_1993-2014.csv> ;
schema:fileFormat <CSV> ;
schema:name "Outstanding Providers list 1993-2014" ],
[ a schema:DataDownload ;
schema:contentUrl <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/480700/Outstanding_providers_list_2014-15.csv> ;
schema:fileFormat <CSV> ;
schema:name "Outstanding Providers list 2014-2015" ],
[ a schema:DataDownload ;
schema:contentUrl <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/571915/Outstanding_Providers_List_2015-16.ods> ;
schema:fileFormat <ODS> ;
schema:name "Outstanding Providers list 2015-2016" ] ;
schema:includedInDataCatalog [ a schema:DataCatalog ;
schema:url <https://data.gov.uk/> ] ;
schema:keywords "Education" ;
schema:license [ a schema:CreativeWork ;
schema:name "Open Government Licence" ;
schema:url <http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/> ] ;
schema:name "Outstanding providers list" ;
schema:url <https://data.gov.uk/dataset/63f9c959-00b6-4c51-b165-47f387ff7881/outstanding-providers-list> .
Here goes an attempt to use dcterms:relation instead:
[] a dcat:Dataset ;
dct:publisher [ a foaf:Organization ;
rdfs:label "Ofsted" ] ;
dct:modified "2016-12-12T14:16:44.522Z"^^xsd:dateTime ;
dct:description "The outstanding providers list includes early years registered providers, maintained schools, independent schools, colleges and providers of work-based learning, adult education and children's social care. Two datasets are included: the first lists of all those providers who met the outstanding provider criteria in the most recent year for which data is available; the second is a list of all providers who have met the applicable criteria in any year since 1993. In the second list the year(s) in which that provider was included are also shown." ;
dcterms:relation [
dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/481154/Outstanding_Providers_List_1993-2014.csv> ;
dcat:mediaType "text/csv" ;
dct:title "Outstanding Providers list 1993-2014" ],
[
dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/480700/Outstanding_providers_list_2014-15.csv> ;
dcat:mediaType "text/csv" ;
dct:title "Outstanding Providers list 2014-2015" ],
[
dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/571915/Outstanding_Providers_List_2015-16.ods> ;
dct:format <ODS> ;
dct:title "Outstanding Providers list 2015-2016" ] ;
dcat:keyword "Education" ;
dcterms:license [
dct:title "Open Government Licence" ;
schema:url <http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/> ] ;
dct:title "Outstanding providers list" ;
dct:identifier <https://data.gov.uk/dataset/63f9c959-00b6-4c51-b165-47f387ff7881/outstanding-providers-list> .
So, my questions/comments would be:
- Using dcterms:relation in this way to point to multiple files that are not really distributions is a simple and useful way to cover the use case, which wasn't covered in DCAT before.
- I used dcat:downloadURL above, but this is wrong as it has domain dcat:Distribution - what property to use instead? dcat:accessURL is also for distributions.
- With dcterms:relation used in this way, it is quite likely that developers would choose this simple representation even when the use of dcat:distribution would be appropriate; so, do we need to encourage the use of the richer semantic representation as per @dr-shorthair's examples (through guidance documentation in the spec, a primer, examples, etc.), and what would be the consequences of people using the simple representation instead?
@agbeltran Thanks for the example. This is exactly what I would not like to be allowed or encouraged by DCAT, as what you describe can be perfectly well represented as 3 datasets (each with a different temporal coverage and one distribution), and after the DCAT revision, hopefully, using a 4th dataset having these 3 as parts (i.e. dataset series).
The issues that you describe, i.e. properties having dcat:Distribution as domain, I see as a natural consequence of insufficient metadata description, not something that should be supported; supporting it would probably lead to further relaxation of the domains, and therefore a greater mess in DCAT data.
As I stated earlier, I do not see the value of allowing representation of "just a bag of files" and I would rather encourage publishers to describe the files properly rather than creating messy DCAT data.
@dr-shorthair Regarding your usage of blank nodes, coming from the Linked Data community, I would discourage their usage. Simply everything should have an IRI, according to the basic Linked Data principles. No one can anticipate that there will be no interest to link to, e.g. parts of datasets (or datasets in a dataset series, which I think is the same thing). Furthermore, I would object to stating that dataset parts should inherit some properties from their parent dataset, as again this is messier to consume.
@jakubklimek I understand your concern about the blank nodes. In this issue I was tackling a separate question: the lack of guidance on how to represent the information in many existing catalogs, and the consequent mis-use of the dcat:distribution property. The examples above are merely concerned with getting the modeling right. The key point is to propose that dct:hasPart relationships should be to other datasets, not to distributions.
Best practice would certainly be to identify and describe them in their own right. However, as we have no more information available in the catalog that I was quoting from, I was just making sure that the model was correct first.
We have already heard that existing catalogs commonly use blank nodes for Distributions. So we should probably tackle recommendations around blank nodes generally in a separate issue. Perhaps you can create that?
@agbeltran Great. The next step could be to use dct:hasPart - a sub-property of dct:relation - to finish the job.
[] a dcat:Dataset ;
dct:publisher [ a foaf:Organization ;
rdfs:label "Ofsted" ] ;
dct:modified "2016-12-12T14:16:44.522Z"^^xsd:dateTime ;
dct:description "The outstanding providers list includes early years registered providers, maintained schools, independent schools, colleges and providers of work-based learning, adult education and children's social care. Two datasets are included: the first lists of all those providers who met the outstanding provider criteria in the most recent year for which data is available; the second is a list of all providers who have met the applicable criteria in any year since 1993. In the second list the year(s) in which that provider was included are also shown." ;
dct:hasPart [
a dcat:Dataset ;
dcat:distribution [
dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/481154/Outstanding_Providers_List_1993-2014.csv> ;
dcat:mediaType "text/csv" ;
dct:title "Outstanding Providers list 1993-2014" ] ;
] ;
dct:hasPart [
a dcat:Dataset ;
dcat:distribution [
dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/480700/Outstanding_providers_list_2014-15.csv> ;
dcat:mediaType "text/csv" ;
dct:title "Outstanding Providers list 2014-2015" ] ;
] ;
dct:hasPart [
a dcat:Dataset ;
dcat:distribution [
dcat:downloadURL <https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/571915/Outstanding_Providers_List_2015-16.ods> ;
dct:format <ODS> ;
dct:title "Outstanding Providers list 2015-2016" ] ;
] ;
dcat:keyword "Education" ;
dct:license [
dct:title "Open Government Licence" ;
schema:url <http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/> ] ;
dct:title "Outstanding providers list" ;
dct:identifier <https://data.gov.uk/dataset/63f9c959-00b6-4c51-b165-47f387ff7881/outstanding-providers-list> .
@dr-shorthair Thanks for the clarification, now I think we are on the same page.
Regarding blank nodes, I created #300.
+1 on the usage of dcterms:hasPart in @agbeltran's example creating a dataset series (important for #81).
The global domain constraints on dcat:accessURL and dcat:mediaType entail that the resources entitled "ChronostratChart2017-02.pdf" and "timescale.zip" in the example above are both of type dcat:Distribution, although their relationship to the Dataset is not dcat:distribution.
Is this OK?
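The entailment in question can be made explicit with a small sketch. The two domain axioms are as described in the comment above; ex:dataset-1, ex:file-1 and the landing page IRI are hypothetical and stand in for the dataset and a linked file.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:      <http://example.org/> .

# Domain axioms, as described in the comment above.
dcat:accessURL rdfs:domain dcat:Distribution .
dcat:mediaType rdfs:domain dcat:Distribution .

# Asserted: a file linked with a generic property, not dcat:distribution.
# ex:dataset-1 and ex:file-1 are hypothetical.
ex:dataset-1 dcterms:references ex:file-1 .
ex:file-1 dcat:accessURL <http://example.org/landingpage> .

# Entailed under RDFS (rule rdfs2), although no dcat:distribution link exists:
# ex:file-1 a dcat:Distribution .
```

So any resource that carries dcat:accessURL or dcat:mediaType is inferred to be a dcat:Distribution, whichever property connects it to the dataset.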
@dr-shorthair This is interesting, and I think it is not OK. This leads to the question of whether dcat:Distributions can exist independently of datasets - i.e. distributions which are not part of any dataset. I would say they cannot... they are by definition distributions of a dataset.
The next question is whether your referenced files are distributions of another dataset and if so, which one? But then dcterms:references and dcterms:isFormatOf would connect a dataset to another dataset's distribution, which I think is not right.
Or, the example needs to be expanded, and these relations would connect to a dataset, which would have to have a distribution, like this:
<dataset> dcterms:references [ a dcat:Dataset;
dcat:distribution [
dcterms:identifier "timescale.zip" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip>
]
] ;
Regarding the isFormatOf relation, since it is defined as "A related resource that is substantially the same as the described resource, but in another format.", I would see this as a relation between two distributions (those have formats), not datasets, which are independent of formats.
I agree with @jakubklimek that it feels wrong. It seems to me that a Distribution is supposed to distribute something. The definition of Distribution says it "Represents a specific available form of a dataset", so there must be a connection to a Dataset, and that connection is modelled using dcat:distribution. How then to relate the timescale.zip file to the Dataset depends on the role of that file in relation to the Dataset. It is not obvious from the example that the file is a distribution of some other dataset. If it is, then @jakubklimek's suggestion might work; otherwise, maybe use more general properties that do not imply that the file is a distribution of anything:
dcterms:references [
dcterms:identifier "timescale.zip" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
foaf:page <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcterms:format <https://www.iana.org/assignments/media-types/application/zip>
]
OK - I've updated the example above to interpose a dcat:Dataset node in front of the distributions (currently a blank node - sorry @jakubklimek).
In the case of the dct:isFormatOf relation, the current resource is an RDF dataset, while the predecessor is an Image. The RDF is a re-formulation of the data on the image. Perhaps there is a better predicate than dct:isFormatOf?
For the time being, I've added an additional Distribution of the image to reinforce the message that this relationship is between datasets, each of which can have multiple representations:
dap:d33937
rdf:type dcat:Dataset ;
dcterms:title "RDF representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
dcterms:isFormatOf [
rdf:type dcat:Dataset ;
dcterms:source <http://stratigraphy.org/index.php/ics-chart-timescale> ;
dcterms:title "Graphical representation of 2017 edition of International Chronostratigraphic Chart (Geologic Timescale)" ;
dcterms:type dctype:Image ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "ChronostratChart2017-02.jpg" ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:byteSize "1629104"^^xsd:decimal ;
dcat:mediaType <https://www.iana.org/assignments/media-types/image/jpeg> ;
] ;
dcat:distribution [
rdf:type dcat:Distribution ;
dcterms:identifier "ChronostratChart2017-02.pdf" ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:byteSize "296233"^^xsd:decimal ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/pdf> ;
] ;
] ;
.
In the case of the dct:references relation, the target has type owl:Ontology.
dap:d33937
rdf:type dcat:Dataset ;
dcterms:references [
rdf:type dcat:Dataset ;
dcterms:title "Geological timescale ontology" ;
dcterms:type owl:Ontology ;
dcat:distribution [
dcterms:identifier "timescale.zip" ;
dcterms:license <https://creativecommons.org/licenses/by/4.0/> ;
dcat:accessURL <https://data.csiro.au/dap/landingpage?pid=csiro:33937> ;
dcat:mediaType <https://www.iana.org/assignments/media-types/application/zip> ;
] ;
] ;
.
Since the latter case refers to an OWL ontology serialized in Turtle and packaged in a zip archive, it will need updating when we resolve #259.
Resolved in https://www.w3.org/2018/07/19-dxwgdcat-minutes#x09 (see PR #295).
@dr-shorthair wrote:
@makxdekkers re the CKAN extension – see, e.g.: http://data.jrc.ec.europa.eu/dataset/jrc-predict-predict2017-core
Could we get this example in DCAT? I can't find the API specification to pull it down.
Sorry, @dr-shorthair & @makxdekkers, for not replying earlier. Here's the relevant RDF (abridged):
<http://data.europa.eu/89h/jrc-predict-predict2017-core>
a dcat:Dataset ;
dcterms:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/IRREG> ;
dcterms:description """PREDICT includes statistics on ICT industries and their R&D in Europe since 2006. [...]"""@en ;
dcterms:identifier "jrc-predict-predict2017-core" ;
dcterms:isReferencedBy <https://doi.org/10.2760/397817>, <https://doi.org/10.2760/63665> ;
dcterms:issued "2017-05-10"^^xsd:date ;
dcterms:language <http://publications.europa.eu/resource/authority/language/ENG> ;
dcterms:modified "2017-05-10"^^xsd:date ;
dcterms:publisher <http://publications.europa.eu/resource/authority/corporate-body/JRC> ;
dcterms:relation [
dcterms:description "PREDICT webpage (European Commission - JRC Science Hub)"@en ;
dcterms:format <http://publications.europa.eu/resource/authority/file-type/HTML> ;
dcterms:title "Prospective insights on R&D in ICT (PREDICT)"@en ;
dcat:accessURL <https://ec.europa.eu/jrc/en/predict>
] ;
dcterms:spatial <http://publications.europa.eu/resource/authority/continent/AFRICA>, <http://publications.europa.eu/resource/authority/continent/AMERICA>, <http://publications.europa.eu/resource/authority/continent/ANTARCTICA>, <http://publications.europa.eu/resource/authority/continent/ASIA>, <http://publications.europa.eu/resource/authority/continent/EUROPE>, <http://publications.europa.eu/resource/authority/continent/OCEANIA> ;
dcterms:subject <http://eurovoc.europa.eu/100146>, <http://eurovoc.europa.eu/100151> ;
dcterms:temporal [
a dcterms:PeriodOfTime ;
schema:endDate "2016-12-31"^^xsd:date ;
schema:startDate "1995-01-01"^^xsd:date
] ;
dcterms:title "2017 PREDICT Dataset"@en ;
dcat:contactPoint [
a vcard:Kind ;
vcard:hasEmail <mailto:montserrat.lopez-cobo@ec.europa.eu>
] ;
dcat:distribution [
a dcat:Distribution ;
dcterms:accessRights <http://data.jrc.ec.europa.eu/access-rights/no-limitations> ;
dcterms:description "The compressed zip file contains two Excel files splitting the complete 2017 PREDICT Dataset into: macroeconomic variables and R&D related variables."@en ;
dcterms:format <http://publications.europa.eu/resource/authority/file-type/XLS> ;
dcterms:license <http://publications.europa.eu/resource/authority/licence/COM_REUSE> ;
dcterms:title "2017 PREDICT Dataset, Excel file"@en ;
dcat:accessURL <https://ec.europa.eu/jrc/sites/jrcsh/files/2017_predict_core_dataset_xlsx.zip>
], [
a dcat:Distribution ;
dcterms:accessRights <http://data.jrc.ec.europa.eu/access-rights/no-limitations> ;
dcterms:description "The compressed zip file contains a CSV file including the complete 2017 PREDICT Dataset"@en ;
dcterms:format <http://publications.europa.eu/resource/authority/file-type/CSV> ;
dcterms:license <http://publications.europa.eu/resource/authority/licence/COM_REUSE> ;
dcterms:title "2017 PREDICT Dataset, CSV file"@en ;
dcat:accessURL <https://ec.europa.eu/jrc/sites/jrcsh/files/2017_predict_core_dataset_csv.zip>
] ;
dcat:keyword "ICT R&D and innovation"@en, "ICT industry analysis"@en, "ICT"@en, "R&D"@en, "digital economy"@en, "information society"@en, "innovation"@en, "statistics"@en ;
dcat:landingPage <https://ec.europa.eu/jrc/en/predict/ict-sector-analysis-2017/data-metadata> ;
dcat:theme <http://publications.europa.eu/resource/authority/data-theme/ECON>, <http://publications.europa.eu/resource/authority/data-theme/TECH> .
@dr-shorthair raised this in the mailing list:
I had a few concerns regarding this proposal:
[...] dcat:downloadURL, I would disagree with the possibility of linking them directly from a dcat:Dataset record, as this would create a mess wherever a publisher is too lazy to describe the data properly.
[...] dcat:distribution in a wrong way mainly due to the lack of support for dataset series, which is being resolved in this DCAT revision. When this support is added, publishers will have the possibility of modeling many use cases correctly.