kurzum opened 5 years ago
From a cursory look at the metadata at https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-objects/2019.09.01/dataid.ttl#Dataset, it looks to me that your dataid:SingleFile is very similar to dcat:Distribution.
If all the files contain the same data in different languages, they could be modelled as separate dcat:Distributions under one dcat:Dataset. If they contain different data, they could be modelled as different dcat:Datasets.
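The two options can be sketched in Turtle. This is a minimal illustration only: all ex: names, titles, and download URLs are invented for the example and do not come from the DBpedia metadata.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

# Option 1: same data in different languages ->
# one Dataset with one Distribution per language file
ex:mappingbased-objects a dcat:Dataset ;
    dct:title "Mappingbased Objects 2019.09.01" ;
    dcat:distribution ex:file-en, ex:file-de .

ex:file-en a dcat:Distribution ;
    dcat:downloadURL <http://example.org/mappingbased-objects_lang=en.ttl.bz2> .

ex:file-de a dcat:Distribution ;
    dcat:downloadURL <http://example.org/mappingbased-objects_lang=de.ttl.bz2> .

# Option 2: genuinely different data -> separate Datasets
ex:mappingbased-objects-en a dcat:Dataset ;
    dcat:distribution ex:file-en .

ex:mappingbased-objects-de a dcat:Dataset ;
    dcat:distribution ex:file-de .
```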
The problem is the fine line between same and different. Here, the main thing all distributions have in common is that they were created by the same code in the same activity. Content-wise they are true variants of each other. All of them make up the dataset, but they are useful individually and in combination. dataid:SingleFile is already a subclass of dcat:Distribution, but we might switch to dataid:FileCollection to better model the semantics. DCAT 2 seems to be evolving in the DataID (https://wiki.dbpedia.org/projects/dbpedia-dataid) direction, which is quite good.
I am specifically asking here because we host all the metadata of 5k monthly files in a SPARQL endpoint: https://databus.dbpedia.org/yasgui/
Having an extra dataset node for each file would be infeasible and impractical. For us it would be helpful to have a better definition of variants. But we can also create one ourselves as an extension.
Indeed, there is a fine line between same and different. It's maybe useful to think from the perspective of a 'general' user, a person who is not aware of the way data is produced and how it is structured. While I guess that your regular users know what to expect, a non-initiated user might rightly expect that a dataset has distributions that all contain the same data. In fact, the current DCAT draft says in section 6.7: "all distributions of one dataset should broadly contain the same data". Maybe your use case could be added for consideration for the next version of DCAT?
Maybe your use case could be added for consideration for the next version of DCAT?
Hm, I was under the impression that I already added it for DCAT 2.0 by posting it here. What else do I need to do?
all distributions of one dataset should broadly contain the same data
-> still true. They are variants of the same data. You can fuse them consistently into one as well. The definition really depends on the part of the real world the data is supposed to describe, right? So in a person dataset, distributions could be partitioned alphabetically.
General users of DBpedia - we tried to figure out what it means to be a general DBpedia user. Our conclusion is that we don't have those: they all want a different partition of the data. Hence the popularity of the SPARQL endpoint. We separated the technical file layer from the ability to create collections (which are dcat:Catalogs).
@kurzum The content of DCAT 2 is frozen. We're about to transition to Candidate Recommendation. In this phase we can't include new use cases, so this could be on the list for DCAT 3.
As far as I can tell, 'partitioning of distributions' is not currently something that DCAT supports. The current note in section 6.7 uses the example of budget data for different years where you could imagine partitioning the distributions per year. However, the draft suggests that those 'would usually be' modelled as different datasets. So if a future version of DCAT wants to formalise the partitioning of Distributions, there is some modelling work to do.
Ok, this works for us in a sense. We consider this a contentVariant by time, whereas versions would be tied to updates of the dataset. There are good reasons to use distributions here. So this is a SHOULD in terms of the standard.
I guess DCAT 2 still doesn't tackle abstract dataset identity.
I will check 6.7 more closely. Does DCAT 2 use SHACL for anything?
Thanks for the good explanation
If each of the dumps is intended to be a representation of the same conceptual dataset, then even if the content is different (because they have different time-stamps), they can still all be legitimately considered 'distributions' of that dataset. The dcat:distribution relationship is mostly about intention.
But I see that your application has some axes of complexity. There are some relevant tools in DCAT 2 at the dcat:Dataset level, such as Roles. I suspect that these might provide a basis for describing your data. But it likely would be more reproducible if there were a couple more classes, something like:
- dcat:DatasetSeries (another sub-class of dcat:Resource): a sequence of datasets sharing most of the description, with just the temporal or spatial footprint differing (see #868 on the backlog).
- dcat:DistributionPackage: a set of resources which, used together, provide a representation of a Dataset; a richer version of bag-of-files.

@dr-shorthair Yes, that's what I meant by "there is some modelling work to do".
Here is the model which we will adopt: dcat:DatasetSeries, but with these properties, as they follow the Maven POM:

<https://boa.lmcloud.vse.cz/databus/linked-hypernyms/2016.04.01/dataid.ttl#Dataset>
    # subclass of dcat:Dataset
    a dataid:Dataset ;
    dataid:account databus:propan ;
    dataid:group <https://databus.dbpedia.org/propan/lhd> ;
    dataid:artifact <https://databus.dbpedia.org/propan/lhd/linked-hypernyms> ;
    dataid:version <https://databus.dbpedia.org/propan/lhd/linked-hypernyms/2016.04.01> ;
    dct:hasVersion "2016.04.01" .
Ordering will be lexicographic over the version string, so it works with SPARQL ORDER BY.
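For illustration, such an ordering could look like this in SPARQL. This is only a sketch against the properties in the snippet above; the dataid: namespace URI is an assumption.

```sparql
PREFIX dataid: <http://dataid.dbpedia.org/ns/core#>
PREFIX dct:    <http://purl.org/dc/terms/>

# Latest version first: lexicographic ordering is sufficient because
# the version strings are zero-padded dates such as "2016.04.01".
SELECT ?dataset ?version WHERE {
  ?dataset a dataid:Dataset ;
           dataid:artifact <https://databus.dbpedia.org/propan/lhd/linked-hypernyms> ;
           dct:hasVersion ?version .
}
ORDER BY DESC(?version)
```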
Then we will create:

dataid:DatabusDistribution rdfs:subClassOf dcat:Distribution .

which will have dataid:DatabusDistributions that together form the Distribution, like dcat:DistributionPackage but limited to files; these have contentVariant tags and different format and compression variants. This way, datasets will be flat and can be aggregated and queried more easily.
@kurzum, do you have any further points you would like to discuss? Otherwise, we are going to close this issue.
@andrea-perego DCAT 2 failed to address this. It is still a huge gap. Are you moving things to DCAT 3?
@kurzum said:
@andrea-perego DCAT 2 failed to address this. It is still a huge gap. Are you moving things to DCAT 3?
DCAT 3 is actually meant to include explicit support for versioning and dataset series.
The current drafts are available here:
Your comments would be welcome.
DBpedia Releases
Status:
Identifier: https://databus.dbpedia.org/dbpeda/
Creator: Sebastian Hellmann
Description
We are releasing several thousand files per month now, and I have specific questions about dcat:Distribution. In our case, we group each version according to the generating Scala code: https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/2019.09.01 In this example, each month the code is run over 40 different Wikipedia dumps and generates 40 different files according to their language variant. All these files together make up the dataset, and each file is a partial distribution. See the metadata here: https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-objects/2019.09.01/dataid.ttl#Dataset
I could not find an appropriate model in the current draft to describe this properly. It is more structured than the bag-of-files approach, as the data uses the Maven model with group/artifact/version and then content/format/compression variants.
Note that we consider language/different source a variant. All files make up the version snapshot dataset, while you would only need a subset of the files for any given use case. A similar example would be the split of files into consecutive compressed parts (e.g. 20 * 50 MB of 1 GB data), with the difference that there you would need all the files to get the distribution. How would this be modelled in the current draft?