w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Monthly DBpedia releases #1085

Open kurzum opened 5 years ago

kurzum commented 5 years ago

DBpedia Releases

Status:
Identifier: https://databus.dbpedia.org/dbpeda/
Creator: Sebastian Hellmann

Description

We are now releasing several thousand files per month, and I have specific questions about dcat:Distribution.

In our case, we group each version according to the generating Scala code: https://databus.dbpedia.org/dbpedia/mappings/mappingbased-objects/2019.09.01 In this example, the code is run each month over 40 different Wikipedia dumps and generates 40 different files according to their language variant. All these files together make up the dataset, and each file is a partial distribution. See the metadata here: https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-objects/2019.09.01/dataid.ttl#Dataset
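A minimal sketch of that structure (names shortened; only two of the 40 language files shown; namespace and property names illustrative, not authoritative):

@prefix dataid: <http://dataid.dbpedia.org/ns/core#> .
@prefix dcat:   <http://www.w3.org/ns/dcat#> .

# group/artifact/version identifies the monthly run; each per-language
# file is a partial distribution of the resulting dataset.
<#Dataset>
        a                 dataid:Dataset ;
        dcat:distribution <#mappingbased-objects_lang=en.ttl.bz2> ,
                          <#mappingbased-objects_lang=de.ttl.bz2> .

<#mappingbased-objects_lang=en.ttl.bz2>  a dataid:SingleFile .
<#mappingbased-objects_lang=de.ttl.bz2>  a dataid:SingleFile .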

I could not find an appropriate model in the current draft to describe this properly. It is more structured than the bag-of-files approach, as the data uses the Maven model with group/artifact/version and then content/format/compression variants.

Note that we consider language/different source a variant. All files together make up the version snapshot dataset, while you would only need a subset of the files for any given use case. A similar example would be the split of files into consecutive compressed parts (e.g. 20 × 50 MB parts of 1 GB of data), with the difference that there you would need all the files to get the complete distribution. How would this be modelled in the current draft?

makxdekkers commented 5 years ago

@kurzum From a cursory look at the metadata at https://downloads.dbpedia.org/repo/lts/mappings/mappingbased-objects/2019.09.01/dataid.ttl#Dataset, it looks to me as if your dataid:SingleFile is very similar to dcat:Distribution. If all the files contain the same data in different languages, they could be modelled as separate dcat:Distributions under one dcat:Dataset. If they contain different data, they could be modelled as different dcat:Datasets.
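In sketch form (hypothetical example.org identifiers), the two options would look like this:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

# Option 1: the files are the same data in different languages -
# one dataset with one distribution per language file.
ex:mappingbased-objects a dcat:Dataset ;
        dcat:distribution ex:file-en , ex:file-de .

ex:file-en a dcat:Distribution .
ex:file-de a dcat:Distribution .

# Option 2: the files contain genuinely different data -
# one dataset per file.
ex:dataset-en a dcat:Dataset .
ex:dataset-de a dcat:Dataset .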

kurzum commented 5 years ago

The problem is the fine line between same and different. Here, the main thing all distributions have in common is that they were created by the same code in the same activity. Content-wise they are true variants of each other. All of them together make up the dataset, but they are useful individually and in combination. dataid:SingleFile is already a subclass of dcat:Distribution, but we might switch to dataid:FileCollection to better model the semantics. DCAT 2 seems to be evolving in the direction of DataID (https://wiki.dbpedia.org/projects/dbpedia-dataid), which is quite good.
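In Turtle (dataid: namespace assumed), the existing axiom and the possible switch would read:

@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix dataid: <http://dataid.dbpedia.org/ns/core#> .

dataid:SingleFile     rdfs:subClassOf dcat:Distribution .   # current DataID
dataid:FileCollection rdfs:subClassOf dcat:Distribution .   # possible switch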

I am asking here specifically because we host the metadata of all 5k monthly files in a SPARQL endpoint: https://databus.dbpedia.org/yasgui/

Having an extra dataset node for each file would be infeasible and impractical. For us it would be helpful to have a better definition of variants, but we can also create one ourselves as an extension.

makxdekkers commented 5 years ago

Indeed, there is a fine line between same and different. It may be useful to think from the perspective of a 'general' user, a person who is not aware of the way the data is produced and how it is structured. While I guess that your regular users know what to expect, an uninitiated user might rightly expect that a dataset has distributions that all contain the same data. In fact, the current DCAT draft says in section 6.7: "all distributions of one dataset should broadly contain the same data". Maybe your use case could be added for consideration for the next version of DCAT?

kurzum commented 5 years ago

> Maybe your use case could be added for consideration for the next version of DCAT?

Hm, I was under the impression that I had already added it for DCAT 2.0 by posting it here. What else do I need to do?

> all distributions of one dataset should broadly contain the same data

Still true: they are variants of the same data, and you can fuse them consistently into one as well. The definition really depends on the part of the real world the data is supposed to describe, right? So in a person dataset, the distributions could be partitioned alphabetically.
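A minimal sketch of that alphabetical partitioning (hypothetical identifiers):

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

# One person dataset; each distribution is a variant covering a slice of
# the described world rather than a full copy of the data.
ex:persons a dcat:Dataset ;
        dcat:distribution ex:persons-a-m , ex:persons-n-z .

ex:persons-a-m a dcat:Distribution ;
        dct:title "Persons with surnames A-M"@en .

ex:persons-n-z a dcat:Distribution ;
        dct:title "Persons with surnames N-Z"@en .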

General users of DBpedia: we tried to figure out what it means to be a general DBpedia user. Our conclusion is that we don't have those; they all want a different partition of the data. Hence the popularity of the SPARQL endpoint. We separated the technical file layer from the ability to create collections (which are dcat:Catalogue).

makxdekkers commented 5 years ago

@kurzum The content of DCAT 2 is frozen. We're about to transition to Candidate Recommendation. In this phase we can't include new use cases, so this could be on the list for DCAT 3.

As far as I can tell, 'partitioning of distributions' is not currently something that DCAT supports. The current note in section 6.7 uses the example of budget data for different years where you could imagine partitioning the distributions per year. However, the draft suggests that those 'would usually be' modelled as different datasets. So if a future version of DCAT wants to formalise the partitioning of Distributions, there is some modelling work to do.
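Sketched with hypothetical identifiers, the contrast would be:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/> .

# What the draft suggests: one dataset per year.
ex:budget-2018 a dcat:Dataset ;
        dct:title "Budget 2018"@en .
ex:budget-2019 a dcat:Dataset ;
        dct:title "Budget 2019"@en .

# What formalised 'partitioning of distributions' would mean instead:
# one dataset whose distributions each carry one year's data.
ex:budget a dcat:Dataset ;
        dcat:distribution ex:budget-dist-2018 , ex:budget-dist-2019 .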

kurzum commented 5 years ago

OK, this works for us in a sense. We consider this a contentVariant by time, whereas versions would be tied to updates of the dataset. There are good reasons to use distributions here. So this is a SHOULD in terms of the standard.

I guess DCAT 2 still doesn't tackle abstract dataset identity.

I will check section 6.7 more closely. Does DCAT 2 use SHACL for anything?

kurzum commented 5 years ago

Thanks for the good explanation.

dr-shorthair commented 5 years ago

If each of the dumps is intended to be a representation of the same conceptual dataset, then even if the content is different (because they have different time-stamps) they can still all be legitimately considered 'distributions' of that dataset. The dcat:distribution relationship is mostly about intention.

But I see that your application has some axes of complexity. There are some relevant tools in DCAT 2:

I suspect that these might provide a basis for describing your data, but it would likely be more reproducible if there were a couple more classes, something like:

makxdekkers commented 5 years ago

@dr-shorthair Yes, that's what I meant by "there is some modelling work to do".

kurzum commented 5 years ago

Here is the model we will adopt:

<https://boa.lmcloud.vse.cz/databus/linked-hypernyms/2016.04.01/dataid.ttl#Dataset>
        # dataid:Dataset is a subclass of dcat:Dataset
        a                dataid:Dataset ;
        dataid:account   databus:propan ;
        dataid:group     <https://databus.dbpedia.org/propan/lhd> ;
        dataid:artifact  <https://databus.dbpedia.org/propan/lhd/linked-hypernyms> ;
        dataid:version   <https://databus.dbpedia.org/propan/lhd/linked-hypernyms/2016.04.01> ;
        dct:hasVersion   "2016.04.01" .    # further properties elided

Ordering will be lexicographic over the version string, so it works with SPARQL ORDER BY.

Then we will create:

dataid:DatabusDistribution rdfs:subClassOf dcat:Distribution.

which will have:

This way datasets will be flat and can be aggregated and queried more easily.
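As an illustrative instance (the variant property names below are placeholders, not our final vocabulary; download URL hypothetical):

@prefix dataid: <http://dataid.dbpedia.org/ns/core#> .
@prefix dcat:   <http://www.w3.org/ns/dcat#> .
@prefix dct:    <http://purl.org/dc/terms/> .

# One flat distribution node per file, carrying the content/format/
# compression variant axes directly, so it can be aggregated and queried
# without intermediate dataset nodes.
<#linked-hypernyms_lang=en.ttl.gz>
        a                         dataid:DatabusDistribution ;
        dataid:contentVariant     "lang=en" ;    # placeholder property
        dataid:formatVariant      "ttl" ;        # placeholder property
        dataid:compressionVariant "gz" ;         # placeholder property
        dct:hasVersion            "2016.04.01" ;
        dcat:downloadURL          <https://example.org/linked-hypernyms_lang=en.ttl.gz> .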

andrea-perego commented 3 years ago

@kurzum, do you have any further points you would like to discuss? Otherwise, we are going to close this issue.

kurzum commented 3 years ago

@andrea-perego DCAT 2 failed to address this. It is still a huge gap. Are you moving things to DCAT 3?

andrea-perego commented 3 years ago

@kurzum said:

> @andrea-perego DCAT 2 failed to address this. It is still a huge gap. Are you moving things to DCAT 3?

DCAT 3 is actually meant to include explicit support for versioning and dataset series.
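As a rough sketch against the current draft (hypothetical identifiers; please check the drafts below for the exact terms), a monthly release chain could look like:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ex:   <http://example.org/> .

# The series groups the monthly releases; each release is an ordinary
# dataset that points back to the series and carries its version string.
ex:mappingbased-objects a dcat:DatasetSeries .

ex:mappingbased-objects-2019-09 a dcat:Dataset ;
        dcat:inSeries ex:mappingbased-objects ;
        dcat:version  "2019.09.01" .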

The current drafts are available here:

Your comments would be welcome.