w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

DCAT property for subsets ? #1527

Open dzkwsk opened 2 years ago

dzkwsk commented 2 years ago

The content of a statistical classification evolves over time: the explanatory notes for items may change slightly and therefore have successive versions. While it is important to point directly to the current version of all items in the classification, it is also relevant to obtain the history of all items. It would therefore be useful to be able to distinguish two distributions (dcat:Distribution) of a classification: one holding only the current notes, the other holding the whole history.

dcat:hasCurrentVersion may not be exactly what we need. This property could separate different versions that correspond to the same master object. In our case, it is more of a content restriction to the latest versions of the notes.

A subproperty of dcat:distribution whose scope is a subset corresponding to the current contents of a dcat:Dataset would be relevant. Or would dcat:hasCurrentVersion still make sense anyway?

Deliverable(s): XKOS Best Practices

http://linked-statistics.github.io/xkos/xkos-best-practices.html#issue-container-number-12

smrgeoinfo commented 2 years ago

If a classification system (A) is coherent and covering (non-overlapping, no gaps) for its scope, then if an individual class (C1) in the system is updated such that it changes the classification of other entities, then the update is breaking, and the classification system with the new concept (C1) MUST be identified as a new classification system (B). The issue is that if an entity is a member of class x under system A, it is not necessarily a member of class x under system B.

Versioning of classification systems is very tricky!
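The rule above can be sketched as a small check: an update is breaking if at least one entity ends up in a different class. This is only an illustration. The function names and the sample entities are hypothetical, and a real classification would need a far richer model than a flat entity-to-class mapping.

```python
def reclassified_entities(old_assignment, new_assignment):
    """Return the entities whose class differs between two assignments.

    An entity missing from the new assignment counts as reclassified,
    because the system would no longer cover it (a gap).
    """
    return sorted(
        entity
        for entity, old_class in old_assignment.items()
        if new_assignment.get(entity) != old_class
    )


def is_breaking_update(old_assignment, new_assignment):
    """An update is breaking if at least one entity changes class."""
    return bool(reclassified_entities(old_assignment, new_assignment))


# Hypothetical sample data: entity -> class under system A.
under_a = {"bakery": "C1", "brewery": "C2", "winery": "C2"}

# Only an explanatory note was reworded: no entity moves.
after_note_edit = dict(under_a)

# C1 was redefined so that breweries now fall under it.
after_redefinition = {"bakery": "C1", "brewery": "C1", "winery": "C2"}

print(is_breaking_update(under_a, after_note_edit))
print(reclassified_entities(under_a, after_redefinition))
```

Under this reading, the second update would call for a new classification system (B), while the first would not.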

dr-shorthair commented 2 years ago

Correct @smrgeoinfo

The statistical agencies are generally all across this. However DCAT is probably incomplete since I don't think we had anyone with expertise in official statistics in the conversations.

nicholascar commented 2 years ago

I urge the DCAT editors to defer fine-grained versioning and change details to Dataset modelling, and not to try to cater for them at the DCAT metadata level.

Consider: a large dataset like the Australian Address Database has addresses added and removed every few months, so should it have a long list of Distributions? No! The Dataset is the overall thing and Address addition/removal/change is annotated at the Feature (sub-Dataset) level since it is complex and knowledge about what an 'Address' is - a sub-Dataset element - is needed to correctly use such information.

My motivation for stepping in here is that I would hate to see DCAT get too expressive: more skill in the vocabulary will harm adoption for simple catalogues, given the perception of it being "heavyweight", and broad adoption is more important to me than deep skill.

Anyway, there are already many Semantic Web ways to model versioning issues (e.g. PAV) that are DCAT-compatible. So use DCAT for the catalogue and drop down into fine-grained versioning in PAV, SDMX/QB etc. as needed.

tfrancart commented 2 years ago

> If a classification system (A) is coherent and covering (non-overlapping, no gaps) for its scope, if an individual class (C1) in the system is updated such that it changes the classification of other entities, then the update is breaking, and the classification system with the new concept (C1) MUST be identified as a new classification system (B)

I agree, but this is NOT what the original use-case describes. The use-case is that a class in a given classification system A is described with explanatory notes, and these explanatory notes change over time, but this does not lead to a reclassification of entities, so we are not creating a new classification system B. And the history of the notes is kept.

So basically, the question is what would be the recommended practice between:

  1. Keeping a single Dataset and multiple distributions, some distributions with full note history, some distributions without complete history of notes (just the latest version). And in that case how to identify/link/tag note-history-complete-distributions vs. only-current-note-version-distributions.
  2. Declaring 2 different Datasets: ex:classificationA-WithFullNoteHistory and ex:classificationA-WithOnlyMostCurrentNotes, and how to identify/link/tag those 2 datasets
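To make the two options concrete, here is a minimal sketch that writes each one out as plain (subject, predicate, object) triples, using only the Python standard library. The `http://example.org/` namespace and all distribution IRIs are hypothetical. Only `dcat:Dataset`, `dcat:Distribution` and `dcat:distribution` come from DCAT itself. The sketch does not answer the open question of how to tag or link the two variants.

```python
DCAT = "http://www.w3.org/ns/dcat#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace, for illustration only


def option_1():
    """Option 1: a single dataset with two distributions."""
    ds = EX + "classificationA"
    full = EX + "classificationA/dist/full-note-history"    # hypothetical IRI
    current = EX + "classificationA/dist/current-notes"     # hypothetical IRI
    return {
        (ds, RDF_TYPE, DCAT + "Dataset"),
        (ds, DCAT + "distribution", full),
        (ds, DCAT + "distribution", current),
        (full, RDF_TYPE, DCAT + "Distribution"),
        (current, RDF_TYPE, DCAT + "Distribution"),
    }


def option_2():
    """Option 2: two datasets, each with its own distribution."""
    full_ds = EX + "classificationA-WithFullNoteHistory"
    current_ds = EX + "classificationA-WithOnlyMostCurrentNotes"
    full = full_ds + "/dist"        # hypothetical IRI
    current = current_ds + "/dist"  # hypothetical IRI
    return {
        (full_ds, RDF_TYPE, DCAT + "Dataset"),
        (current_ds, RDF_TYPE, DCAT + "Dataset"),
        (full_ds, DCAT + "distribution", full),
        (current_ds, DCAT + "distribution", current),
        (full, RDF_TYPE, DCAT + "Distribution"),
        (current, RDF_TYPE, DCAT + "Distribution"),
    }


def count_instances(triples, class_iri):
    """Count the subjects explicitly typed with the given class."""
    return sum(1 for s, p, o in triples if p == RDF_TYPE and o == class_iri)


for name, graph in (("option 1", option_1()), ("option 2", option_2())):
    print(name,
          "datasets:", count_instances(graph, DCAT + "Dataset"),
          "distributions:", count_instances(graph, DCAT + "Distribution"))
```

In both options nothing in the triples themselves says which distribution carries the full history. That missing discriminator is exactly what the question is about.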

For more details and regarding what XKOS suggests in terms of versioning of notes in statistical classification, see http://linked-statistics.github.io/xkos/xkos-best-practices.html#bp-notes-versioning-timestamping

smrgeoinfo commented 2 years ago

Just my opinion, but to me, if the changes do not cause reclassification of entities (or introduction of new subcategories), then it would make sense to me to have one distribution with all the notes (assuming they are time stamped in some way).

tfrancart commented 2 years ago

> Just my opinion, but to me, if the changes do not cause reclassification of entities (or introduction of new subcategories), then it would make sense to me to have one distribution with all the notes (assuming they are time stamped in some way).

Yes notes are timestamped (see link sent previously for details). Yes we want to have one distribution with all the notes. But we are also considering providing another distribution with only the most recent note of each concept, and not the full note history; we think it can be easier for data consumers if they don't have to query the note history to simply retrieve the current note.
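As an illustration of what the "current notes only" distribution saves consumers from doing, here is a minimal sketch that derives the current note of each concept from a timestamped note history. The record layout, the concept identifiers and the note texts are all hypothetical. The actual XKOS representation of timestamped notes is described in the best-practices link above.

```python
from datetime import date

# Hypothetical note history: (concept, note text, date the note was issued).
note_history = [
    ("ex:C1", "Includes bakeries.", date(2015, 1, 1)),
    ("ex:C1", "Includes bakeries and pastry shops.", date(2020, 6, 1)),
    ("ex:C2", "Includes breweries.", date(2015, 1, 1)),
]


def current_notes(history):
    """Keep, for each concept, only the most recently issued note."""
    latest = {}
    for concept, text, issued in history:
        if concept not in latest or issued > latest[concept][1]:
            latest[concept] = (text, issued)
    return {concept: text for concept, (text, _issued) in latest.items()}


for concept, text in sorted(current_notes(note_history).items()):
    print(concept, "->", text)
```

A consumer of the full-history distribution would have to run something like this (or the equivalent query) themselves. The second distribution would ship the result directly.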

And so the question is: would this be 2 distributions of the same dataset, or 2 datasets (which may not be practical for reusers)? And how should those distributions or datasets be identified/linked/tagged?

riccardoAlbertoni commented 2 years ago

Before further studying, I noticed that http://linked-statistics.github.io/xkos/xkos-best-practices.html#issue-container-number-12 seems to have disappeared from your draft document. Should we assume you have already resolved your doubts?

Can we close this issue?

tfrancart commented 2 years ago

For the moment we have simply deferred the answer, and we have actually referred to this very issue; here is how the XKOS best practices document now reads:

> These versions containing the history of all notes are considered as different distributions of the datasets. They should be described by specific properties. Here we use rdfs:comment for describing and discriminating the distributions, given a lack of other modeling alternatives in DCAT (as of september 2022; this was raised in this DCAT issue). Other information is given by the dcterms:temporal metadata that has a larger span for the distributions containing all the note changes.

Exact pointer: http://linked-statistics.github.io/xkos/xkos-best-practices.html#bp-publishing-classification
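To show what the quoted workaround asks of a consumer, here is a minimal sketch in which each distribution is described only by a free-text comment (standing in for rdfs:comment) and a temporal coverage (standing in for dcterms:temporal), and the full-history distribution is guessed as the one with the widest coverage. The identifiers, comments and dates are all hypothetical, and the "widest span wins" heuristic is an assumption made for illustration, not something the XKOS document prescribes.

```python
from datetime import date

# Hypothetical distribution descriptions. "comment" stands in for
# rdfs:comment and "temporal" for the start/end of dcterms:temporal.
distributions = [
    {
        "id": "ex:dist-current-notes",
        "comment": "Contains only the current version of each note.",
        "temporal": (date(2022, 1, 1), date(2022, 9, 30)),
    },
    {
        "id": "ex:dist-full-note-history",
        "comment": "Contains the history of all notes.",
        "temporal": (date(2008, 1, 1), date(2022, 9, 30)),
    },
]


def temporal_span_days(distribution):
    """Length of the temporal coverage, in days."""
    start, end = distribution["temporal"]
    return (end - start).days


def widest_temporal_coverage(candidates):
    """Heuristic: the distribution with the longest coverage is
    assumed to be the one containing all the note changes."""
    return max(candidates, key=temporal_span_days)


print(widest_temporal_coverage(distributions)["id"])
```

The heuristic only works by convention. A dedicated property would let consumers select the right distribution without parsing comments or comparing date ranges.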

So no, I don't think the issue should be closed.