w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

dcterms:hasPart in the context of nested Catalogs #1454

Closed andreasgeissner closed 2 years ago

andreasgeissner commented 2 years ago

Dear DCAT Team,

When trying to model our institutional research data repository, I have come across something that, to the eye of a semantic web amateur, looks like an inconsistency in the definition of dcterms:hasPart in the DCAT 3 public draft of 11 January 2022.

We have a tree-shaped system of nested categories in our DSpace repository, called communities, subcommunities, and collections that I want to model in DCAT. Any DSpace item (so bitstreams and metadata) is assigned to exactly one collection, which in turn belongs to exactly one subcommunity and so on. So this is unlike many other repositories where categories are more like additional subjects that can be mixed and matched.

Let’s say we have a category A that has two subcategories B and C. B and C contain (the metadata of) DSpace items B1,…,Bn and C1,…,Cn, respectively. A would then contain the metadata of both B1,…,Bn and C1,…,Cn. I would consider all of them to be (at least) dcat:Datasets, as they have metadata assigned to them and contain a clearly defined amount of data.

From a pure dataset perspective, it should be possible to state explicitly which data is in which dataset, and that A is a dataset that contains the data of B and C (disregarding the domain of dcat:dataset here):

B dcat:dataset B1,…,Bn .
C dcat:dataset C1,…,Cn .
A dcterms:hasPart B, C .

This is because dcterms:hasPart can be used to split datasets into multiple subdatasets, according to how I understand “multi-part datasets” in Issue #1205. The higher level dataset should contain at least all the information/data the subdatasets do. The current version of example C.1 (loosely structured catalog) uses this functionality.
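A minimal Python sketch of this "containment" reading, using only the standard library. The two members per subcategory are hypothetical stand-ins for B1,…,Bn and C1,…,Cn, and the function name is made up for illustration. It shows the interpretation described above, not something DCAT tooling derives on its own.

```python
# Hypothetical toy data: two members per subcategory stand in for B1..Bn and C1..Cn.
HAS_PART = {"A": ["B", "C"]}                        # A dcterms:hasPart B, C .
DATASETS = {"B": ["B1", "B2"], "C": ["C1", "C2"]}   # B / C dcat:dataset ... .


def all_datasets(catalog):
    """Return datasets listed directly in `catalog` plus those of all its parts."""
    found = set(DATASETS.get(catalog, []))
    for part in HAS_PART.get(catalog, []):
        found |= all_datasets(part)
    return found


print(sorted(all_datasets("A")))
```

Under this reading, A lists nothing directly but still "contains" everything listed in B and C.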

However, §5.1 introduces a dcat:Catalog as “a dataset in which each individual item is a metadata record describing some resource”, meaning A, B, and C would be dcat:Catalogs as well. This also follows from the domain of dcat:dataset being dcat:Catalog, but even if I could get around using that property, it seems inevitable to me from a definition standpoint.

In this context, dcterms:hasPart is defined as “An item that is listed in the catalog.”, which is similar to the respective definitions of, for example, dcat:dataset or dcat:service. My interpretation is that the metadata of the item is an entry in the catalog, not the item data. Am I right? (https://github.com/w3c/dxwg/commits/gh-pages/dcat/rdf/dcat-external.ttl restricts this definition to DCAT 2.0, but I assume this just has not been updated, unless I have overlooked something.)

A dcterms:hasPart B, C .

would then mean that B and C are listed in A, and not their datasets. It seems a bit inconsistent to me that DCAT 3 was designed to allow any kind of dataset to be broken into parts unless it happens to contain only metadata records, i.e. unless it is a dcat:Catalog. Do I misunderstand this, is there any intention to change this, or is it intentionally designed so that dcat:Catalogs cannot be split? A dcat:entry or something similar would make more sense to me for “An item listed in the catalog”. Of course, backwards compatibility might be a big issue here.

Alternatively, you could state the memberships explicitly, although this might lose the information about the relation between A and B, C. (As stated above, I’m not a semantic web expert, so I don’t know what inference would do in this situation.)

A dcat:dataset B1,…,Bn,C1,…,Cn .
B dcat:dataset B1,…,Bn .
C dcat:dataset C1,…,Cn .

I also don’t like this version in light of the statement for dcat:Catalog, that “A Web-based data catalog is typically represented as a single instance of this class”. Using dcterms:hasPart, you would still have one umbrella catalog with a clear structure so that you can look at parts of it. Here, you would just have a heap of catalogs that are not explicitly related.

Furthermore, not being able to model the subcategories as their own Datasets (and Datasets of metadata records are Catalogs) also would preclude linking exported information of the “subcatalogs”, e.g. as XML files, as “Catalog Distributions”.

Workarounds to keep information on the category structure in the RDF data might be DatasetSeries (but they are designed for datasets that can be split in a predictable fashion) or having one Catalog and the categories as a themeTaxonomy. It would seem a waste to lose this structure in RDF.

Thanks for your great work on the vocabulary!

Cheers, Andreas

smrgeoinfo commented 2 years ago

To figure this out, I'd need to understand more clearly what you're trying to do. To quote: "We have a tree-shaped system of nested categories in our DSpace repository, called communities, subcommunities, and collections that I want to model in DCAT. Any DSpace item (so bitstreams and metadata) is assigned to exactly one collection, which in turn belongs to exactly one subcommunity and so on." What do the items in a community, subcommunity, or collection have in common? Are these documents (e.g. PDF files) or structured datasets? If they are datasets, do they have a common schema? Are all the documents in a 'community' about the same place, time interval, biological taxon...?

andreasgeissner commented 2 years ago

Thank you very much!

I’ll try my best to explain. Please let me know if it’s insufficient or still not clear enough.

In general, the category order would be Community -> subcommunity -> subsubcommunity -> …-> collection

We are looking at research data from a university with a broad subject scope. The further we move up the tree, the more dissimilar the items would get. We plan to share our platform with other universities, so a community would be everything from one university (the top level dcat:Catalog). A subcommunity might be a department, for example social sciences, mathematics, chemistry, or mechanical engineering. A subsubcommunity might, for chemistry, represent the institutes of inorganic, organic, or analytical chemistry. The next level might be research groups, projects, etc. depending on how the institute and their part of the repository is organized.

Thus, I can’t provide a general pattern of similarity. But I can provide some more details with real life examples. Maybe this will help.

Collections might, for example, include a number of items that contain all data published in connection with a PhD thesis or all data published by a small research group or similar. Each item would have Dublin core style metadata and one or more files attached. These files might be tabular data, images, and audiovisual data, basically anything depending on what is used in the respective research area. So, for the DSpace item, we would have a dcat:Dataset with part datasets as “bag of files”.

For example, here is one collection that contains two items that are connected by originating from research funded by the same grant:
Collection landing page: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2840?locale-attribute=en
Item list: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2840/browse?locale-attribute=en
XML “distribution”: https://tudatalib.ulb.tu-darmstadt.de/oai/openairedata?verb=ListRecords&metadataPrefix=oai_dc&set=col_tudatalib_2840
(There are other distributions than Dublin Core; this is just an example.)

This research was done at the Institute for Mechatronic Systems. It is listed together with datasets from other collections in the subsubcommunity representing research of that institute:
Landing page: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/1448
Item list: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/1448/browse?type=title
XML “distribution”: https://tudatalib.ulb.tu-darmstadt.de/oai/openairedata?verb=ListRecords&metadataPrefix=oai_dc&set=com_tudatalib_1448

The institute is part of the Department for Mechanical Engineering. So the items will be listed together with items from the other subsubcommunities (and their collections):
Landing page: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/1436
Item list: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/1436/browse?type=title
XML “distribution”: https://tudatalib.ulb.tu-darmstadt.de/oai/openairedata?verb=ListRecords&metadataPrefix=oai_dc&set=com_tudatalib_1436

At the top level, there is the Community for the whole university. This (most important) Catalog will show all items from all collections of any subcommunity:
Landing page: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2544
Item list: https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2544/browse?type=title
XML “distribution”: https://tudatalib.ulb.tu-darmstadt.de/oai/openairedata?verb=ListRecords&metadataPrefix=oai_dc&set=com_tudatalib_2544

I was also thinking about making the connection between a higher level Catalog and the lower level ones by saying that the higher level Catalog contains via dcat:catalog the lower level Catalogs and via dcat:dataset all the items from the lower level Catalogs (similar to what is seen on the landing pages linked above). But this clashes with the item lists and XML distributions linked above.

andrea-perego commented 2 years ago

Thanks for the additional details, @andreasgeissner .

The possible solutions depend on what exactly you plan to use DCAT for. Is it about having a DCAT representation inside your DSpace instance, or rather to expose a DCAT representation of your metadata from DSpace and/or from your OAI-PMH endpoint?

If it is related to DSpace, the first question is whether (sub-)communities should be actually represented as dcat:Catalog's or rather as org:Organization's hierarchically organised, each linked (either explicitly or implicitly) to a given set of collections (dcat:Catalog's).

If it is related to OAI-PMH, then the OAI-PMH notion of "set" can actually be mapped to dcat:Catalog.

About your question on the use of dcterms:hasPart, defined in DCAT as

An item that is listed in the catalog.

please note that the notion of "item" here is not the same as in OAI-PMH; it means any dcat:Resource (dataset, series, service, catalogue) that is documented in a given catalogue.

The link between a catalogue and its metadata records is instead specified in DCAT by using property dcat:record, defined as follows:

A record describing the registration of a single resource (e.g., a dataset, a data service) that is part of the catalog.

However, dcterms:hasPart can also be used to specify a containment relationship between catalogues, to build a hierarchical structure similar to the set hierarchy available from your OAI-PMH endpoint - e.g.:

flowchart TB
A("dcat:Catalog<br>Technische Universität Darmstadt")-- dcterms:hasPart -->B1("dcat:Catalog<br>16 Fachbereich Maschinenbau")
A-- dcterms:hasPart -->B2("dcat:Catalog<br>...")
B1-- dcterms:hasPart -->C1("dcat:Catalog<br>Mechatronische Systeme im Maschinenbau (IMS)")
B1-- dcterms:hasPart -->C2("dcat:Catalog<br>...")
C1-- dcterms:hasPart -->D1("dcat:Catalog<br>DFG Project AMOS – 435227428")
C1-- dcterms:hasPart -->D2("dcat:Catalog<br>...")
D1-- dcat:dataset -->E1("dcat:Dataset<br>Supplementary data: Active vibration <br> control of an elastic rotor by using <br> its deformation as controlled variable")
D1-- dcat:dataset -->E2("dcat:Dataset<br>Supplementary data: Active vibration <br> control of a gyroscopic rotor using <br> experimental modal analysis")
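To make the left branch of this hierarchy concrete, here is a small Python sketch that walks from a dataset up to the top-level catalogue. The short identifiers (:TUDarmstadt, :FB16, :IMS, :DFG_AMOS, :Supp_Data1, :Supp_Data2) are the shorthand names used elsewhere in this thread, and the helper name is hypothetical. It assumes a strict tree, i.e. every node has at most one parent, as described for the DSpace repository.

```python
# Left branch of the flowchart as (subject, predicate, object) tuples.
TRIPLES = [
    (":TUDarmstadt", "dcterms:hasPart", ":FB16"),
    (":FB16", "dcterms:hasPart", ":IMS"),
    (":IMS", "dcterms:hasPart", ":DFG_AMOS"),
    (":DFG_AMOS", "dcat:dataset", ":Supp_Data1"),
    (":DFG_AMOS", "dcat:dataset", ":Supp_Data2"),
]

CONTAINMENT = ("dcterms:hasPart", "dcat:dataset")


def path_to_root(node, triples=TRIPLES):
    """Follow containment links upwards; assumes each node has at most one parent."""
    parent_of = {o: s for s, p, o in triples if p in CONTAINMENT}
    path = [node]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


print(" -> ".join(path_to_root(":Supp_Data1")))
```

The result mirrors the breadcrumb a DSpace landing page would show for an item, which is the structure the dcterms:hasPart chain is meant to preserve.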

Does this answer your questions?

andreasgeissner commented 2 years ago

Thanks for the detailed response, @andrea-perego . I had typed but not yet proofread a lengthy text to clarify my concern (which is copied below in case you need information out of it for the new issue), but Issue #1469 is exactly the solution I imagined. Thanks again!


The issue I wanted to have addressed is indeed the definition of dcterms:hasPart.

I think you understood correctly what I was trying to do (modelling the metadata contents of our repository in DCAT, and clearly stating the hierarchy of all catalogs and which dataset is listed in which (sub-)catalog). But I’m not sure that I correctly understand your response or that you saw the exact issue that I was trying to get across, so let me try to re-formulate. Sorry for the bother.

For one, is there any specific reason that you used dcterms:isPartOf instead of dcterms:hasPart in your model?

Secondly, if I go with the left route of your model:

:TUDarmstadt a dcat:Catalog ; dcterms:hasPart :FB16 .
:FB16 a dcat:Catalog ; dcterms:hasPart :IMS .
:IMS a dcat:Catalog ; dcterms:hasPart :DFG_AMOS .
:DFG_AMOS a dcat:Catalog ; dcat:dataset :Supp_Data1, :Supp_Data2 .

From my understanding, that would be the same as if I used dcat:catalog instead of dcterms:hasPart:

:TUDarmstadt a dcat:Catalog ; dcat:catalog :FB16 .
:FB16 a dcat:Catalog ; dcat:catalog :IMS .
:IMS a dcat:Catalog ; dcat:catalog :DFG_AMOS .
:DFG_AMOS a dcat:Catalog ; dcat:dataset :Supp_Data1, :Supp_Data2 .

Which means that:
:TUDarmstadt is a dcat:Catalog with one listed dcat:Resource (:FB16)
:FB16 is a dcat:Catalog with one listed dcat:Resource (:IMS)
:IMS is a dcat:Catalog with one listed dcat:Resource (:DFG_AMOS)
:DFG_AMOS is a dcat:Catalog with two listed dcat:Resources (:Supp_Data1, :Supp_Data2)

This understanding is based on the following definitions (I’m quoting the definitions from https://www.w3.org/TR/2022/WD-vocab-dcat-3-20220111/)

dcterms:hasPart: An item [my addition: dcat:Resource] that is listed in the catalog. [only in the context of dcat:Catalog, for dcat:Dataset it is the more generic definition]
dcat:dataset: A collection of data that is listed in the catalog.
dcat:service: A site or end-point that is listed in the catalog.
dcat:catalog: A catalog that is listed in this catalog.

The usage note for dcterms:hasPart states: “This is the most general predicate for membership of a catalog. Use of a more specific sub-property is recommended when available.”

I assumed dcterms:hasPart is suggested to be used if you were to create your own subclass of dcat:Resource and wanted to list an instance of that subclass in the catalog. But if you used it with an existing subclass of dcat:Resource (dcat:Dataset, dcat:DataService, and dcat:Catalog) it would have the same meaning as dcat:dataset, dcat:service, and dcat:catalog. After all, the more specific definitions are only recommended.
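On the question of what inference would do here: DCAT declares dcat:dataset, dcat:service, and dcat:catalog as sub-properties of dcterms:hasPart, so RDFS entailment runs from the specific property to the general one and not back. The following Python sketch imitates that single rule with plain tuples. It is an illustration only, not a reasoner, and the function name is hypothetical.

```python
# Sub-property declarations relevant here (specific -> general).
SUB_PROPERTY_OF = {
    "dcat:dataset": "dcterms:hasPart",
    "dcat:service": "dcterms:hasPart",
    "dcat:catalog": "dcterms:hasPart",
}


def entail_super_properties(triples):
    """Add (s, super, o) for every asserted (s, sub, o); never the reverse."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in SUB_PROPERTY_OF:
            inferred.add((s, SUB_PROPERTY_OF[p], o))
    return inferred


asserted = {
    (":TUDarmstadt", "dcat:catalog", ":FB16"),
    (":DFG_AMOS", "dcat:dataset", ":Supp_Data1"),
}
inferred = entail_super_properties(asserted)

# A bare dcterms:hasPart statement yields nothing more specific.
only_has_part = entail_super_properties({(":FB16", "dcterms:hasPart", ":IMS")})
```

So asserting dcat:catalog implies dcterms:hasPart, but asserting dcterms:hasPart between two catalogues does not, by this rule alone, imply dcat:catalog.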

What I want to show is that :TUDarmstadt is a dcat:Catalog that contains a number of datasets. There are certain subcatalogs that only contain a subselection of those datasets. I think that would be possible with the dcat:Dataset interpretation of dcterms:hasPart, but not the dcat:Catalog interpretation of dcterms:hasPart, because that one is identical to dcat:catalog.

The point I was trying to make: I think it would be good to have, for dcat:Catalog, properties that cover both interpretations of dcterms:hasPart. Or make it clear that these interpretations exist, if they do.

The best “other” option I was able to think of so far to create some relation between the dcat:Catalogs would be:

:TUDarmstadt a dcat:Catalog ;
    dcat:catalog :FB16 ;
    dcat:dataset :Supp_Data1, :Supp_Data2 .

:FB16 a dcat:Catalog ;
    dcat:catalog :IMS ;
    dcat:dataset :Supp_Data1, :Supp_Data2 .

:IMS a dcat:Catalog ;
    dcat:catalog :DFG_AMOS ;
    dcat:dataset :Supp_Data1, :Supp_Data2 .

:DFG_AMOS a dcat:Catalog ;
    dcat:dataset :Supp_Data1, :Supp_Data2 .

However, this would mean that the lower level catalogs are also listed in the higher level catalogs. This is not what you would see in most views and not what is seen in the OAI-PMH “distributions”. Also, it would mean that we would have a heap of dcat:Catalogs and not a big one with subcatalogs, at least not explicitly. I don’t like that with regard to the dcat:Catalog usage note “A Web-based data catalog is typically represented as a single instance of this class.”. Finally, it does not get the message across that content listed in the lower level catalogs is by default also listed in the higher level ones.
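Because the flattened listing repeats every dataset at every level, the rule "whatever a subcatalog lists, its parent lists too" is only implicit and has to be checked by hand. A rough Python sketch of such a check follows. The function name and the extra dataset :Supp_Data3 are hypothetical and exist only to show a violation.

```python
# The flattened model from above: sub-catalog links plus explicit listings.
SUBCATALOGS = {
    ":TUDarmstadt": [":FB16"],
    ":FB16": [":IMS"],
    ":IMS": [":DFG_AMOS"],
}
LISTED = {
    ":TUDarmstadt": {":Supp_Data1", ":Supp_Data2"},
    ":FB16": {":Supp_Data1", ":Supp_Data2"},
    ":IMS": {":Supp_Data1", ":Supp_Data2"},
    ":DFG_AMOS": {":Supp_Data1", ":Supp_Data2"},
}


def missing_from_parents(subcatalogs, listed):
    """Report datasets a sub-catalog lists that its direct parent does not."""
    problems = []
    for parent, children in subcatalogs.items():
        for child in children:
            gap = listed.get(child, set()) - listed.get(parent, set())
            if gap:
                problems.append((parent, child, sorted(gap)))
    return problems


consistent = missing_from_parents(SUBCATALOGS, LISTED)

# Hypothetical: a new dataset is added to the collection only.
broken = dict(LISTED)
broken[":DFG_AMOS"] = LISTED[":DFG_AMOS"] | {":Supp_Data3"}
violations = missing_from_parents(SUBCATALOGS, broken)
```

With an explicit containment relation between catalogues, this bookkeeping would not be needed, since the higher-level listing could be derived instead of maintained.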

andrea-perego commented 2 years ago

@andreasgeissner said:

Thanks for the detailed response, @andrea-perego . I had typed but not yet proofread a lengthy text to clarify my concern (which is copied below in case you need information out of it for the new issue), but Issue https://github.com/w3c/dxwg/issues/1469 is exactly the solution i imagined. Thanks again!

If I'm not mistaken, the two main points you raise concern:

  1. dcterms:hasPart vs dcat:catalog
  2. dcterms:hasPart vs dcterms:isPartOf

About point (1), dcterms:hasPart and dcat:catalog have different semantics, as said in https://github.com/w3c/dxwg/issues/1469 . In particular, dcat:catalog (as well as dcat:dataset and dcat:service) is meant to link a set (dcat:Catalog) to one of its elements (another dcat:Catalog). Therefore, it cannot be used to link nested catalogues, whereas dcterms:hasPart can be used for that purpose.

BTW, you can find existing examples of this use in DCAT-AP and DCAT-AP-JRC - e.g.: https://ec-jrc.github.io/dcat-ap-jrc/#catalogue-collection

About point (2), dcterms:hasPart is included in DCAT because the direction of its sub-properties is from a catalogue to the listed resources. However, dcterms:isPartOf can also be used in addition to (but not to replace) dcterms:hasPart - about this, see §7. Use of inverse properties.
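As a rough illustration of "in addition to, but not to replace", the following Python sketch keeps every dcterms:hasPart statement and adds the matching dcterms:isPartOf statement next to it. The function name is hypothetical, and the triples reuse the shorthand identifiers from earlier in this thread.

```python
def with_inverse(triples):
    """Keep all triples and add (o, dcterms:isPartOf, s) for each hasPart triple."""
    result = list(triples)
    for s, p, o in triples:
        if p == "dcterms:hasPart":
            inverse = (o, "dcterms:isPartOf", s)
            if inverse not in result:
                result.append(inverse)
    return result


catalogue = [
    (":TUDarmstadt", "dcterms:hasPart", ":FB16"),
    (":FB16", "dcterms:hasPart", ":IMS"),
]
expanded = with_inverse(catalogue)

for triple in expanded:
    print(" ".join(triple), ".")
```

The original hasPart triples stay in place, so consumers that only understand the catalogue-to-resource direction still see the full hierarchy.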

In case you have reasons to think it should be the other way round, I suggest you contribute them to https://github.com/w3c/dxwg/issues/1469

andreasgeissner commented 2 years ago

Thanks for pointing me to the DCAT-AP use of dcterms:hasPart.

As I said, your suggestions in #1469 look perfect to me, so I have nothing further to add. Thanks again!

andrea-perego commented 2 years ago

Thanks for confirming it, @andreasgeissner .

Issue closed.