w3c / dxwg

Data Catalog Vocabulary (DCAT)
https://w3c.github.io/dxwg/dcat/

Semantics of the dcat:bbox attribute could (should?) be more explicit #1392

Closed JoepvanGenuchten closed 2 years ago

JoepvanGenuchten commented 3 years ago

The http://www.w3.org/ns/dcat#bbox attribute is somewhat vague (or perhaps lends itself to misinterpretation). dcat describes resources (datasets), and while I suspect that this attribute is intended to say something about what the resource is about, it could also be interpreted as describing the (physical) nature of the resource itself. A somewhat forced example: an old-fashioned phone book is about a region or a municipality, but the phone book itself also has a bounding box (describing how thick it is, for example, and where it lies on my shelf).

In general, naming an attribute after its type (or range, in rdfs terms) is not specific enough to make its intent clear (although that might be a personal preference). While the spec mentions that the range is 'intentionally generic', the technical origin of this concept (CAD drawings, BIM, etc., where it is a mathematical construct for representing a three-dimensional object) will most likely cause most people (and people programming machines) to ignore the 'intentionally generic' part of the spec.

My suggestion would be either to rename this attribute to make its intent clearer (dcat:subjectBoundingBox, for example), or to elaborate in the description on its intended meaning (the phone book vs. the region the phone book covers).

andrea-perego commented 3 years ago

@JoepvanGenuchten , thanks for raising this issue.

I think there are two separate aspects here: one is about the range, and one about the actual semantics of this property.

  1. About the range, what is generic is the literal encoding of the geometry (rdfs:Literal). In the current draft of DCAT3 we made a slight revision to clarify this point (see https://github.com/w3c/dxwg/issues/1359):

    The range of this property (rdfs:Literal) is intentionally generic, with the purpose of allowing different geometry literal encodings. E.g., the geometry could be encoded as a WKT literal (geosparql:wktLiteral [GeoSPARQL]).

    Do you think this is not clear enough, or am I missing your point here?

  2. About the semantics of dcat:bbox, the property is meant to specify the geographic bounding box of a resource, as per its definition. So, if the resource is a phonebook, as in your example, it would denote the bounding box of the geometry of the phonebook itself:

    a:PhoneBook dcat:bbox "POLYGON((...))"^^geosparql:wktLiteral .

    On the other hand, to specify the geographic / administrative area covered by the numbers in the phonebook, you should use dcat:bbox with the class instance (e.g., dcterms:Location) denoting that area, and link the area to the phonebook (e.g., by using property dcterms:spatial):

    a:PhoneBook dcterms:spatial [ a dcterms:Location ;
      dcat:bbox "POLYGON((...))"^^geosparql:wktLiteral ] .

    This is exactly how dcat:bbox is used in DCAT to specify the spatial coverage of a dataset, as shown in the relevant example in the DCAT specification: https://w3c.github.io/dxwg/dcat/#ex-spatial-coverage-bbox

    Does this address your concerns?
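
    For readers who want to try the second pattern, here is a minimal Python sketch that builds a bounding-box WKT literal and wraps it in the dcterms:spatial / dcat:bbox structure shown above. The helper names `bbox_wkt` and `spatial_coverage_turtle` are hypothetical, the coordinates in the usage note are made up, and the longitude-latitude axis order is an assumption (no CRS is stated in the thread). The output also assumes the dcat, dcterms and geosparql prefixes are declared elsewhere.

```python
def bbox_wkt(min_lon, min_lat, max_lon, max_lat):
    """Return a closed WKT POLYGON ring for an axis-aligned box.

    Assumption: coordinates are given in longitude-latitude order.
    """
    if min_lon > max_lon or min_lat > max_lat:
        raise ValueError("min values must not exceed max values")
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first corner to close the ring
    ]
    ring = ", ".join(f"{lon} {lat}" for lon, lat in corners)
    return f"POLYGON(({ring}))"


def spatial_coverage_turtle(subject, wkt):
    """Render the dcterms:spatial / dcat:bbox pattern as a Turtle snippet.

    Assumption: the prefixes used here are declared by the caller.
    """
    return (
        f"{subject} dcterms:spatial [ a dcterms:Location ;\n"
        f'  dcat:bbox "{wkt}"^^geosparql:wktLiteral ] .'
    )
```

    Calling `spatial_coverage_turtle("a:PhoneBook", bbox_wkt(3.0, 47.5, 7.25, 53.5))` yields a statement about the area the phone book covers, not about the phone book as a physical object.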

rob-metalinkage commented 3 years ago

@andrea-perego is correct - semantics is relative to the domain the property is used for.

There needs to be some out-of-band (i.e. not in the DCAT model) statement about how any domain intends to use any generic property - so, for example, how to interpret the concept of a "box" is probably domain-specific.

Another example is dcterms:conformsTo - in DCAT it relates to the dataset - whereas in most other usages it seems to relate to the information object itself, not the object it is describing.

NB the semantics of "conformance" itself caused much debate - and was eventually delegated to the domain to interpret.

JoepvanGenuchten commented 3 years ago

@rob-metalinkage @andrea-perego thank you for your responses. This clarifies a lot, and it also helps me refine my own comment here.

tldr: my feedback comes down to 2 things:

in more detail (and reverse order ;-)):

1.a) I see that the argument here is that the exact meaning can be derived from the domain and the range. Fair enough, that is indeed one of the strengths of OWL. I am still worried about possible misinterpretations, and I think a more elaborate definition could go a long way (right now it says "The geographic bounding box of a resource."; see also next point).

1.b) About the naming of the dcat:bbox attribute (and, upon further reading, the same can be said for dcat:centroid): when making a semantic model, we aim to give any rdf:Property (be it a datatype property or an object property) a clear name (label, URI) that references the meaning or significance of the relationship. To take an example where this is obviously done right: in RDF Schema there are multiple relationships between rdf:Property and rdfs:Resource. We have rdfs:domain and rdfs:range. We intuitively understand that just because a relationship points from something of type Property to something of type Resource, we cannot simply call that relationship rdfs:resource. We give the properties clear names (and URIs) that tell us something about what we mean by them. But I feel that here we make that exact mistake. By defining dcat:bbox (even if you say it's a literal because it can be any technical serialization of a bounding box), we are basically saying (or at the very least implying): there is only 1 semantically meaningful relationship between dcat:Resource and any bounding box representation, and we are not very clear about what we mean by it. Alternative names for dcat:bbox (that I think say more about what we want to know, or emphasize what we intend) could be "dcat:occupiesPhysicalSpace" or something like that. This also leaves open the option of representing this information in another way. Requiring it to be a (certain kind of) bbox belongs, in my opinion, to the realm of SHACL (see also next point).

2.a) Taking a step back: why does dcat concern itself with bounding boxes (and centroids) in the first place? Does the rather technical object definition of bbox really add, functionally speaking, to what dcat is trying to achieve in terms of how we communicate about our data catalogs and their resources? There are whole taxonomies of how to model geometries and shapes, some much more accurate than a bounding box (why would a swept solid represent this information any less accurately?), some less so. Why pin it down to this one if what you really want to know is where the resource is physically located?

2.b) Is this relationship really different from GeoSPARQL's hasGeometry, or from a similar relationship in the Industry Foundation Classes? If so, why does this vocabulary have such unique requirements that it should define its own property for it? Can't we just conform to the models that domain experts have made for shapes and geometries? If not: why not explicitly use one of those?

Hope this helps!

rob-metalinkage commented 3 years ago

I think the GeoSPARQL group are possibly looking at providing options for describing the semantics of geometry properties. These should perhaps just be used via a GeoDCAT profile. Any real-world spatial object has multiple possible geometric representations - and these may vary according to other aspects of its state. A dataset covering the spatial domain of some object would potentially share these.

andrea-perego commented 3 years ago

@JoepvanGenuchten ,

About "why does dcat concern itself with bounding boxes (and centroids) in the first place?", you can find the background discussion in https://github.com/w3c/dxwg/issues/83 , which also links to the relevant use case in the DXWG UCR document. Trying to summarise it:

The original version of DCAT did not provide guidance on how to specify the spatial coverage of a dataset by using geometries. Implementation experience showed that this gap was raising interoperability issues, and therefore DCAT2 addresses it by supporting specific properties for the most typical cases - i.e., geometries, bounding boxes, and centroids.

The reason why specific properties for bounding boxes and centroids have been defined in the DCAT namespace is that there was no standard way of doing this - i.e., commonly used vocabularies, such as the W3C Basic Geo and GeoSPARQL, don't have such properties (something that is instead being addressed in the new version of GeoSPARQL under development), with the exception of Schema.org (schema:box).

About your note at point 1.b:

By defining dcat:bbox [...], we are basically saying (or at the very least implying): there is only 1 semantically meaningful relationship between dcat:Resource and any bounding box representation, and we are not very clear about what we mean by it.

Any suggestion on how to improve the description of dcat:bbox is more than welcome. However, as I said in https://github.com/w3c/dxwg/issues/1392#issuecomment-905022220 , dcat:bbox is just meant to specify the bounding box of a spatial thing, but it does not exclude other types of relationships between this thing and a bounding box. I gave the example of spatial coverage, but the same approach applies to any other type of relationship - e.g., a relationship like the one you give as an example (:occupiesPhysicalSpace) can be specified as follows:

a:Resource a dcat:Resource ;
  :occupiesPhysicalSpace [ a dcterms:Location ;
    dcat:bbox "POLYGON((...))"^^geosparql:wktLiteral
] .

I am not sure I answered all your points, so please let me know whatever I missed. Also, it would be very useful if you could complement the issues you are highlighting with specific use cases and examples, so as to better understand the possible weaknesses of the current DCAT approach in addressing your requirements.

JoepvanGenuchten commented 3 years ago

@andrea-perego thank you for the links to the other discussions.

About "why it's in here": I see a lot of clarifying arguments, especially in issue 83. I agree with the formal arguments about why you might want a separate object property to describe this particular use case. But I maintain it should be handled by a geospatially oriented vocabulary and not by one aimed at data cataloging. I would say it makes the cataloging vocabulary cluttered/bulky/bloated with concepts that are/should be handled in different domains (personal context note: I have been working with the IEC-CM, which is a fantastic but massive reference model for the electric utilities. Its very size stands in the way of adoption through the sheer intimidation newcomers feel when trying to work with it, so keeping reference models small and 'digestible' is one of the things I try to aim for in my work). Having said that: I get the sense that a decision about this issue has been made, and although I might disagree with the outcome, perhaps I should lay that to rest.

Upon more detailed inspection of the ontology (admittedly, I had only looked at the TR until now), I realize the domain of these properties is dcterms:Location and not Resource, which is what the text suggests (this makes it even more tempting to hammer on the previous point, but I won't ;-)). Given this, I would propose the following:

for dcat:bbox: "Represents the physical space a dcterms:Location occupies"

for dcat:centroid: "Represents the centerpoint of a dcterms:Location"

I appreciate the time you guys take to work through this.

bertvannuffelen commented 3 years ago

@JoepvanGenuchten

About "why it's in here": I see a lot of clarifying arguments, especially in issue 83. I agree with the formal arguments about why you might want a separate object property to describe this particular use case. But I maintain it should be handled by a geospatially oriented vocabulary and not by one aimed at data cataloging. I would say it makes the cataloging vocabulary cluttered/bulky/bloated with concepts that are/should be handled in different domains

This is a general issue of modularity and reuse. I agree that if there were, like dcterms, a very generic geo vocabulary defining the bounding box of a resource (geo:bbox), then this property could be reused in this usage context. However, there is none (at least to my knowledge). To satisfy the use case of expressing bounding box information about the spatial coverage of a dataset, a new property has to be created with a URI in the DCAT domain. That is all fine.

The issue comes when other people would like to reuse this property outside the scope of DCAT. Then the story becomes difficult, because it then looks as if DCAT has defined the domain-neutral universal property geo:bbox, while DCAT is, as you mentioned, a scoped vocabulary about cataloging resources.

From the DCAT vocabulary perspective, one cannot avoid cherry-picking reuse beyond the DCAT scope. But I agree DCAT should not become the upper ontology for the semantic web just because this one is active. In the first place, the semantics given to dcat:bbox should support the DCAT use cases. In that respect I am happy with the current definition and do not feel the need to add "physical" to it. It might, though, be improved w.r.t. 'resource'. I admit that when reading the paragraph https://w3c.github.io/dxwg/dcat/#Property:location_bbox on its own, one could interpret resource as a cataloged resource and not as an rdfs:Resource, because the domain is way up and only visible after scrolling. This might be (one of) the source(s) for filing the issue. For other properties in the same situation, e.g. https://w3c.github.io/dxwg/dcat/#Property:checksum_algorithm , the domain has been added.

JoepvanGenuchten commented 3 years ago

@bertvannuffelen I can accept not adding "physical", but would push on not using the term 'resource' in the definition and replacing it with (dcterms:)Location, beyond issues about documentation rendering. After all: by setting the domain of the property to be dcterms:Location, we say this is a property of the location, and not of the resource (dcat or rdfs) that happens to find itself there.

dr-shorthair commented 3 years ago

@bertvannuffelen is this what you are looking for? geosparql:hasBoundingBox sf:Envelope

bertvannuffelen commented 3 years ago

@dr-shorthair yes this seems close to what I meant. I went through the geosparql spec and there might be compatibility problems with the domains and ranges:

a) Domain: geosparql:hasBoundingBox has as its domain geo:Feature, which is a geo:SpatialObject.

So now the question arises, if one would like to reuse it: is dct:Location a geo:SpatialObject? If that is not the case, then reuse creates discussions. Do you know the answer?

b) Range: geosparql:hasBoundingBox has as its range geo:Geometry, which differs from the range of dcat:bbox, which is rdfs:Literal.

Although the intent of the two properties is very similar, the chosen modeling approaches are probably not trivially compatible with each other. And maybe, after a more in-depth investigation, they will turn out to be properties that cannot be merged.

dr-shorthair commented 3 years ago

@bertvannuffelen I see no problem with a). geosparql:Feature is pretty general, and I can't see any undesirable entailments.

On b) I think it works if geosparql:hasBoundingBox is a sub-property of dcterms:spatial, and dcat:bbox a sub-property of geosparql:hasSerialization .

andrea-perego commented 3 years ago

@dr-shorthair said:

On b) I think it works if geosparql:hasBoundingBox is a sub-property of dcterms:spatial, and dcat:bbox a sub-property of geosparql:hasSerialization .

Not sure about this. dcat:bbox links a spatial thing to its bounding box, specified as a literal, so I think it is more correct to say that it corresponds to the following property chain:

geosparql:hasBoundingBox / geosparql:hasSerialization
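
To make the property-chain reading concrete, here is a minimal Python sketch over a toy set of triples. It only illustrates the idea that following geosparql:hasBoundingBox and then geosparql:hasSerialization reaches the same literal that dcat:bbox points to directly. It is not a reasoner: the function name `follow_chain` and the `ex:` resources are hypothetical, and the correspondence itself is the proposal made in this comment, not something stated in the DCAT or GeoSPARQL specifications.

```python
def follow_chain(triples, subject, *predicates):
    """Return the objects reached from `subject` by following `predicates` in order."""
    frontier = {subject}
    for predicate in predicates:
        # Keep only objects of triples whose subject is in the current frontier.
        frontier = {o for (s, p, o) in triples if s in frontier and p == predicate}
    return frontier


# Toy data: ex:area and ex:areaBox are made-up resources for illustration.
triples = {
    ("ex:area", "geosparql:hasBoundingBox", "ex:areaBox"),
    ("ex:areaBox", "geosparql:hasSerialization", "POLYGON((...))"),
    ("ex:area", "dcat:bbox", "POLYGON((...))"),
}

via_chain = follow_chain(
    triples, "ex:area", "geosparql:hasBoundingBox", "geosparql:hasSerialization"
)
direct = follow_chain(triples, "ex:area", "dcat:bbox")
```

In this toy data both `via_chain` and `direct` contain the same single literal, which is the sense in which the two modeling styles could be mapped onto each other.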

andrea-perego commented 3 years ago

Trying to summarise the results of the discussion in this thread, and outlining possible actions:

  1. Improve the definition of dcat:bbox & dcat:centroid: @JoepvanGenuchten , @bertvannuffelen , thanks for highlighting the ambiguity of the term "resource" in the current definition of these properties. We'll discuss this issue in one of the next DCAT meetings.
  2. Re-use reference vocabularies for use cases where DCAT spatial properties are not enough: The new version of GeoSPARQL can do the job, also because it addresses those gaps that led to the definition of spatial properties in the DCAT namespace.

About point (2), there's of course the issue of having two alternative ways to specify the same information, which does not help interoperability. However, this might be mitigated by defining mappings, as explained by @dr-shorthair .

@JoepvanGenuchten , @bertvannuffelen , are you happy with this summary? Is there anything I left out?

bertvannuffelen commented 3 years ago

@andrea-perego I am happy with the summary.

JoepvanGenuchten commented 3 years ago

@andrea-perego good summary, I have nothing to add.

andrea-perego commented 2 years ago

@JoepvanGenuchten , @bertvannuffelen ,

The following point:

  1. Improve the definition of dcat:bbox & dcat:centroid: @JoepvanGenuchten , @bertvannuffelen , thanks for highlighting the ambiguity of the term "resource" in the current definition of these properties. We'll discuss this issue in one of the next DCAT meetings.

has been addressed via PR https://github.com/w3c/dxwg/issues/1423 (now merged in the ED). If you have concerns about the adopted solution, please open a new issue.

Moreover, a new issue (https://github.com/w3c/dxwg/issues/1425) has been created for further discussion on the following point :

  2. Re-use reference vocabularies for use cases where DCAT spatial properties are not enough: The new version of GeoSPARQL can do the job, also because it addresses those gaps that led to the definition of spatial properties in the DCAT namespace.

About point (2), there's of course the issue of having two alternative ways to specify the same information, which does not help interoperability. However, this might be mitigated by defining mappings, as explained by @dr-shorthair .