samvera-deprecated / geo_concerns

PCDM based geospatial models for Hydra

dcterms:spatial breaks the range of that predicate. #108

Closed scande3 closed 8 years ago

scande3 commented 8 years ago

I was looking at this project to get inspiration for representing geographical data on objects represented with RDF and noticed that one of your predicates breaks the stated allowed range. This would be the following one: https://github.com/projecthydra-labs/geo_concerns/blob/db90f53d8a73a77570ed408b91aef320cff84650/app/schemas/basic_geo_metadata_optional.rb#L25

The official declaration (along with the usage guidelines) states that the value of that predicate must be a URI rather than the list of strings you have it set up to be: http://wiki.dublincore.org/index.php/User_Guide/Publishing_Metadata#dcterms:spatial

Does this group consider this an issue at all or is the consensus that that range isn't a problem?

(Additional side inquiry as well: as RDF is unordered, one loses the hierarchy of terms. So the comment of "[ 'France', 'Spain' ]" could be returned as "[ 'Spain', 'France' ]" meaning representing things consistently in a UI might be challenging. Is there some documentation somewhere on what this group might be considering to represent a geographic hierarchy?).

johnhuck commented 8 years ago

Hi @scande3 ! Hope you are well. Thanks so much for the comments and raising the issue for us.

You raise a valid point about the range of dcterms:coverage. My own understanding of the conundrum of choosing between dc: elements and terms continues to evolve, and so I'm always glad to have a chance to learn from others.

The safer thing to do in this case would be to use dc:coverage from the elements/1.1/ namespace, which does not have domain or range defined for it. So perhaps we should do that for the sake of an easy (and correct) fix. @drh-stanford @eliotjordan @jrgriffiniii ?

At the same time, the user guide doesn't say they must be URIs but that they must be non-literals, and the various examples in the guide (which are really helpful, btw) make liberal use of blank nodes in order to match the defined range for the predicates. Given that the textual description of dcterms:coverage specifically mentions coordinates ("Spatial topic and spatial applicability may be a named place or a location specified by its geographic coordinates"), which would be hard to mint a URI for, and the example in fact shows them implemented with the help of a blank node, I have to conclude that blank nodes are presumed to be necessary to use many of the terms in the dcterms: namespace that are defined for non-literals (and especially the ones that go beyond the original 15 elements) with correct usage.
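To make the options in this exchange concrete, here is a minimal sketch in Python (standard library only) that writes out N-Triples for the three shapes under discussion: a plain literal, a blank node carrying a label, and a URI as the object of dcterms:spatial. The work and place URIs under example.org are hypothetical, and putting rdfs:label on the blank node is just one common pattern assumed for illustration, not something taken from the guide or from geo_concerns.

```python
DCTERMS_SPATIAL = "http://purl.org/dc/terms/spatial"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical identifiers, used only for illustration.
WORK = "http://example.org/works/1"
PLACE = "http://example.org/places/france"


def iri(value):
    """Format an IRI as an N-Triples term."""
    return f"<{value}>"


def literal(value):
    """Format a simple string literal (only backslash and quote are escaped)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(s, p, o):
    """Join three already formatted terms into one N-Triples statement."""
    return f"{s} {p} {o} ."


def spatial_as_literal(label):
    # Object is a literal: outside the declared range of dcterms:spatial.
    return [triple(iri(WORK), iri(DCTERMS_SPATIAL), literal(label))]


def spatial_as_blank_node(label):
    # Object is a blank node (a non-literal) that carries the label.
    return [
        triple(iri(WORK), iri(DCTERMS_SPATIAL), "_:place1"),
        triple("_:place1", iri(RDFS_LABEL), literal(label)),
    ]


def spatial_as_uri():
    # Object is a URI (a non-literal) that identifies the place.
    return [triple(iri(WORK), iri(DCTERMS_SPATIAL), iri(PLACE))]


if __name__ == "__main__":
    for lines in (spatial_as_literal("France"),
                  spatial_as_blank_node("France"),
                  spatial_as_uri()):
        print("\n".join(lines))
        print()
```

Only the second and third shapes give the predicate a non-literal object; the first is the shape the issue is asking about.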

But I don't know how widespread absolutely correct usage is in the wild, and I don't know what Curation Concerns and the rest of the Hydra stack do with blank nodes or how they are viewed in terms of development work/PCDM/HydraWorks etc. I would love to find out!

My intuition tells me that in general most (or many) people probably blithely use dcterms: terms for literals, blissfully unaware of their formal properties and strictly correct usage. In fact, one could argue that the ability of people to use dc elements however they liked, with few restrictions, is one of the reasons it has been so successful.

And I think it's less a case of 'breaking' the range in the sense of datatype validation (although I could be wrong), as interfering with inferencing that happens at a later point, when the inferencing says that such-and-such literal value must be a member of class X, because of the range of the predicate. If it's the difference between saying that a blank node is a member of class X or a literal is a member of class X, there's less 'value' or 'meaning' being lost there than between a URI and a literal. I don't know what kind of a smashup happens when an inferencing engine expecting a non-literal runs into a literal, but I would hope it wouldn't be catastrophic.
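The inferencing concern can be illustrated with a small sketch of the RDFS range rule, written in plain Python rather than with a real reasoner. It assumes the range of dcterms:spatial is dcterms:Location, uses prefixed names as plain strings, and marks literals with a hypothetical `Literal` wrapper class; the `ex:` resources are invented for the example.

```python
from dataclasses import dataclass

RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


@dataclass(frozen=True)
class Literal:
    """Hypothetical marker for a literal value, as opposed to a resource."""
    value: str


def infer_range_types(triples):
    """Apply the range rule: (p rdfs:range C) and (s p o) imply (o rdf:type C)."""
    range_decls = [(s, o) for (s, p, o) in triples if p == RDFS_RANGE]
    inferred = set()
    for prop, cls in range_decls:
        for _s, p, o in triples:
            if p == prop:
                # The rule does not check whether o is a literal or a resource.
                inferred.add((o, RDF_TYPE, cls))
    return inferred


# Assumed range declaration plus two invented works.
TRIPLES = [
    ("dcterms:spatial", RDFS_RANGE, "dcterms:Location"),
    ("ex:work1", "dcterms:spatial", Literal("France")),
    ("ex:work2", "dcterms:spatial", "ex:france"),
]

if __name__ == "__main__":
    for statement in sorted(infer_range_types(TRIPLES), key=repr):
        print(statement)
```

In this naive version nothing fails: the string "France" is simply inferred to be a Location alongside ex:france, which matches the reading that a literal here produces an odd inference rather than a validation error. How an actual inferencing engine treats the literal case is not something this sketch can confirm.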

In the past, I have taken the comment at the end of this FAQ page as an indication that DCMI recognizes, perhaps even accepts, that the terms may not be used with strict observance of the semantic specifics (http://wiki.dublincore.org/index.php/FAQ/DC_and_DCTERMS_Namespaces):

"Update, December 2011: It is worth noting that the Schema.org initiative is taking a pragmatic approach towards the formal ranges of their properties [6]: 'We also expect that often, where we expect a property value of type Person, Place, Organization or some other subClassOf Thing, we will get a text string. In the spirit of "some data is better than none", we will accept this markup and do the best we can.' What constitutes 'best practice' in this area is bound to evolve with implementation experience over time."

What is 2016 best practice? Is it closer to correct usage now or still somewhere in the middle? Always a good question to ask.

Anyway, thanks again, Steven.

scande3 commented 8 years ago

@johnhuck - You are already using Dublin Core Elements coverage to hold a dcmi-box or dcmi-point, as far as I can tell from: https://github.com/projecthydra-labs/geo_concerns/blob/ff39585c833fbf0f09fdfb444888ec4e48f119d8/app/schemas/basic_geo_metadata_required.rb#L17

Using that to also represent a bunch of place labels may make parsing that element more complex for everyone (along with making a form to edit that field more challenging). Additionally, that dcmi-box allows for a "name" in its string encoding that acts as a label already if one has coordinates on the object.
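For readers unfamiliar with the DCMI Box string encoding mentioned here, the following is a rough Python sketch of pulling the limits and the optional name out of such a string. It assumes the simple `key=value; key=value` form and does not handle escaped `;` or `=` characters or the other optional components (units, projection and so on); `parse_dcsv` and `parse_dcmi_box` are hypothetical helper names, not part of geo_concerns.

```python
LIMITS = ("northlimit", "southlimit", "westlimit", "eastlimit")


def parse_dcsv(text):
    """Split a simple 'key=value; key=value' string into a dict of strings."""
    components = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            continue  # unlabeled components are skipped in this sketch
        components[key.strip()] = value.strip()
    return components


def parse_dcmi_box(text):
    """Return the numeric limits found in the string plus the name, if any."""
    raw = parse_dcsv(text)
    box = {key: float(raw[key]) for key in LIMITS if key in raw}
    box["name"] = raw.get("name")  # None when the box carries no label
    return box


if __name__ == "__main__":
    encoded = ("name=Western Australia; northlimit=-13.5; "
               "southlimit=-35.5; westlimit=112.5; eastlimit=129")
    print(parse_dcmi_box(encoded))
```

The `name` component is what lets a box act as its own label, which is the reason a separate list of place labels in the same field would complicate parsing.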

(That works for some of what I plan to do and I'd use it as that code model specifies. My additional use cases are for things that lack coordinates but may still be geographic. Or for things that have a complex geographic hierarchy. Essentially this is just research being done for the MODS and RDF group as we figure out what to map geographic subjects to in the linked data world).

Opinions are currently split over whether it is valid to ignore the range of an element. The MODS and RDF group is currently leaning towards using a different predicate or creating our own rather than invalidating what a predicate specifies. The Geo Concerns group may feel differently as a whole, and I thought I'd see if that was the case.

I wouldn't recommend using a blank node to solve this issue. They don't work that well - both in terms of how they are stored in Fedora 4 and in trying to parse them when resolving an object.

Thanks!

johnhuck commented 8 years ago

@scande3 - You know, even though you put dcterms:spatial in the issue name, I read your original comments and had it in my mind that you were commenting on our use of dcterms:coverage, not dcterms:spatial, and I was confused because I thought we had used dc:coverage, but did not take the time to investigate that doubt. If I had, I would have quickly clued in to the fact that I had misread your question entirely from the start. Clearly not enough coffee this morning.

So, in terms of our use of literals with dcterms:spatial, there's obviously no quick fix to contemplate. Most of my comments, I think, are still applicable, though.

When we modelled the attributes, our aim was to ensure that objects modelled in geoconcerns could easily be ingested into GeoBlacklight, and therefore that the geoblacklight schema was supported (which can be expressed in JSON, but is natively a set of solr index attributes, drawing on predicates from several namespaces, including dc: and dcterms:).

https://github.com/geoblacklight/geoblacklight-schema/blob/master/docs/geoblacklight-schema.markdown

dcterms:spatial was already in use in the schema, and so we included it as a property for our objects. We included dc:coverage later because we needed a different predicate to encode the bounding box. (We had the darndest time trying to find a simple predicate for bounding box in an actually published vocabulary (georss:bbox did not meet this criterion), before we realized DCMI had a string encoding scheme we could use.) Some GeoBlacklight adopters may be sharing their metadata in some encoding of the geoblacklight schema and using dcterms:spatial with literals, I am guessing.

(cf. https://github.com/OpenGeoMetadata/metadatarepository)

Not strictly correct, I agree, but it's hard to argue with someone who will call it a question of finding the shortest distance between two points to solve a problem, namely recording a simple attribute/value pair. It's perhaps akin to jaywalking.

In terms of looking for something else to avoid the range problem, if there are good options to look at, sure, but my sense is that good options are limited at this point, although I'd be glad to be proved wrong. Perhaps creating a new vocabulary of predicate terms is the answer, and having the flexibility to create one and host it in the Hydra neighbourhood is great (I've been following the discussion).

But there is going to be an ongoing need to record string values in metadata all over the place, so I don't see this problem going away any time soon. It's a tension between traditional metadata needs and the constraints of working with RDF vocabularies. Independently of my own thoughts on the matter (which are unresolved), if I were to bet money, I would bet that common usage of dcterms: with literals will be widespread until there is a convenient alternative. But I'm glad to know which way the MODS and RDF group is thinking, since you have thought more about the issue than we have.

I would be very interested in trading thoughts with you about the geographic elements of MODS, as it relates to work I am doing/will soon be doing at my own institution to plan for moving our MODS into RDF. I think you have been working with my colleague Bob Cole in your group, and he is part of our metadata team that is looking at this collectively. I've been tasked with looking at the aspects of MODS that are applicable to our geospatial resources.

When you ask about hierarchical place names, I think you must be thinking of the hierarchical place subjects in MODS, which I am familiar with.

It's an interesting question, because the logic of linked data would say: find a URI for the smallest place and derive relationships with larger territories via inferencing in the URI's home graph (e.g., geonames); i.e., don't try to store the hierarchy in the metadata. But I don't know that systems are yet at the state where that is practical. And that doesn't help with keyword searching or solr indexes. So I'm not sure what the answer is. I mean, if you want strings you can always use LCSH rules on geographic subdivision, but that's moving in the opposite direction of things like FAST. But if it's URIs, then I'm not sure how to represent hierarchy in place names, unless it's just an unordered list.
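As a rough illustration of the "store the smallest place, derive the rest" idea, here is a Python sketch that walks parent links to rebuild an ordered hierarchy for display or indexing. The `PLACES` dictionary is a hypothetical stand-in for whatever parent relationships the URI's home graph would supply, and the example.org URIs are invented; no real authority data or API is used.

```python
# Hypothetical place graph: each URI maps to (label, parent URI or None).
PLACES = {
    "http://example.org/places/paris": ("Paris", "http://example.org/places/france"),
    "http://example.org/places/france": ("France", "http://example.org/places/europe"),
    "http://example.org/places/europe": ("Europe", None),
}


def hierarchy_labels(uri, places=PLACES):
    """Return labels from the broadest place down to the given one."""
    labels = []
    seen = set()
    # Follow parent links upward; the seen set guards against cycles.
    while uri is not None and uri not in seen:
        seen.add(uri)
        label, parent = places[uri]
        labels.append(label)
        uri = parent
    labels.reverse()
    return labels


def index_string(uri, places=PLACES):
    """Flatten the hierarchy into one string, e.g. for a keyword index."""
    return " > ".join(hierarchy_labels(uri, places))


if __name__ == "__main__":
    print(hierarchy_labels("http://example.org/places/paris"))
    print(index_string("http://example.org/places/paris"))
```

Because the order is derived from the parent links, it no longer depends on the order in which the RDF store happens to return values, and the flattened string is one possible way to feed keyword search. Whether fetching those links from an external graph is practical at ingest or index time is the open question raised above.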

johnhuck commented 8 years ago

The WG discussed this issue at length. After investigating some other predicate options, and considering: the lack of consensus in the community on the importance of using only non-literal values with dcterms predicates (most of which suppose non-literal values), as exemplified by current practice for the GeoBlacklight schema and DCAT, to name two examples; the current state of support in Hydra/Fedora for managing URIs and strings; the belief that some data is better than none, in the context of the fact that metadata extracted from ISO, FGDC or other external metadata files will probably mainly consist of place name strings; the fact that geoconcerns does not prevent users from supplying a URI for the dcterms:spatial predicate; and the general desire to keep the metadata model aligned with the GeoBlacklight schema, the WG has decided to continue using dcterms:spatial to record place names, in whatever form the user supplies them, for the time being.

scande3 commented 8 years ago

@johnhuck - thanks for the update! Will close this issue then.

johnhuck commented 8 years ago

We appreciated the chance to consider the question and were glad that you raised it. It's something that I can imagine us returning to at a later date.