tdwg / vocab

Vocabulary Maintenance Specification Task Group + SDS + VMS
Rules about IRI construction #33

Closed by baskaufs 8 years ago

baskaufs commented 8 years ago

What rules, if any, about IRI construction should be specified in the Documentation Spec? There are some conventions that have been followed with AC and DwC, such as using http://rs.tdwg.org/ as the root for namespaces. However, what if TDWG were to adopt other terms that are already in use, such as http://rs.gbif.org/vocabulary/gbif/establishment_means/native, http://purl.org/dsw/hasOccurrence, or http://purl.obolibrary.org/obo/BCO_0000071? When I say "adopt", I don't mean borrow as we do Dublin Core terms, but rather take responsibility for those terms as part of a TDWG-defined vocabulary.

Also, current AC and DwC terms follow the "slash URI" pattern rather than the "hash URI" pattern. Does it matter what future vocabularies use, or should the specification be silent on this? See section 3.3.3.1 of the documentation specification.

ramorrismorris commented 8 years ago

FWIW, I probably regret that we didn't use the hash pattern for AC.

From my mobile phone, in reply to Steve Baskauf's comment of Apr 16, 2016, 12:36 PM.

tucotuco commented 8 years ago

@ramorrismorris I am curious why you "probably regret" it. If there is no compelling reason that one choice is better than the other, we should probably remain silent. If there is a compelling reason to do it the way we didn't for DwC and AC, we'll have to add replacement terms for all of the existing ones when the spec gets adopted. That's only busy work for DwC, so not a particular obstacle.

ramorrismorris commented 8 years ago

I'm not sure how compelling it is. What I had in mind was that if enough anchors are set in an HTML rendering of the term list, then the default link to http://# in an application has an easy shot at landing in the right place in a browser viewing the HTML of the term list. HashVsSlash seems to lay out the pros and cons, and we might discuss whether it would serve as the basis for a best practice.

baskaufs commented 8 years ago

It seems to me that the arguments I've seen about hash vs. slash have to do with server efficiency and whether people are going to be hand-writing small, single documents or not. On the first point, I really can't imagine that any TDWG server is going to be hit frequently by machine clients trying to "discover" the vocabulary. Realistically, most software that uses AC or DwC already "knows" about it and won't be dereferencing the URIs repeatedly. The other thing is that, at least with Darwin Core, I think we're already on our way beyond the situation of hand-writing single documents that will be served when URIs get dereferenced. If we fully implement the hierarchy and version model that's in the draft, it seems likely to me that metadata served upon URI dereferencing will be assembled into documents from a triplestore on the fly. It's just going to be too complicated to hand-build all the RDF documents for all versions and possible serializations (Turtle, XML, JSON, etc.) that people are going to want. In that case, slash is just as good as hash: the server is going to dissect the URI and build the response programmatically anyway.

I was reviewing the "namespace" part of the DwC Namespace policy document http://rs.tdwg.org/dwc/terms/namespace/ and saw that we already violated it with the dwciri: IRIs, which use http://rs.tdwg.org/dwc/iri/ rather than http://rs.tdwg.org/dwc/terms/. I think we should just get rid of specifying namespaces that must be used. We're already "importing" a lot of terms into AC. Plus, as I said in the original comment, it's possible that TDWG vocabularies may come to include preexisting URIs and it would be better to just re-use them than to change them and break applications.

Let the namespace and identifier design be discussed as part of the term adoption process and do whatever makes sense at that time rather than specifying it in the Documentation spec. Just my opinion, though...

jar398 commented 8 years ago

On Sun, May 1, 2016 at 11:33 PM, Steve Baskauf notifications@github.com wrote: "In that case, slash is just as good as hash - the server is going to dissect the URI and build the response based on programming anyway."

??? If you use a hash, the server won't know what the full URI is. The part after the # is known only to the client; it is not sent in the HTTP request.
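A minimal sketch of this point, using only Python's standard `urllib.parse`. The `example.org` IRIs and the helper name `request_target` are made up for illustration; the sketch just separates what a client would put in the HTTP request from the fragment it keeps to itself.

```python
from typing import Tuple
from urllib.parse import urldefrag, urlsplit


def request_target(iri: str) -> Tuple[str, str, str]:
    """Return (host, path sent to the server, fragment kept by the client)."""
    # urldefrag splits off everything after '#'; that part never goes on the wire.
    without_fragment, fragment = urldefrag(iri)
    parts = urlsplit(without_fragment)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return parts.netloc, path, fragment


# Hypothetical hash-style and slash-style IRIs for the same term.
print(request_target("http://example.org/terms#recordedBy"))
print(request_target("http://example.org/terms/recordedBy"))
```

With the hash form the server only ever sees the request for `/terms`, so it has to return the whole document; with the slash form the term name is part of the path the server receives.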

baskaufs commented 8 years ago

I guess what I should have said is that I think slash is better than hash if you want to have the option for the response documents to be generated dynamically. One would have the option to implement some kind of URL re-writing rule that would deliver static documents (which I think is the case now with Darwin Core), but one could also implement some kind of model-view-controller system where the slash URI is dissected by the controller, and view software generates the document dynamically based on the model software. From what Jonathan says, it doesn't sound like that would work with hash URIs. But I'm not really qualified to be talking about this, so maybe somebody else can weigh in.
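A rough sketch of the model-view-controller idea, not a description of how any TDWG server works. Everything here is an assumption made for illustration: the `/vocab/terms/` path, the `example.org` base, the in-memory `MODEL` (a stand-in for a triplestore), and the placeholder label and definition.

```python
import json

# Hypothetical "model": a real system might query a triplestore instead.
MODEL = {
    "recordedBy": {
        "label": "Recorded By",
        "definition": "Placeholder definition for illustration only.",
    },
}

BASE = "http://example.org"       # assumed host
NAMESPACE_PATH = "/vocab/terms/"  # assumed namespace path
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def render_turtle(local_name, record):
    """View: one label triple in Turtle (json.dumps is a simplified way to quote the literal)."""
    subject = BASE + NAMESPACE_PATH + local_name
    return "<%s> <%s> %s .\n" % (subject, RDFS_LABEL, json.dumps(record["label"]))


def render_json(local_name, record):
    """View: the same record as plain JSON."""
    return json.dumps(dict(record, term=local_name))


VIEWS = {"text/turtle": render_turtle, "application/json": render_json}


def controller(path, accept="text/turtle"):
    """Dissect the slash URI path, look up the model, and pick a view."""
    if not path.startswith(NAMESPACE_PATH):
        return 404, "", ""
    local_name = path[len(NAMESPACE_PATH):]
    record = MODEL.get(local_name)
    if record is None:
        return 404, "", ""
    view = VIEWS.get(accept)
    if view is None:
        return 406, "", ""
    return 200, accept, view(local_name, record)


print(controller("/vocab/terms/recordedBy", "text/turtle"))
```

The controller can only do this because the term's local name arrives in the request path; with a hash URI it would see just the namespace document's path.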

jar398 commented 8 years ago

OBO uses slashes for pretty much the reason you give. Some of their ontologies are enormous and if you don't know the fragment id there's really no choice but to return the whole ontology. I think their implementation is clever: the response for GET of a single ontology term gives you just the definition (documentation, examples, annotations) of that one term, but there's an owl:imports for the whole ontology, so you (or an OWL processor) can get the whole context if needed.

There is really no good answer however; semantic web architecture is duct tape and coat hangers. Slash URIs are associated with the annoying and pointless 303 redirect, so you have to either conform to the 303 advice, or rebel and say you don't get it and aren't going to bother with something you don't get.

In reply to Steve Baskauf's comment of Tue, May 3, 2016 at 11:28 AM.
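For readers who have not met it, here is a small sketch of what conforming to the 303 pattern looks like on the server side. The `/terms/` and `/doc/terms/` paths and the `KNOWN_TERMS` set are hypothetical; the only fixed points are the HTTP status code 303 (See Other) and its `Location` header.

```python
TERM_PREFIX = "/terms/"     # assumed path for term IRIs (the things being described)
DOC_PREFIX = "/doc/terms/"  # assumed path for the documents that describe them
KNOWN_TERMS = {"recordedBy"}


def respond(path):
    """Return (status, headers) for a GET request under the 303 pattern."""
    if path.startswith(DOC_PREFIX) and path[len(DOC_PREFIX):] in KNOWN_TERMS:
        # A document about the term: serve it directly.
        return 200, {"Content-Type": "text/turtle"}
    if path.startswith(TERM_PREFIX) and path[len(TERM_PREFIX):] in KNOWN_TERMS:
        # The term itself: redirect to the document that describes it.
        return 303, {"Location": DOC_PREFIX + path[len(TERM_PREFIX):]}
    return 404, {}


print(respond("/terms/recordedBy"))
print(respond("/doc/terms/recordedBy"))
```

"Rebelling" in this sketch would simply mean answering the first request with 200 and the document, skipping the extra round trip.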

ansell commented 8 years ago

Many have rebelled against the pointless 303 redirect on functional grounds: they see no need to include document-level metadata inside the documents it refers to, and they don't see why it isn't turtles all the way down (i.e., how do you encode metadata about metadata such as caching directives, when caching directives cannot modify metadata without invalidating the metadata version they just modified?). I am fairly new to biodiversity RDF, but as @jar398 will know, I have been around RDF in the biomedical semantic web for a while and don't really wish to continue the ongoing 303 debacle.

On point though... I recently started at ALA (Atlas of Living Australia), where they are not currently using RDF, but rather DwC terms in CSV with a single row per record.

I love IRIs and RDF, but there isn't always agreement on using either of them, mostly due to their verbosity. Ideally both of those technologies should have a technical solution that hides the verbosity, but more often than not systems compromise with a social solution: chopping the IRI up and keeping just the localName. In the case of ALA, this social solution has resulted in the internal systems solely supporting DWCTerms for all recognised field names, because the rest of the IRI has been removed internally for each term.

In some cases I still prefer to use an RDF version of DWCTerms and deconstruct the IRIs to get the localName parts that ALA uses for field names. Ideally, however, I would prefer not to need to deconstruct IRIs to get this information; I could replace that hack with a predicate in the RDF vocabulary containing the recommended field name to use for a DWCTerm in a CSV document.
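The IRI-chopping hack described above can be sketched in a few lines. This is only an illustration, not ALA's code: `split_iri` and `field_name` are made-up helpers, and the `RECOMMENDED_FIELD_NAMES` table stands in for the wished-for predicate, which does not exist in the vocabulary as far as this thread says.

```python
def split_iri(iri):
    """Split an IRI into (namespace, localName) at the last '#', else the last '/'."""
    separator = "#" if "#" in iri else "/"
    namespace, _, local_name = iri.rpartition(separator)
    return namespace + separator, local_name


# Hypothetical stand-in for a "recommended CSV field name" predicate.
RECOMMENDED_FIELD_NAMES = {
    "http://purl.obolibrary.org/obo/BCO_0000071": "exampleFieldName",
}


def field_name(iri):
    """Prefer an explicitly recommended name; fall back to chopping the IRI."""
    return RECOMMENDED_FIELD_NAMES.get(iri, split_iri(iri)[1])


print(split_iri("http://rs.tdwg.org/dwc/terms/recordedBy"))
print(split_iri("http://purl.obolibrary.org/obo/BCO_0000071"))
print(field_name("http://purl.obolibrary.org/obo/BCO_0000071"))
```

The OBO example shows why chopping alone is fragile: a localName like `BCO_0000071` makes a poor column heading, which is where an explicit field-name property would help.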

jar398 commented 8 years ago

I agree, there's no reason to drag the horrible 15-year-old unresolvable semweb-hash-303 agony into TDWG. I have complete confidence in Steve and the others involved to ask their own questions and make good decisions on this matter, and I hope they don't waste time doing much research into it. I hope my role here will be to provide information that they / we actually need, although I know it's hard for me not to spout off pointlessly.

CSV is interesting. If we want to talk further, e.g. about what column headings to use, or how/when to use CSVW vs. DwCA vs etc, maybe we should find another forum (tdwg-content)?

In reply to Peter Ansell's comment of Tue, May 3, 2016 at 8:05 PM.

ansell commented 8 years ago

Yes, CSVW versus/with DwCA is a discussion for another forum. I haven't had a chance to look through the CSVW recommendations yet, so I will go through them first before starting or continuing discussion in that area.

baskaufs commented 8 years ago

Based on discussion here and in the 2015-05-04 call, I've reworded section 3.3.3 to make it agnostic about the form of IRIs and removed notes about IRI form from section 3.3.3.1.