gruninger / Common-Logic

Documents for the development of ISO 24707 Edition 2 (Common Logic)

Network identifiers of CL documents (and of texts inside them) #41

Open clange opened 11 years ago

clange commented 11 years ago

#40 introduced the notion of a CL document. Documents are commonly retrievable in a network (e.g. downloadable from an IRI if that IRI is a URL) and therefore, quite naturally, have network identifiers.

To avoid confusion and redundancy, we suggest that this network identifier defines the base IRI (in terms of RFC 3986/3987) of the CL document, and that all relative IRIs used for names inside this document are to be interpreted relative to that base. (Note that so far the "foo" in (cl-text foo) has not been interpretable as an IRI, as the base wasn't defined!)

Suppose there is a document http://example.org/repo/foo (the issue of whether this can be a file foo.clif is separate and purely technical; it doesn't have to be discussed here).

We propose:

If this document contains an unnamed text (as defined in #40), we implicitly assume that this text has a name: the network identifier, i.e. the base IRI, of the document. This still allows for purely unnamed texts: they can still be created on the fly, in memory, without the notion of an enclosing document.

If this document contains named texts, their names (and likewise any other network identifiers in the document) are to be resolved against the base IRI.

This is particularly compatible with the COLORE practice of having files named foo (actually foo.clif) that contain a single named text

(cl-text foo)

Taking "foo" as an IRI relative to the base, it refers to "the file foo in the current directory" (loosely speaking), whose only CL text is exactly this text.
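The resolution described above can be sketched with Python's standard `urllib.parse.urljoin`, which implements RFC 3986 reference resolution. This is only an illustration: the helper name `resolve_text_name` is made up for this example, and the base is the example document URL used in this issue.

```python
from urllib.parse import urljoin

# Retrieval URL of the example document from this issue, used as the base IRI.
DOCUMENT_IRI = "http://example.org/repo/foo"


def resolve_text_name(name: str, base: str = DOCUMENT_IRI) -> str:
    """Resolve a (possibly relative) text name against the document's base IRI."""
    return urljoin(base, name)


# The last path segment of the base is replaced by the relative name,
# so "foo" resolves to the document's own IRI.
print(resolve_text_name("foo"))  # http://example.org/repo/foo
print(resolve_text_name("bar"))  # http://example.org/repo/bar
```

Note that a second text named bar in the same document would resolve to a sibling IRI, not to anything inside the document at .../repo/foo.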

Beyond the COLORE practice, it is possible to have documents that contain multiple named texts, e.g.

(cl-text foo)
(cl-text bar)

This is not compatible with linked data practice, a practical way of distributing ontologies on the Internet, which recommends identifying things by those IRIs (i.e. URLs) from which a description of these things can be downloaded.

Therefore, in the case of multiple texts in a document, we suggest having them as fragments of that document, i.e. .../document#foo, .../document#bar, written as

(cl-text #foo)
(cl-text #bar)
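A small sketch of why fragment names fit the linked data convention, assuming a hypothetical document URL `http://example.org/repo/document` (the path is illustrative only) and made-up helper names: each text gets its own IRI, and stripping the fragment yields the URL to download.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical document URL, chosen only for illustration.
DOCUMENT_IRI = "http://example.org/repo/document"


def text_iri(name: str, base: str = DOCUMENT_IRI) -> str:
    """Resolve a fragment-style text name such as '#foo' against the base IRI."""
    return urljoin(base, name)


def retrieval_url(iri: str) -> str:
    """Strip the fragment to obtain the URL of the enclosing document."""
    return urldefrag(iri).url


foo = text_iri("#foo")
bar = text_iri("#bar")
print(foo)                 # http://example.org/repo/document#foo
print(bar)                 # http://example.org/repo/document#bar
print(retrieval_url(foo))  # http://example.org/repo/document
```

Both texts have distinct identifiers, yet both lead back to the same downloadable document.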

Once we get to namespace prefixes (quite soon, as a new issue), it will be possible to abbreviate this and get rid of writing the hash.

greenTara commented 11 years ago

I'm not convinced it is a good idea to take something that is arguably [1] a recommended practice (reference please) in the world of web ontologies and make it normative in Common Logic. In general, I think CL should be less restrictive than less expressive languages, not more so. I think it would be fine to have an optional syntax for defining the base IRI, and in the case that this information is absent, to have the default base IRI be the URL from which the document was obtained, this being most consistent with RFC 3986/3987.
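The fallback rule suggested here could look roughly like the following sketch. No syntax for declaring a base IRI exists in CL so far, so `declared_base` merely stands in for whatever such an optional declaration would provide; the function names and the mirror URL are likewise assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urljoin


def effective_base(retrieval_url: str, declared_base: Optional[str] = None) -> str:
    """Use the declared base IRI if one is given, else the retrieval URL."""
    return declared_base if declared_base is not None else retrieval_url


def resolve(name: str, retrieval_url: str, declared_base: Optional[str] = None) -> str:
    """Resolve a name against the effective base IRI."""
    return urljoin(effective_base(retrieval_url, declared_base), name)


# No declaration: the retrieval URL is the base.
print(resolve("foo", "http://example.org/repo/foo"))
# http://example.org/repo/foo

# Hypothetical mirror copy that declares its original base explicitly.
print(resolve("foo", "http://mirror.example.net/copy/foo", "http://example.org/repo/"))
# http://example.org/repo/foo
```

With an explicit declaration, names stay stable even when the document is fetched from a different location.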

I agree that it would be better to recommend using a prefixing mechanism to abbreviate IRIs rather than relative IRI references, and that would avoid the issue of the base IRI altogether.

[1] It was my understanding that the reason there is a convention in the linked data world for assigning some significance to the retrieval URL is due to the lack of a named graph syntax, not to this being an inherently good idea.

greenTara commented 11 years ago

Another reason to avoid relative IRI references is the potential for confusion with local, non-IRI names. It is allowed to use IRI references as terms, functions and relations. So

("http://example.org/a" "http://example.org/b")

is a legal CLIF sentence. Then if the retrieval IRI is "http://example.org/myclif.txt", this sentence could be represented with relative IRI references as

("a" "b")

How would a parser determine the difference between such a sentence and one that was intended to actually use non-IRI names?
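To make the ambiguity concrete, here is a short sketch using Python's `urllib.parse.urljoin` and the retrieval IRI from the example above. The variable names are made up for this illustration.

```python
from urllib.parse import urljoin

# Retrieval IRI from the example above.
RETRIEVAL_IRI = "http://example.org/myclif.txt"

absolute_names = ("http://example.org/a", "http://example.org/b")
short_names = ("a", "b")

# If the short names are read as relative IRI references, they resolve
# to exactly the absolute names.
resolved = tuple(urljoin(RETRIEVAL_IRI, name) for name in short_names)
print(resolved == absolute_names)  # True

# Nothing in the string "a" itself says whether it is a relative IRI
# reference or a plain local name: it is a valid instance of both.
```

The parser would need out-of-band information to decide which reading was intended.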

RDFa 1.1 http://www.w3.org/TR/2010/WD-rdfa-core-20100422/ has solved this issue nicely with CURIE syntax.
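A rough sketch of CURIE-style expansion, to show how it separates abbreviated IRIs from local names. The prefix `ex` and the function `expand_name` are assumptions for this example; this is not a complete implementation of the CURIE rules in RDFa Core.

```python
from typing import Dict

# Hypothetical prefix declaration, for illustration only.
PREFIXES: Dict[str, str] = {"ex": "http://example.org/"}


def expand_name(name: str, prefixes: Dict[str, str] = PREFIXES) -> str:
    """Expand 'prefix:reference' if the prefix is declared; otherwise leave the name as is."""
    prefix, sep, reference = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + reference
    return name


print(expand_name("ex:a"))                  # http://example.org/a
print(expand_name("a"))                     # a  (stays a local, non-IRI name)
print(expand_name("http://example.org/b"))  # http://example.org/b  ("http" is not a declared prefix)
```

Because an abbreviation always carries a declared prefix, a bare name like "a" is never mistaken for an IRI.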

clange commented 11 years ago

Tara, a combined reply to your two comments:

> I'm not convinced it is a good idea to take something that is arguably [1] a recommended practice (reference please) in the world of web ontologies and make it normative in Common Logic.

I agree. (Of course we should see what best practices we can promote in a non-normative way for CL. If we want CL to be used over networks in practice, adopting something like linked data may make sense.)

> I think it would be fine to have an optional syntax for defining the base IRI, and in the case that this information is absent, to have the default base IRI be the URL from which the document was obtained, this being most consistent with RFC 3986/3987.

This is, at least, consistent with what I said above about the COLORE case: documents with URLs like http://example.org/repo/foo containing texts named (cl-text foo).

So let's just discard my initial proposal to give implicit network identifiers to unnamed texts. It is probably more in line with the CL philosophy to blame unnamed texts on the author, or to assume that the author had a good reason for not naming the text. – However, what would that mean for importing? If the file at http://example.org/repo/foo consists of an unnamed text, what does (cl-imports http://example.org/repo/foo) mean?

> I agree that it would be better to recommend using a prefixing mechanism to abbreviate IRIs rather than relative IRI references, and that would avoid the issue of the base IRI altogether.

I agree that this design consideration is up to us. If the design goal is to just somehow allow for long IRIs to be abbreviated, then prefixes will suffice, and we don't need base IRIs.

> [1] It was my understanding that the reason there is a convention in the linked data world for assigning some significance to the retrieval URL is due to the lack of a named graph syntax, not to this being an inherently good idea.

I'd say it is inherently a good idea, but only when you assume a level of simplicity of implementation that is far too specific for CL in general. There do exist non-standard (i.e. non-W3C) named graph syntaxes, which are widely in use in practice (e.g. TriG). But what makes the above-mentioned linked data convention so nice is that when you talk about something that has an identifier you don't need any additional lookup tables or magic knowledge for retrieving a description of that thing. Instead you simply download it from its URL. Of course one can consider this a poor man's replacement for named graphs (actually not so "poor man": one needs to have control of a domain and a webserver installed there!) – but it is also good design in itself, and it is actually the essence of what Tim Berners-Lee calls "linked data".

In CL implementations, by contrast, I wonder how we'd keep track of the mapping between identifiers of texts and the URLs from which they can actually be downloaded. (So that's why I say we should probably encourage a linked data best practice.)

> Another reason to avoid relative IRI references is the potential for confusion with local, non-IRI names. It is allowed to use IRI references as terms, functions and relations. So

> ("http://example.org/a" "http://example.org/b")

> is a legal CLIF sentence. Then if the retrieval IRI is "http://example.org/myclif.txt", this sentence could be represented with relative IRI references as

> ("a" "b")

> How would a parser determine the difference between such a sentence and one that was intended to actually use non-IRI names?

One answer to this could be that we don't want non-IRI names. (Well, at least that we don't want non-IRI identifiers.) I recall we've had this discussion earlier, and I don't recall the outcome, but we could acknowledge that the "IRI network" has proven sufficiently general and sustainable that we can assume that any network in the CL sense will use IRIs. (Again this should be a separate ticket if we'd like to go that way.)

> RDFa 1.1 http://www.w3.org/TR/2010/WD-rdfa-core-20100422/ has solved this issue nicely with CURIE syntax.

BTW this is now a recommendation and available at http://www.w3.org/TR/rdfa-core/.

greenTara commented 11 years ago

To summarize the situation about named and unnamed texts (ignoring modules for the moment): unnamed texts have a very important role in CL, as they are the standard way to actually assert things in CL. A named text doesn't assert anything except about the denotation of its name. Only when a named text is imported into an unnamed text does the body of the text get asserted.

Modules mash all this behavior up together: the text gets a name, and the text gets asserted, and the text can be imported, plus there are fishy things that go on about the vocabulary and domain of quantification. I hope that the semantics of modules gets straightened out and that they get used more often, such as in COLORE. All the issues about whether having the names of texts in the domain of discourse messes up unrestricted quantification, as in "There are 6 things in the universe", could be solved if modules were set up to define this universe and shield it from the effects of adding other texts to the collection.

As to keeping track of identifiers and URLs, I don't know about CLIF, but there are standard XML methods for doing this (XML Catalog https://www.oasis-open.org/committees/entity/spec-2001-08-06.html). A particularly nice feature is caching, which many real-world applications are likely to implement. The URL where the document was originally published becomes practically irrelevant once the IRI (not necessarily http scheme) is associated with a local copy. These are pragmatics which fortunately we don't have to go into in the CL standard.
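The idea behind such a catalog can be sketched as a plain lookup table from identifiers to local copies. This is not the XML Catalog format or API; the identifiers, file paths and the function `locate` are all hypothetical.

```python
from typing import Dict

# Hypothetical catalog: text identifier (IRI) -> location of a cached local copy.
CATALOG: Dict[str, str] = {
    "http://example.org/repo/foo": "/var/cache/cl/foo.clif",
    "urn:example:cl:bar": "/var/cache/cl/bar.clif",  # not an http IRI
}


def locate(iri: str, catalog: Dict[str, str] = CATALOG) -> str:
    """Return the cached location for an identifier, or the identifier itself if none is known."""
    return catalog.get(iri, iri)


print(locate("http://example.org/repo/foo"))  # /var/cache/cl/foo.clif
print(locate("urn:example:cl:bar"))           # /var/cache/cl/bar.clif
print(locate("http://example.org/repo/baz"))  # http://example.org/repo/baz
```

Once an entry exists, the identifier no longer needs to be dereferenceable at all, which is why non-http schemes work just as well.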