owlcs / owlapi

OWL API main repository

How to encode values for an IRI suffix? #475

Closed grossjonas closed 8 years ago

grossjonas commented 8 years ago

Hi,

I tried to put some values into the suffix of IRI.create(String prefix, String suffix) and get them back by calling getShortForm() or getFragment() on the returned instance. I see that many code points are checked by various XMLUtils methods, but escapeXML(CharSequence s) escapes only a few of them.

Is there a generic way to escape or mask characters such as whitespace and dots in the suffix?

ignazio1977 commented 8 years ago

You can use java.net.URLEncoder for that purpose.

grossjonas commented 8 years ago

Sorry, but that does not seem to work as expected. I'm from Germany and we have some special chars. I wrote a little test for that:

import org.junit.Test;
import org.semanticweb.owlapi.model.IRI;

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

public class TestIRI {
    @Test
    public void testIRIURLEncoding(){
        final String suffix = "fooßbar";

        final String encoding = StandardCharsets.UTF_8.name();

        String encoded = null;
        try {
            encoded = URLEncoder.encode(suffix, encoding);
        } catch (UnsupportedEncodingException e) {
            fail(e.getMessage());
        }

        IRI iri = IRI.create("http://www.w3.org/2002/07/owl#", encoded);

        String decoded = null;
        try {
            decoded = URLDecoder.decode(iri.getShortForm(), encoding);
        } catch (UnsupportedEncodingException e) {
            fail(e.getMessage());
        }

        assertEquals(suffix, decoded); // decoded: Fbar
    }
}

The prefix is a standard prefix taken from this spec.

I am coming to this as a Java programmer, so maybe I am missing some theoretical background.

Am I doing something wrong?

sesuncedu commented 8 years ago

IRIs shouldn't have to be %escaped; there may be a bug in IRI creation code that may improperly prohibit them from being so.

The IRI creation method restricts itself to XML-compatible names; I don't think this limitation is required for anything other than RDF/XML predicates.

So this may be a bug. On the other hand, I believe that http://www.w3.org/2002/07/owl#fooßbar is valid, since LATIN SMALL LETTER SHARP S (U+00DF) is an unreserved IRI character: it matches these production rules from RFC 3987:

iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar        = %xA0-D7FF / [...]

Since I believe %-escaped IRIs are not identical to the unescaped versions in RDF value-space, the unescaped approach is probably the easiest way to go.
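To make the argument about the production rules concrete, here is a minimal Java sketch that tests a code point against the iunreserved rule as quoted above. The class and method names are hypothetical, and only the first ucschar range (%xA0-D7FF) is checked, because the remaining ranges are elided in the quote.

```java
// Hypothetical helper, not part of the OWL API.
final class IriChars {

    // Only the first ucschar range quoted in the thread; the other ranges are elided there.
    static boolean isUcscharFirstRange(int cp) {
        return cp >= 0xA0 && cp <= 0xD7FF;
    }

    // iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
    // ALPHA and DIGIT are the ASCII letters and digits in ABNF.
    static boolean isIUnreserved(int cp) {
        boolean alpha = (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z');
        boolean digit = cp >= '0' && cp <= '9';
        boolean mark = cp == '-' || cp == '.' || cp == '_' || cp == '~';
        return alpha || digit || mark || isUcscharFirstRange(cp);
    }

    public static void main(String[] args) {
        System.out.println(isIUnreserved('ß')); // U+00DF falls inside %xA0-D7FF
        System.out.println(isIUnreserved(' ')); // a space matches none of the alternatives
    }
}
```

Under this reading, "ß" needs no escaping at all, while a space or "&" is not covered by iunreserved.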

For fixing the over-restrictive implementation of IRI, the easiest and best fix is to remove RDF/XML support (which has a beneficial side-effect of removing RDF/XML support).
The alternative is to drop all non-IRI-specific restrictions on code points, and let the RDF/XML rendering code generate exceptions for IRIs that are not supported.

ignazio1977 commented 8 years ago

fooßbar encoded would look something like foo%XXXbar - the new namespace split is then at XXXbar. This is expected and cannot be avoided - I thought the requirement was to be able to use the full IRI, even with spaces and other reserved characters. Encoding would allow that but not keep the same namespace/local name values. (Note that the local name/remainder/short form values are just shorthands for convenience of developers and to help readability a bit. In OWL, the IRI is always the full IRI, not just a part of it)

As @sesuncedu suggests, in this case it might not be necessary to do any encoding. Encoding is required only for reserved characters, like spaces.
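The effect of the split can be reproduced without the OWL API. The sketch below is a simplified stand-in, not the library's actual implementation: it assumes the local name is the longest trailing run of name characters that begins with a valid start character, and it approximates the NCName character classes with Character.isLetter and Character.isLetterOrDigit. The class and method names are hypothetical.

```java
import java.net.URLEncoder;

// Simplified stand-in for an NCName-based namespace/local-name split.
final class SplitSketch {

    // Rough approximation of NCName characters (assumption, not the XML spec's full table).
    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '-' || c == '.';
    }

    // Rough approximation of NCName start characters: no digits, '-' or '.'.
    static boolean isNameStartChar(char c) {
        return Character.isLetter(c) || c == '_';
    }

    static String localName(String iri) {
        int i = iri.length();
        // Walk left over the trailing run of name characters; '%' ends the run.
        while (i > 0 && isNameChar(iri.charAt(i - 1))) {
            i--;
        }
        // Skip leading characters that may not start a name, such as digits.
        while (i < iri.length() && !isNameStartChar(iri.charAt(i))) {
            i++;
        }
        return iri.substring(i);
    }

    public static void main(String[] args) throws Exception {
        String encoded = URLEncoder.encode("fooßbar", "UTF-8"); // foo%C3%9Fbar
        System.out.println(localName("http://www.w3.org/2002/07/owl#" + encoded)); // Fbar
    }
}
```

In UTF-8, "ß" becomes the two bytes C3 9F, so the encoded suffix is foo%C3%9Fbar. The last "%" ends the run of name characters, the digit "9" cannot start a name, and what remains is "Fbar", the value reported in the test above.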

grossjonas commented 8 years ago

@sesuncedu is right. The example string was chosen poorly (my last name contains a "ß", so I got used to using it as my default example). A better one would contain German cities like "Frankfurt am Main" or company names like "Schick, Neukum, Schmid, Lang Anwaltskanzlei" or "Miele & Cie. KG".

After reading through all of RFC 3987, I found only this:

   reserved       = gen-delims / sub-delims
   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

These seem to be the only reserved characters. So spaces should not be a problem either, and percent-encoding seems reasonable (which is mostly what java.net.URLEncoder does). But I did not find a suggested way of encoding these reserved characters in RFC 3987. The closest might be:

3.2.  Converting URIs to IRIs
[...]
   1.  Some percent-encodings are necessary to distinguish percent-
       encoded and unencoded uses of reserved characters.

So @ignazio1977's suggestion seems to be right.

But I can't use it because of the splitting explained in @ignazio1977's last comment.

So for now I could subclass IRI and write my own IRI.create(String prefix, String suffix), which percent-encodes only the reserved characters in the suffix.
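The encoding step of that idea might look like the following Java sketch. It is only an illustration: the class and method names are hypothetical, nothing here subclasses or calls the OWL API, and two characters beyond the reserved set are escaped as an assumption: "%" itself, so that the result stays unambiguous, and the space, which matches neither the reserved nor the iunreserved production.

```java
// Hypothetical helper; escapes only ASCII characters, so one byte per escape.
final class ReservedEscaper {

    // gen-delims and sub-delims from RFC 3987, plus '%' and ' ' (assumption, see above).
    private static final String TO_ESCAPE = ":/?#[]@" + "!$&'()*+,;=" + "% ";

    static String escapeReserved(String suffix) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < suffix.length(); i++) {
            char c = suffix.charAt(i);
            if (TO_ESCAPE.indexOf(c) >= 0) {
                // Two uppercase hex digits, e.g. '&' becomes %26.
                sb.append('%').append(String.format("%02X", (int) c));
            } else {
                // Everything else, including non-ASCII letters such as 'ß', is kept as-is.
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeReserved("Miele & Cie. KG")); // Miele%20%26%20Cie.%20KG
    }
}
```

An IRI built from such a suffix would still be subject to the namespace split described earlier in the thread, since "%" is not a valid name character.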

But what's the problem with RDF?

I took a quick look at the RDF 1.1 XML Syntax and RDF/XML Syntax Specification (Revised), but I could not find any mention of "%".

Am I looking at the wrong specification again?

ignazio1977 commented 8 years ago

The problem with RDF is that IRIs used for properties need to have a local name matching the NCName syntax: in RDF/XML, the local name is used as a tag name, with a prefix if needed:

<owl:NamedIndividual rdf:about="somethingsomething">
    <exampleProperty>asdf</exampleProperty>
</owl:NamedIndividual>

Since IRIs don't know whether they are IRIs for properties or for other entities, they all have to compute their remainder/local name in a way that's compatible with this requirement. It's not a great situation because the local name means different things in different contexts.

Regarding your issue, I think you might be better off using rdfs:label properties on your entities instead of storing human-readable names in the IRI; that would remove the need for you to do any encoding.