grossjonas closed 8 years ago
You can use java.net.URLEncoder for that purpose.
Sorry, but that does not seem to work as expected. I'm from Germany and we have some special chars. I wrote a little test for that:
import org.junit.Test;
import org.semanticweb.owlapi.model.IRI;

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

public class TestIRI {

    @Test
    public void testIRIURLEncoding() {
        final String suffix = "fooßbar";
        final String encoding = StandardCharsets.UTF_8.name();

        // Percent-encode the suffix before handing it to IRI.create
        String encoded = null;
        try {
            encoded = URLEncoder.encode(suffix, encoding);
        } catch (UnsupportedEncodingException e) {
            fail(e.getMessage());
        }

        IRI iri = IRI.create("http://www.w3.org/2002/07/owl#", encoded);

        // Read the short form back and decode it again
        String decoded = null;
        try {
            decoded = URLDecoder.decode(iri.getShortForm(), encoding);
        } catch (UnsupportedEncodingException e) {
            fail(e.getMessage());
        }

        assertEquals(suffix, decoded); // decoded: Fbar
    }
}
The prefix is a standard prefix taken from this spec.
I am coming to this as a Java programmer, so maybe I am missing some theoretical background.
Am I doing something wrong?
IRIs shouldn't have to be %escaped; there may be a bug in IRI creation code that may improperly prohibit them from being so.
The IRI creation method restricts itself to XML-compatible code points; I don't think this limitation is required for anything other than RDF/XML predicates.
So this may be a bug. On the other hand, I believe that http://www.w3.org/2002/07/owl#fooßbar is valid, since LATIN SMALL LETTER SHARP S (U+00DF) is an unreserved IRI character because it matches the production rules from RFC 3987:
iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar = %xA0-D7FF / [...]
Since I believe %-escaped IRIs are not identical to the unescaped versions in RDF value-space, the unescaped approach is probably the easiest way to go.
For fixing the over-restrictive implementation of IRI, the easiest and best fix is to remove RDF/XML support (which has a beneficial side-effect of removing RDF/XML support).
The alternative is to drop all non-IRI-specific restrictions on code points, and let the RDF/XML rendering code generate exceptions for IRIs that are not supported.
fooßbar encoded would look something like foo%XXXbar - the new namespace split is then at XXXbar. This is expected and cannot be avoided - I thought the requirement was to be able to use the full IRI, even with spaces and other reserved characters. Encoding would allow that but not keep the same namespace/local name values. (Note that the local name/remainder/short form values are just shorthands for convenience of developers and to help readability a bit. In OWL, the IRI is always the full IRI, not just a part of it)
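To make the effect concrete, here is a minimal sketch of what happens to the suffix from the test. The `encode` call uses the real `java.net.URLEncoder`; `trailingNameLike` is a hypothetical helper written only for this illustration. It is a simplified approximation of an NCName-style split (a name may not start with a digit or contain `%`), not the OWL API's actual code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PercentEncodedSuffixDemo {

    // Percent-encode with UTF-8, as in the test above.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Assumption: simplified name rules, for illustration only.
    static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_';
    }

    static boolean isNameChar(char c) {
        return isNameStart(c) || Character.isDigit(c) || c == '-' || c == '.';
    }

    // Hypothetical helper: longest trailing run of name characters
    // that also begins with a valid name-start character.
    static String trailingNameLike(String s) {
        int start = s.length();
        while (start > 0 && isNameChar(s.charAt(start - 1))) {
            start--;
        }
        while (start < s.length() && !isNameStart(s.charAt(start))) {
            start++;
        }
        return s.substring(start);
    }

    public static void main(String[] args) {
        String encoded = encode("foo\u00DFbar");        // "fooßbar"
        System.out.println(encoded);                    // foo%C3%9Fbar
        System.out.println(trailingNameLike(encoded));  // Fbar
    }
}
```

Under these assumed rules the `%` ends the name and the digit `9` cannot start one, which is consistent with the `Fbar` seen in the test.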
As @sesuncedu suggests, in this case it might not be necessary to do any encoding. Encoding is required only for reserved characters, like spaces.
@sesuncedu is right. The example string was chosen poorly (my last name contains a "ß", so I have become used to using it as my default example). A better one would contain German cities like "Frankfurt am Main" or company names like "Schick, Neukum, Schmid, Lang Anwaltskanzlei" or "Miele & Cie. KG".
After reading through all of RFC 3987, I only found this:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
These seem to be the only reserved characters. So spaces should not be a problem either, and percent-encoding seems reasonable (which is what java.net.URLEncoder mostly does).
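The "mostly" matters: `java.net.URLEncoder` implements HTML form encoding (`application/x-www-form-urlencoded`), so a space becomes `+` rather than `%20`. A small sketch using the example names from above (the class name `UrlEncoderBehaviour` is only for this illustration):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class UrlEncoderBehaviour {

    // Thin wrapper so callers do not have to handle the checked exception.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("Frankfurt am Main")); // Frankfurt+am+Main
        System.out.println(encode("Miele & Cie. KG"));   // Miele+%26+Cie.+KG
    }
}
```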
But I did not find a suggested way of encoding these reserved values in RFC 3987.
The closest might be:
3.2. Converting URIs to IRIs
[...]
1. Some percent-encodings are necessary to distinguish percent-encoded and unencoded uses of reserved characters.
So @ignazio1977's suggestion seems to be right.
But I can't use that because of the splitting explained in @ignazio1977's last comment.
So for now I could subclass IRI and write my own IRI.create(String prefix, String suffix), which percent-encodes only the reserved characters in the suffix.
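A minimal sketch of the encoding step of that idea, independent of the OWL API. The class and method names (`ReservedCharEncoder`, `encodeReserved`) are hypothetical. The reserved set is the gen-delims and sub-delims quoted above; additionally escaping the space and `%` itself is my own assumption, so that the result contains no raw spaces and can be decoded unambiguously.

```java
class ReservedCharEncoder {

    // gen-delims and sub-delims as quoted from the RFC above.
    private static final String RESERVED = ":/?#[]@!$&'()*+,;=";

    // Assumption: also escape '%' (keeps decoding unambiguous) and the space.
    private static final String EXTRA = "% ";

    // Percent-encode only the characters listed above; everything else,
    // including non-ASCII letters such as U+00DF, is copied unchanged.
    static String encodeReserved(String suffix) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < suffix.length(); ) {
            int cp = suffix.codePointAt(i);
            i += Character.charCount(cp);
            boolean escape = cp < 0x80
                    && (RESERVED.indexOf(cp) >= 0 || EXTRA.indexOf(cp) >= 0);
            if (escape) {
                // All escaped characters are ASCII, so one byte each.
                out.append('%').append(String.format("%02X", cp));
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeReserved("Frankfurt am Main")); // Frankfurt%20am%20Main
        System.out.println(encodeReserved("Miele & Cie. KG"));   // Miele%20%26%20Cie.%20KG
    }
}
```

This covers only the encoding. It does not change how the library splits an IRI into namespace and local name, so the short form of an IRI built from such a suffix may still differ from the suffix that was passed in.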
But what's the problem with RDF?
I took a quick look at the RDF 1.1 XML Syntax and RDF/XML Syntax Specification (Revised), but I could not find any mention of "%".
Am I looking at the wrong specification again?
The problem with RDF is that IRIs used for properties need to have a local name matching an NCName syntax - in RDF/XML, the local name is used as a tag name, with a prefix if needed:
<owl:NamedIndividual rdf:about="somethingsomething">
    <exampleProperty>asdf</exampleProperty>
</owl:NamedIndividual>
Since IRIs don't know whether they are IRIs for properties or for other entities, they all have to compute their remainder/local name in a way that's compatible with this requirement. It's not a great situation because the local name means different things in different contexts.
Regarding your issue, I think you might be better off using rdfs:label properties on your entities instead of storing human-readable names in the IRI. That would remove the need for you to do any encoding.
Hi,
I tried to put some values into the suffix of IRI.create(String prefix, String suffix) and get them back by calling getShortForm() or getFragment() on the returned instance. I see that many code points are checked by various XMLUtils methods, but escapeXML(CharSequence s) only escapes a few of them. Is there any generic way to escape or mask values like whitespace, dots and so on for the suffix?