joshsh / ripple

Semantic Web scripting language

LinkedDataSail does not dereference IRIs #57

Closed aolieman closed 10 years ago

aolieman commented 10 years ago

Hi Josh,

Many of the non-English DBpedias, and likely other LD publishers, use IRIs instead of URIs. As I understand it, they implement RFC 3987 correctly by serving these resources at URIs that are the percent-encoded forms of the IRIs. Their triples, however, use the unencoded IRIs.

We can only dereference IRIs with LinkedDataSail by percent-escaping them ourselves. But since the triples in the response use IRIs, a workaround is needed to access them (at least in Gremlin). From the perspective of someone using LinkedDataSail in Gremlin, the issue could be solved by percent-encoding the IRI internally, solely in order to dereference it, while still using the IRI as the vertex id. By the way, only non-ASCII characters should be encoded, so an apostrophe ' should not become %27.
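To make the proposed mapping concrete, here is a minimal Python sketch of it. It is not the LinkedDataSail code; the function name iri_to_uri is made up for illustration, and it assumes that only the path part contains non-ASCII characters (internationalized host names would need separate handling).

```python
from urllib.parse import quote


def iri_to_uri(iri: str) -> str:
    """Percent-encode only the non-ASCII characters of an IRI.

    Each non-ASCII character is replaced by the percent-encoded form of
    its UTF-8 bytes; ASCII characters (including ' and existing %XX
    escapes) are passed through unchanged.
    """
    return "".join(
        ch if ord(ch) < 128 else quote(ch, safe="")
        for ch in iri
    )


# The IRI stays the identifier; the URI is used only for the HTTP request.
iri = "http://nl.dbpedia.org/resource/Mauritani\u00EB"
print(iri_to_uri(iri))  # http://nl.dbpedia.org/resource/Mauritani%C3%AB
```

With this approach the vertex id would remain the IRI, and the encoded URI would only ever be seen by the HTTP client.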

Hopefully this example in Gremlin illustrates the problem:

gremlin> v = g.v("http://nl.dbpedia.org/resource/Mauritani\u00EB")
==>v[http://nl.dbpedia.org/resource/Mauritanië]
gremlin> v.outE // there is one edge in the cache
==>e[dbp-nl:Mauritanië - dcterms:subject -> dbp-nl:Categorie:Land]

// by mapping the IRI to its URI, it dereferences
gremlin> v = g.v("http://nl.dbpedia.org/resource/Mauritani%C3%AB")
==>v[http://nl.dbpedia.org/resource/Mauritani%C3%AB]
gremlin> v.outE // doesn't return any edges

// so now there are many more outE for Mauritanië
gremlin> v = g.v("http://nl.dbpedia.org/resource/Mauritani\u00EB")
==>v[http://nl.dbpedia.org/resource/Mauritanië]
gremlin> v.outE
==>e[dbp-nl:Mauritanië - owl:sameAs -> dbp:Mauritania]
==>e[dbp-nl:Mauritanië - owl:sameAs -> dbp-nl:Mauritanië]
==>e[dbp-nl:Mauritanië - owl:sameAs -> http://openei.org/resources/Mauritania]
  [ . . . ]
==>e[dbp-nl:Mauritanië - prop-nl:talen -> dbp-nl:Arabisch]
==>e[dbp-nl:Mauritanië - prop-nl:talen -> dbp-nl:Frans]
==>e[dbp-nl:Mauritanië - prop-nl:religie -> dbp-nl:Islam]
==>e[dbp-nl:Mauritanië - prop-nl:religie -> dbp-nl:Christendom]
==>e[dbp-nl:Mauritanië - prop-nl:tijdzone -> "+0"@nl]
==>e[dbp-nl:Mauritanië - prop-nl:feestdag -> "--11-28"^^]
==>e[dbp-nl:Mauritanië - prop-nl:landcode -> "MRT"@nl]
==>e[dbp-nl:Mauritanië - prop-nl:ciakaart -> "Mauritanie_carte.gif"@nl]
==>e[dbp-nl:Mauritanië - dbpedia-owl:code -> "MRT"@nl]

Cheers, Alex

joshsh commented 10 years ago

Thanks for this detailed issue report, Alex. I am reading (e.g. http://www.websci11.org/fileadmin/websci/Posters/98_paper.pdf) and looking into a solution in HTTPURIDereferencer.

aolieman commented 10 years ago

Thanks for expanding the dereferencer functionality to IRIs, Josh :+1: !

What I'm wondering now is how I can use the updated version in Rexster. In my naïveté, I tried a mvn clean install in the Rexster parent dir, but the output shows that ripple linkeddatasail is not updated and version 1.0 is kept. Would you know of a way I can use the new LinkedDataSail, or should I wait for a new Rexster snapshot?

joshsh commented 10 years ago

TinkerPop 2.5.0-SNAPSHOT depends on Ripple 1.1, but since that release was fairly recent and there are some major changes underway in 1.2-SNAPSHOT, it will be a little while before the dereferencer change makes it into TinkerPop. If you can't wait, you can always tweak ripple.version in the blueprints pom.xml after building Ripple locally. Thanks again for helping to improve LDS.

aolieman commented 10 years ago

In theory I could have waited, but I was way too excited to try this in Rexster. After building Ripple locally and changing Blueprints' pom.xml, I had to rebuild Blueprints and Rexster to get everything working.

One thing that surprised me, though, was that I could now dereference IRIs through the Rexster CLI and the Doghouse, but all my attempts to do the same through Python failed. It turned out to be quite a simple problem: the Unicode IRIs I was calling in my script needed to be encoded in Windows-1252 to work. Not a problem per se, but I found it confusing because the output is in UTF-8 (as I would expect).
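For anyone who runs into the same thing, this small Python sketch shows how differently the two encodings represent the same character. It only illustrates the byte-level difference; it does not reproduce the Rexster setup, and the variable names are just for illustration.

```python
from urllib.parse import quote

# Last path segment of the IRI used in the examples above.
iri_tail = "Mauritani\u00EB"

# The same character "ë" becomes two bytes in UTF-8 but one in Windows-1252.
utf8_bytes = iri_tail.encode("utf-8")
cp1252_bytes = iri_tail.encode("cp1252")
print(utf8_bytes)    # b'Mauritani\xc3\xab'
print(cp1252_bytes)  # b'Mauritani\xeb'

# Percent-encoded, the two forms are different strings as well.
utf8_escaped = quote(iri_tail, encoding="utf-8")
cp1252_escaped = quote(iri_tail, encoding="cp1252")
print(utf8_escaped)    # Mauritani%C3%AB
print(cp1252_escaped)  # Mauritani%EB
```

So a client that sends one form while the server expects the other would be asking for a different resource name, which would fit the failures I saw.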