UTF8 issues arise from serializing to nquads, ntriples, but not in rdf/xml, turtle

cboettig commented 6 years ago

UTF-8 characters are often mangled by redland functions:

library(redland)
world <- new("World")
storage <- new("Storage", world, "hashes", name="", options="hash-type='memory'")
model <- new("Model", world, storage, options="")

stmt <- new("Statement", 
            world = world,  
            subject="", 
            predicate="http://schema.org/name", 
            object="Maëlle Salmon")
addStatement(model, stmt)

SPARQL queries also mangle UTF-8:

query <- 'SELECT ?o WHERE { ?s ?p ?o }'
queryObj <- new("Query", world, query)
queryResult <- executeQuery(queryObj, model)
r <- getNextResult(queryResult)

gives: "Ma\u00EBlle"

The nquads and ntriples serializers also fail to preserve UTF-8: I get "Ma\u00EBlle" rather than "Maëlle". Try:

serializer <- new("Serializer", world, name = "nquads", mimeType = "text/x-nquads")
redland::serializeToFile(serializer, world, model, "test.rdf")
cat(readLines("test.rdf"))

serializer <- new("Serializer", world, name = "ntriples", mimeType = "application/n-triples")
redland::serializeToFile(serializer, world, model, "test.rdf")
cat(readLines("test.rdf"))

rdfxml and turtle look okay:

serializer <- new("Serializer", world, name = "turtle", mimeType = "text/turtle")
redland::serializeToFile(serializer, world, model, "test.rdf")
cat(readLines("test.rdf"))

serializer <- new("Serializer", world)
redland::serializeToFile(serializer, world, model, "test.rdf")
cat(readLines("test.rdf"))
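To compare serializers without reading the output by eye, the lines read back from each file can be scanned for escape sequences. This is a minimal sketch in base R: `has_unicode_escapes` is a hypothetical helper, not part of redland, and the two sample lines are hand-written stand-ins for serializer output rather than captured results.

```r
# Hypothetical helper (not a redland function): TRUE if any line contains
# an ASCII escape of the form \uXXXX or \UXXXXXXXX.
has_unicode_escapes <- function(lines) {
  any(grepl("\\\\u[0-9A-Fa-f]{4}|\\\\U[0-9A-Fa-f]{8}", lines))
}

# Hand-written sample lines (assumed shapes, not real serializer output).
escaped <- '_:b0 <http://schema.org/name> "Ma\\u00EBlle Salmon" .'
literal <- '_:b0 <http://schema.org/name> "Ma\u00EBlle Salmon" .'

has_unicode_escapes(escaped)  # backslash-u sequence present
has_unicode_escapes(literal)  # real character, no escape
```

In practice the helper would be applied to `readLines("test.rdf")` after each call to `serializeToFile()`.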

gothub commented 6 years ago

@cboettig This redland library doc page states that UTF-8 is the native internal string format, so the problem must be somewhere in the conversion in the SWIG code. That could explain why some code paths preserve the encoding while others don't. I don't yet see any arguments or global parameters that can be used to set the input/output encoding (to the redland libraries), but I will keep looking. The SWIG + redland C code is deep!

gothub commented 6 years ago

@cboettig The string Ma\u00EBlle contains the characters \u00EB, which is the Unicode escape sequence for Latin Small Letter E with diaeresis (see https://en.wikipedia.org/wiki/List_of_Unicode_characters).

The value is correct; it's just represented incorrectly. It looks like the redland C library returned the ASCII representation of the Unicode escape, i.e. the six characters '\u00EB', instead of the Unicode character itself.
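The difference between the two representations can be seen in base R. This is only an illustrative sketch: the strings are typed by hand and are not taken from redland output.

```r
# The escaped form: six ASCII characters (backslash, u, 0, 0, E, B).
escaped <- "\\u00EB"
# The actual character: one code point, U+00EB.
actual <- "\u00EB"

nchar(escaped)     # number of characters in the escaped form
nchar(actual)      # number of characters in the real string
utf8ToInt(actual)  # code point as an integer (0xEB)
```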

I have looked at the redland C code and the SWIG code and can't determine what is causing this. One possible solution would be to post-process the result set and re-evaluate the string so that Unicode escapes are correctly interpreted, although I haven't found a way that works yet. I've tried scan() and eval() (the latter is dangerous to use, and it still didn't work).

Any ideas how to proceed with this?

cboettig commented 6 years ago

Thanks for taking a look at this. I've found that just wrapping the result in stringi::stri_unescape_unicode() restores the rendering as Unicode rather than as an ASCII-escaped literal. I'll close this, as I guess that's an acceptable workaround.
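A minimal sketch of that workaround, assuming the stringi package is installed. The input string is a hand-written stand-in for what redland returns, not captured output.

```r
library(stringi)

# Stand-in for a value returned by a query or read from an nquads file:
# the escape is stored as six literal ASCII characters.
raw <- "Ma\\u00EBlle Salmon"

# stri_unescape_unicode() converts \uXXXX escapes into the characters
# they denote and returns a UTF-8 string.
fixed <- stringi::stri_unescape_unicode(raw)
print(fixed)
```

Note that stri_unescape_unicode() also interprets other backslash escapes such as \n and \\, so it should only be applied to strings known to be escaped.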

ropensci / redland-bindings #62