cboettig closed this issue 6 years ago
@cboettig This redland library doc page states that UTF-8 is the native internal string format, so the problem must be somewhere in the conversion in the SWIG code. This could explain why some code paths preserve encoding while others don't. I don't yet see any arguments or global parameters that can be used to set the input/output encoding (to and from the redland libraries), but I will keep looking. The SWIG + redland C code is deep!
@cboettig the string Ma\u00EBlle contains the characters \u00EB, which is the Unicode literal string for Latin Small Letter E with diaeresis (ë), from https://en.wikipedia.org/wiki/List_of_Unicode_characters.
This value is correct, it's just represented incorrectly. It looks like the redland C library returned the ASCII representation of the Unicode literal string, i.e. the six characters '\u00EB', instead of the single Unicode character.
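To make the distinction concrete, here is a small base R sketch comparing the two representations. The variable names `escaped` and `decoded` are made up for illustration, and the escaped string is typed in by hand rather than taken from an actual redland result.

```r
# Hypothetical values for illustration, not output captured from redland.
escaped <- "Ma\\u00EBlle"  # backslash, u, 0, 0, E, B stored as six ASCII characters
decoded <- "Ma\u00EBlle"   # the same name with a single U+00EB code point

n_escaped <- nchar(escaped)  # counts every character of the escape sequence
n_decoded <- nchar(decoded)  # counts the e with diaeresis as one character

# Code point of the third character in the decoded string
third_cp <- utf8ToInt(substr(decoded, 3, 3))

print(c(escaped = n_escaped, decoded = n_decoded))
print(third_cp == 0x00EB)
```

The escaped form is five characters longer because the one accented letter has been replaced by its six-character escape.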
I have looked at the redland C code and SWIG code and can't determine what is causing this.
One possible solution would be to post-process the result set and re-evaluate the string so that Unicode literal values are correctly interpreted, although I haven't found a way that works yet. I've tried scan() and eval() (dangerous to use, and it still didn't work).
Any ideas how to proceed with this?
Thanks for taking a look at this. I've found that just wrapping it in stringi::stri_unescape_unicode() restores the rendering as Unicode rather than as an ASCII-escaped literal. I'll close this, as I guess that's an acceptable workaround.
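A minimal sketch of that workaround, assuming the stringi package is installed. The `result` vector is a hand-written stand-in for values coming back from a query or serializer, not real redland output.

```r
library(stringi)

# Hypothetical stand-in for strings returned with ASCII-escaped Unicode
result <- c("Ma\\u00EBlle", "plain ASCII")

# Convert \uXXXX escape sequences back into the characters they denote
fixed <- stri_unescape_unicode(result)

print(fixed)
```

Strings without escapes pass through unchanged. One caveat: `stri_unescape_unicode()` also interprets other backslash escapes such as `\n` and `\t`, so it is best applied only to values known to carry this kind of escaping.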
UTF-8 characters are often mangled by redland functions:
SPARQL queries also mangle UTF-8
gives: "Ma\u00EBlle"
nquads and ntriples fail to encode UTF-8; I get "Ma\u00EBlle", not Maëlle. Try:
rdfxml and turtle look okay: