redis / librdb

Redis RDB file parser, with JSON, RESP and RDB-loader extensions
MIT License
27 stars 8 forks source link

Wrong encoding of non-ASCII characters in JSON #40

Open jscissr opened 8 months ago

jscissr commented 8 months ago

Non-ASCII characters are encoded incorrectly by rdb-cli dump.rdb json.

Example: Add a key to redis with SET demo "Müller". Run rdb-cli dump.rdb json. The result is:

[{
    "demo":"M\u00c3\u00bcller"
}]

After unescaping, we get "Müller".

For comparison, rdbtools (which does not work with newer redis versions) outputs:

[{
"demo":"M\u00fcller"}]

The simplest way to fix this is to avoid escaping non-ASCII characters entirely, and output them as is:

diff --git a/src/ext/handlersToJson.c b/src/ext/handlersToJson.c
index c5addf7..7a7e299 100644
--- a/src/ext/handlersToJson.c
+++ b/src/ext/handlersToJson.c
@@ -65,7 +65,7 @@ static void outputPlainEscaping(RdbxToJson *ctx, char *p, size_t len) {
             case '\t': fprintf(ctx->outfile, "\\t"); break;
             case '\b': fprintf(ctx->outfile, "\\b"); break;
             default:
-                fprintf(ctx->outfile, (isprint(*p)) ? "%c" : "\\u%04x", (unsigned char)*p);
+                fprintf(ctx->outfile, ((unsigned char)*p > 127 || isprint(*p)) ? "%c" : "\\u%04x", (unsigned char)*p);
         }
         p++;
     }

With this change, the result is:

[{
    "demo":"Müller"
}]
moticless commented 7 months ago

Hello @jscissr,

You've raised a crucial aspect of encoding that hasn't received much attention until now.

Your proposed solution may not be suitable for handling binary data. From my perspective, the appropriate approach would involve expanding the functionality of the rdb-cli tool, particularly when dealing with JSON format, to include an option for specifying the encoding of the data. Either UTF-8 or raw data. Supporting UTF-8 would necessitate integrating a UTF-8 decoder. While it's important to have this capability, implementing it won't be a quick fix.

jscissr commented 7 months ago

Yes, my solution doesn't handle binary data, but that's because JSON itself doesn't support that. If you want to support binary, you would have to base64-encode all strings.

If on the other hand all strings in your redis database are UTF-8 encoded, then my solution just passes through the strings as is, and you get a UTF-8 encoded JSON. You don't need to decode the UTF-8, you can just pass it through.

rdbtools parses UTF-8 and generates unicode escape sequences, but that is not necessary and increases file size.

moticless commented 7 months ago

If on the other hand all strings in your redis database are UTF-8 encoded

We cannot make that assumption. Unless the user explicitly requests the use of UTF-8, the tool should default to printing what could be considered as raw data to prevent potential data corruption.

The suggested conversion from UTF-8 to Unicode overlooks the fact that UTF-8 character encoding can vary in length, with sequences of up to 6 bytes.