Open jscissr opened 8 months ago
Hello @jscissr,
You've raised a crucial aspect of encoding that hasn't received much attention until now.
Your proposed solution may not be suitable for handling binary data. From my perspective, the appropriate approach would be to extend the rdb-cli tool, particularly for the JSON format, with an option that specifies the encoding of the data: either UTF-8 or raw data. Supporting UTF-8 would require integrating a UTF-8 decoder. It's an important capability, but implementing it won't be a quick fix.
Yes, my solution doesn't handle binary data, but that's because JSON itself doesn't support that. If you want to support binary, you would have to base64-encode all strings.
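To illustrate the base64 option mentioned above, here is a minimal Python sketch (not rdb-cli code). The helper names `encode_value` and `decode_value` are hypothetical; the point is only that arbitrary bytes survive a JSON round trip once they are base64-encoded.

```python
import base64
import json


def encode_value(raw: bytes) -> str:
    """Hypothetical helper: turn arbitrary bytes into a JSON-safe ASCII string."""
    return base64.b64encode(raw).decode("ascii")


def decode_value(text: str) -> bytes:
    """Hypothetical helper: reverse encode_value()."""
    return base64.b64decode(text)


# A value that is not valid UTF-8 and could not be stored as a plain JSON string.
binary_value = bytes([0x00, 0xFF, 0xFE, 0x80])

doc = json.dumps({"demo": encode_value(binary_value)})
restored = decode_value(json.loads(doc)["demo"])
```

The cost is that every string, including plain text, becomes unreadable in the JSON output and grows by roughly a third.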
If on the other hand all strings in your redis database are UTF-8 encoded, then my solution just passes through the strings as is, and you get a UTF-8 encoded JSON. You don't need to decode the UTF-8, you can just pass it through.
rdbtools parses UTF-8 and generates unicode escape sequences, but that is not necessary and increases file size.
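The difference between passing bytes through and generating escape sequences can be sketched in Python. This is an illustration under assumptions, not the actual rdb-cli or rdbtools code: `escape_json_bytes` is a hypothetical function that escapes only what JSON requires (quote, backslash, control characters) and copies every byte at or above 0x80 unchanged.

```python
import json


def escape_json_bytes(raw: bytes) -> bytes:
    """Hypothetical sketch: quote raw bytes as a JSON string without decoding them.

    Bytes >= 0x80 are copied as is, so valid UTF-8 input yields valid UTF-8 JSON.
    """
    out = bytearray(b'"')
    for b in raw:
        if b == 0x22:            # double quote
            out += b'\\"'
        elif b == 0x5C:          # backslash
            out += b"\\\\"
        elif b < 0x20:           # control characters must be escaped in JSON
            out += b"\\u%04x" % b
        else:                    # printable ASCII and all non-ASCII bytes
            out.append(b)
    out += b'"'
    return bytes(out)


value = "M\u00fcller".encode("utf-8")            # "Müller" as stored in Redis
passed_through = escape_json_bytes(value)        # UTF-8 bytes kept as is
escaped = json.dumps("M\u00fcller").encode("ascii")  # ensure_ascii=True escapes to \u00fc
```

Both forms parse back to the same string; the pass-through form is 9 bytes here and the escaped form is 13.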
> If on the other hand all strings in your redis database are UTF-8 encoded
We cannot make that assumption. Unless the user explicitly requests the use of UTF-8, the tool should default to printing what could be considered as raw data to prevent potential data corruption.
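One possible way to honor this default is to treat UTF-8 as opt-in and still validate each value before passing it through. The Python sketch below is only an illustration: `choose_mode` and its `utf8_requested` flag are hypothetical and do not correspond to an existing rdb-cli option.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def choose_mode(raw: bytes, utf8_requested: bool) -> str:
    """Hypothetical policy: emit UTF-8 only when the user asked for it
    and the value actually validates; otherwise fall back to raw output."""
    if utf8_requested and is_valid_utf8(raw):
        return "utf8"
    return "raw"
```

With this policy a binary value never gets reinterpreted as text, even when the user has requested UTF-8.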
The suggested conversion from UTF-8 to Unicode overlooks the fact that UTF-8 is a variable-length encoding: a single character takes 1 to 4 bytes under RFC 3629 (the original design allowed sequences of up to 6 bytes).
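The variable-length point can be checked with a short Python sketch. Under RFC 3629 a sequence is at most 4 bytes long (the older 5- and 6-byte forms are no longer valid), and the lead byte alone determines the length. `sequence_length` is a hypothetical helper written for this illustration.

```python
def sequence_length(lead: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte (RFC 3629)."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte: {lead:#04x}")


# "A", "ü", "€", and an emoji, written as escapes to keep the source ASCII-only.
samples = ["A", "\u00fc", "\u20ac", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
```

A converter that handles one byte at a time will therefore split every non-ASCII character into several unrelated code points.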
Non-ASCII characters are encoded incorrectly by `rdb-cli dump.rdb json`.
Example: Add a key to redis with `SET demo "Müller"`. Run `rdb-cli dump.rdb json`. The result is:
After unescaping, we get "Müller".
For comparison, rdbtools (which does not work with newer redis versions) outputs:
The simplest way to fix this is to avoid escaping non-ASCII characters entirely, and output them as is:
With this change, the result is: