hashgraph / hedera-mirror-node

Hedera Mirror Node archives data from consensus nodes and serves it via an API
Apache License 2.0

Allow null character in string fields #5273

Open steven-sheehy opened 1 year ago

steven-sheehy commented 1 year ago

Problem

As a Hedera user, I'd like to be able to use all valid UTF-8 characters as input to Hedera string fields like topic messages, memos, etc. PostgreSQL currently disallows the null character (U+0000, encoded as 0x00 in UTF-8) from appearing in varchar or text columns, so HAPI disallows this character in pre-check.

Solution

Possible escaping algorithm where U+0000 is the null character, U+001B is the escape character, U+FFFD is the null character replacement:

  1. U+001B -> U+001BU+001B
  2. U+FFFD -> U+001BU+FFFD
  3. U+0000 -> U+FFFD

Unescape: Apply the mappings above in reverse (right-hand side back to left-hand side), starting with 3). Mapping 3) applies iff the U+FFFD is not itself escaped, i.e. it is not immediately preceded by an unpaired U+001B.
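
A minimal Java sketch of the escape and unescape steps above. The class and method names are hypothetical and not taken from the mirror node code base. The unescape is written as a single left-to-right pass that consumes each escape pair as a unit, which is one assumed reading of the rule that 3) only applies when the U+FFFD is not escaped. Malformed input (a trailing lone U+001B, or U+001B followed by some other character) is passed through as-is, which is also an assumption.

```java
final class NullCharEscaper {
    private static final char NUL = '\u0000';         // character PostgreSQL rejects
    private static final char ESC = '\u001B';         // escape character
    private static final char REPLACEMENT = '\uFFFD'; // stands in for NUL when stored

    private NullCharEscaper() {}

    // Applies mappings 1) to 3) in one pass so the result contains no NUL.
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == ESC) {
                out.append(ESC).append(ESC);          // 1) ESC -> ESC ESC
            } else if (c == REPLACEMENT) {
                out.append(ESC).append(REPLACEMENT);  // 2) U+FFFD -> ESC U+FFFD
            } else if (c == NUL) {
                out.append(REPLACEMENT);              // 3) NUL -> U+FFFD
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Reverses escape() by scanning left to right.
    static String unescape(String stored) {
        StringBuilder out = new StringBuilder(stored.length());
        for (int i = 0; i < stored.length(); i++) {
            char c = stored.charAt(i);
            if (c == ESC && i + 1 < stored.length()) {
                out.append(stored.charAt(++i));       // escaped literal: keep next char
            } else if (c == REPLACEMENT) {
                out.append(NUL);                      // unescaped U+FFFD was a NUL
            } else {
                out.append(c);                        // ordinary char (or trailing ESC)
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "memo" + NUL + ESC + REPLACEMENT;
        String stored = escape(original);
        System.out.println(stored.indexOf(NUL) < 0);            // stored form has no NUL
        System.out.println(unescape(stored).equals(original));  // round trip is lossless
    }
}
```

In the worst case (a string made up only of U+001B or U+FFFD) the escaped form is twice as long as the input, so any column length limits would need to account for that.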

Alternatives

No response

steven-sheehy commented 1 year ago

I’ve been reading more about UTF-8, and saw a claim that Java actually encodes a NULL character as 0xc080 rather than as 0x00. This was legal in the Unicode standard before version 3.1, but has been illegal since then.

If we ask Java to convert a string to bytes as UTF-8, does it convert the NULL to the long form? If a string contains a NULL, will the protobuf parser from Google convert it to the long form? Does our own protobuf parser do that? Does postgres complain if you use the long form, even though it’s not strictly legal UTF-8?

(BTW, the basic way UTF-8 encodes would suggest that 0xc080 is an encoding of 0x0000, which looks like a longer way to say 0x00, which is why this encoding used to be legal. Since version 3.1, the standard requires the encoding to be as compact as possible, so you have to use one byte rather than two.)

If postgres can handle non-strict UTF-8 with this illegal byte sequence, then the easiest way to sanitize is to simply replace 0x00 with 0xc080 before writing to postgres, and convert back when reading. That’s much simpler than escaping, but it can still double the length of a string. If a user tries to send a 0xc080 through protobuf, will it be encoded to binary protobuf and decoded back to Java and still be a 0xc080?

We should answer the above questions as part of this work.
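
As a starting point for the first question, here is a small Java check using only the JDK. It compares `String.getBytes(StandardCharsets.UTF_8)` with `DataOutputStream.writeUTF`, which is documented to write Java's "modified UTF-8", and then asks a strict JDK decoder whether the two-byte form is accepted. The class and method names are hypothetical. This only covers the JDK itself; it says nothing about how either protobuf parser or postgres behaves, which would still need to be tested separately.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8NulCheck {
    private Utf8NulCheck() {}

    // Standard UTF-8, as produced by String.getBytes.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8, as produced by DataOutputStream.writeUTF.
    // writeUTF prefixes the data with a two-byte length, which is dropped here.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                out.writeUTF(s);
            }
            byte[] all = buffer.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the bytes decode as UTF-8 when malformed input is reported
    // instead of silently replaced.
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        System.out.println(Arrays.toString(standardUtf8(nul)));  // one byte: 0x00
        System.out.println(Arrays.toString(modifiedUtf8(nul)));  // two bytes: 0xC0 0x80
        System.out.println(isStrictUtf8(modifiedUtf8(nul)));     // strict decoder result
    }
}
```

So the long form only appears on the modified UTF-8 code paths such as `writeUTF`, not when a string is converted with the standard UTF-8 charset, and the JDK's strict decoder rejects 0xc080 as malformed.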