protocolbuffers / protobuf

Protocol Buffers - Google's data interchange format
http://protobuf.dev

Messy code when returning Chinese characters #16657

Closed Juice007 closed 4 days ago

Juice007 commented 2 weeks ago

What version of protobuf and what language are you using?
Version: main/v3.6.0/v3.5.0
Language: Go, Objective-C

What operating system (Linux, Windows, ...) and version? iOS

What runtime / compiler are you using (e.g., python version or gcc version)

What did you do? Steps to reproduce the behavior:
1. When the request returns Chinese text, the iOS client gets Unicode escape codes in the response header:

image

Decoding this into Chinese yields garbled text:

\U00e7\U0094\U00a8\U00e6\U0088\U00b7\U00e4\U00b8\U008d\U00e5\U00ad\U0098\U00e5\U009c\U00a8

Garbled text:

用户不存在

Expected Chinese: 用户不存在

puellanivis commented 2 weeks ago

It looks like something has taken UTF-8 encoded text and turned each encoded byte into an individual code point, instead of decoding the UTF-8 and then encoding the resulting code points as \U escapes. Converting these escapes back into individual bytes produces the proper Chinese text.

Juice007 commented 1 week ago

Sorry. I'm still a little confused about what you mean.

Juice007 commented 1 week ago

@puellanivis Can you tell me in detail what I should do?

puellanivis commented 1 week ago

Removing all the \U00, and then hex decoding yields the intended Chinese text: https://go.dev/play/p/KG54AtomS5p
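
A minimal sketch of that step, for readers who cannot open the playground link. The helper name decodeEscapedBytes is hypothetical and this is not necessarily the code behind the link; it only assumes the escaped text has exactly the \U00xx form shown in the report.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// decodeEscapedBytes is a hypothetical helper: it strips every `\U00` prefix,
// hex-decodes the remaining digits into raw bytes, and interprets those bytes
// as UTF-8 text.
func decodeEscapedBytes(escaped string) (string, error) {
	hexDigits := strings.ReplaceAll(escaped, `\U00`, "")
	raw, err := hex.DecodeString(hexDigits)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	// The escaped value from the original report.
	escaped := `\U00e7\U0094\U00a8\U00e6\U0088\U00b7\U00e4\U00b8\U008d\U00e5\U00ad\U0098\U00e5\U009c\U00a8`
	s, err := decodeEscapedBytes(escaped)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(s) // 用户不存在
}
```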

As an alternative to the Go playground, thanks to the %-encoding of URIs, this can also be seen with a simple data URI: data:,%e7%94%a8%e6%88%b7%e4%b8%8d%e5%ad%98%e5%9c%a8 (Chrome shows me the same garbled nonsense on the page, but the URI shows the correct Chinese.)

Somehow the text seems to have been converted from UTF-8 bytes directly into Unicode code points without proper decoding, à la:

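// f reproduces the bug: each UTF-8 byte of correctString is treated as its
// own Unicode code point and re-encoded, which yields mojibake.
// Requires the "fmt" and "strings" imports.
func f(correctString string) string {
    buf := new(strings.Builder)
    for _, b := range []byte(correctString) {
        fmt.Fprintf(buf, "%c", b) // %c formats the byte value as a code point
    }
    return buf.String()
}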

https://go.dev/play/p/IPBEQzpuDce
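
If the client only ever receives the garbled string, the conversion above can in principle be undone. The following is a rough sketch under that assumption: garble mirrors f from above, and repair is a hypothetical inverse that maps each code point back to a single byte and checks that the result is valid UTF-8. It only applies when every code point fits in one byte, as it does here.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// garble mirrors f above: every UTF-8 byte becomes its own code point.
func garble(correct string) string {
	var buf strings.Builder
	for _, b := range []byte(correct) {
		buf.WriteRune(rune(b))
	}
	return buf.String()
}

// repair is a hypothetical inverse of garble. It reports false when the input
// could not have come from byte-per-code-point garbling of valid UTF-8.
func repair(garbled string) (string, bool) {
	raw := make([]byte, 0, len(garbled))
	for _, r := range garbled {
		if r > 0xFF {
			return "", false // code point does not fit in a single byte
		}
		raw = append(raw, byte(r))
	}
	if !utf8.Valid(raw) {
		return "", false
	}
	return string(raw), true
}

func main() {
	garbled := garble("用户不存在")
	fixed, ok := repair(garbled)
	// 15 UTF-8 bytes became 15 separate code points.
	fmt.Println(len([]rune(garbled)), fixed, ok) // 15 用户不存在 true
}
```

Fixing the encoding at its source is still preferable; a repair step like this only masks the underlying bug.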

Without further code, I can’t really help much beyond pointing out that it’s the correct text, just encoded wrong (https://en.wikipedia.org/wiki/Mojibake). I will note that the Originmsg also appears to be incorrectly encoded, and is the likely source of the problem with the Returnmsg. Is the Returnmsg simply repeating whatever it got from the Originmsg? In that case, we’re not doing anything wrong at all; the client is encoding the Originmsg wrong.

Juice007 commented 4 days ago

Thanks @puellanivis! Our final solution is that the response returns the URL-encoded Chinese string and the client URL-decodes it, which solves the problem.

URL-encoded string: %E7%94%A8%E6%88%B7%E4%B8%8D%E5%AD%98%E5%9C%A8%0A

URL-decoded string: 用户不存在
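
For illustration, the round trip described above could look roughly like this in Go. The function names encodeMessage and decodeMessage are hypothetical, and the actual iOS client would use its platform's own percent-decoding instead of Go. The sketch assumes the trailing %0A in the encoded value is an incidental newline that the client can trim.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// encodeMessage (hypothetical, server side) percent-encodes the UTF-8 bytes of
// msg so that only ASCII characters travel in the response.
// Note: url.QueryEscape encodes a space as "+"; url.PathEscape gives "%20".
func encodeMessage(msg string) string {
	return url.QueryEscape(msg)
}

// decodeMessage (hypothetical, client side) reverses the percent-encoding and
// trims surrounding whitespace such as a trailing newline (%0A).
func decodeMessage(encoded string) (string, error) {
	decoded, err := url.QueryUnescape(encoded)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(decoded), nil
}

func main() {
	fmt.Println(encodeMessage("用户不存在")) // %E7%94%A8%E6%88%B7%E4%B8%8D%E5%AD%98%E5%9C%A8

	// The encoded value from the comment above, including the trailing %0A.
	msg, err := decodeMessage("%E7%94%A8%E6%88%B7%E4%B8%8D%E5%AD%98%E5%9C%A8%0A")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(msg) // 用户不存在
}
```

Because the encoded form is pure ASCII, it passes through layers that would otherwise misinterpret the raw UTF-8 bytes.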