icza / screp

StarCraft - Brood War replay parser
Apache License 2.0
80 stars 21 forks

Korean is broken in the game title. #46

Closed kbj1213 closed 5 months ago

kbj1213 commented 1 year ago

Hi, I'm using screp.

But I have a problem.

When the game's title and the map's name are in Korean, the strings come out broken.

I'm guessing it's an encoding issue.

Could it be revised in the next version?

icza commented 1 year ago

Sure, I can check it. Please send example replays.

kbj1213 commented 1 year ago

Here it is! https://repmastered.app/game/ZIgQt8StxDLueYfqv8-C92w2mwEWaOXRRPkCQuSUFDI

icza commented 1 year ago

The title of that game is decoded to: "3:3 鍮⑤Т 諛곗냽 �쒕�"

What should it be? What is the correct title?

kbj1213 commented 1 year ago

I don't know exactly but it starts like "3:3 빨무 .... "

icza commented 1 year ago

I checked this replay. Yes, the problem is a character encoding issue.

screp has a simple, built-in encoding detector. Basically, if the raw data is valid UTF-8 encoded text, then UTF-8 decoding is used to obtain the text. If it is not, EUC-KR (also known as Code Page 949) is tried next, and if that succeeds (no error decoding the raw data as EUC-KR), its output is the result.
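A minimal sketch of that detection order, using only the Go standard library. The names `decodeText`, `rejectAll` and the pluggable `fallback` parameter are hypothetical and are not screp's actual API; a real fallback could wrap an EUC-KR decoder. What happens when both decoders fail is also an assumption here.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// decodeText follows the order described above: UTF-8 first, then a
// fallback decoder. The fallback parameter is hypothetical.
func decodeText(raw []byte, fallback func([]byte) (string, error)) string {
	if utf8.Valid(raw) {
		return string(raw)
	}
	if s, err := fallback(raw); err == nil {
		return s
	}
	// Assumption: if neither decoder accepts the data, return the bytes as-is.
	return string(raw)
}

// rejectAll is a stand-in fallback that always fails (demo only).
func rejectAll([]byte) (string, error) {
	return "", errors.New("fallback decoder not available")
}

func main() {
	// Valid UTF-8 never reaches the fallback.
	fmt.Println(decodeText([]byte("3:3 fast"), rejectAll))

	// Invalid UTF-8 is handed to the fallback, whatever it produces.
	tagged := func([]byte) (string, error) { return "fallback used", nil }
	fmt.Println(decodeText([]byte{0xEB, 0xA3}, tagged))
}
```

The point of the sketch is that a single invalid byte anywhere in the input is enough to route the whole string to the fallback decoder.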

In this replay, the raw title bytes are:

[51 58 51 32 235 185 168 235 172 180 32 235 176 176 236 134 141 32 235 147 156 235 163 0 0 0 0 0]

It's a 0-terminated string, so trailing zeros are cut off:

[51 58 51 32 235 185 168 235 172 180 32 235 176 176 236 134 141 32 235 147 156 235 163]

The thing is that this is supposed to be UTF-8, because decoding it as UTF-8 results in:

3:3 빨무 배속 드�

(Note the Unicode replacement character, shown as a question mark, at the end.) BUT this is INVALID UTF-8, so screp falls back to EUC-KR, and decoding it as such results in:

3:3 鍮⑤Т 諛곗냽 �쒕�

To sum it up: the raw data is supposed to be UTF-8, but the data is invalid. So screp falls back to a wrong encoding.

icza commented 1 year ago

Note that there are many Korean examples where decoding works. It would also work here using UTF-8 if the raw data weren't invalid UTF-8.

icza commented 1 year ago

For example check out this game: https://repmastered.app/game/HXMAHL9rvAYiyTAgRxe2RIl87upz5XKq-U2wRGDdYmM

The game title is Korean, there's a player with a Korean name (a computer), and there is also Korean chat, all properly decoded.

armoha commented 1 year ago

I think we could make an exception: do not fall back to CP949 (EUC-KR) if only the last character of the game title is invalid UTF-8.

Example)

33 3A 33 20 EB B9 A8 EB AC B4 20 EB B0 B0 EC 86 8D 20 EB 93 9C EB A3

Only the last two bytes, EB A3 (235 163), are invalid UTF-8, so we try to truncate 1 character:

3:3 빨무 배속 드

No idea what the actual game title would be. (235 163 xxx is not a Korean character.)
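A sketch of that exception in Go. The function name `trimIncompleteTail` is hypothetical, and this is only one way the idea could be implemented: it drops at most one incomplete multi-byte sequence at the very end and leaves every other kind of invalid data alone.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// trimIncompleteTail returns b without a trailing, incomplete UTF-8 sequence.
// It reports false if b is invalid for any other reason.
func trimIncompleteTail(b []byte) ([]byte, bool) {
	if utf8.Valid(b) {
		return b, true
	}
	// An incomplete sequence is at most utf8.UTFMax-1 bytes long.
	for cut := 1; cut < utf8.UTFMax && cut <= len(b); cut++ {
		head, tail := b[:len(b)-cut], b[len(b)-cut:]
		// FullRune is false only when tail is the beginning of a
		// multi-byte sequence that was cut short.
		if utf8.Valid(head) && !utf8.FullRune(tail) {
			return head, true
		}
	}
	return nil, false
}

func main() {
	raw := []byte("3:3 빨무 배속 드\xEB\xA3")
	if title, ok := trimIncompleteTail(raw); ok {
		fmt.Println(string(title))
	}
}
```

For the title bytes from this replay the result is `3:3 빨무 배속 드`. Input that is invalid somewhere other than the tail is rejected, so the fallback to another encoding would still apply there.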

icza commented 1 year ago

I'd rather like to understand the cause first before adding any exceptions.

If this is supposed to be UTF-8 encoded, why is it invalid UTF-8? Is it a bug in StarCraft or something else? Generating UTF-8 encoded data isn't something rare or complex, and StarCraft does it properly everywhere else. Does this happen if certain characters are used in the title?

armoha commented 1 year ago

Yeah, I also want to know exactly what is happening.

> If this is supposed to be UTF-8 encoded, why is it invalid UTF-8?

At first I suspected the game title was too long. On Battle.net a game title can be up to 31 characters long (a character limit, not a byte limit), which is much more than the replay header can store.

So I tried a very long game title but couldn't reproduce the problem. (The actual game title of the first link was ㅁ1ㅁ2ㅁ3ㅁ4ㅁ5ㅁ6ㅁ7ㅁ8ㅁ9ㅁ0ㅁ1ㅁ2ㅁ3ㅁ4ㅁ5ㅁ)

I also guessed that 235 163 might be an offset into the replay file for additional game title data, but this is not the case.

> Does this happen if certain characters are used in the title?

No idea. I downloaded and tested the replay, but StarCraft: Remastered does not display the game title of a replay at all, so I can't guess what the original game title was.

icza commented 1 year ago

If there's not enough space for the title in the header, then yes, the title must be truncated, and StarCraft may do that in the "middle" of a UTF-8 sequence (I haven't checked).

But there are 28 bytes reserved for the game title in the header, and the "truncated" title doesn't even use all of them:

[51 58 51 32 235 185 168 235 172 180 32 235 176 176 236 134 141 32 235 147 156 235 163 0 0 0 0 0]

This uses only 23 bytes, and there are 5 zeroes remaining. So it makes no sense to cut the title data in the middle of a multi-byte UTF-8 sequence...

Note that if there were one additional non-zero byte, anything from the range [128..191], the title would be a valid UTF-8 encoded sequence. You can check them here: https://go.dev/play/p/-GigvCAEdV1

I don't speak Korean, but many of them seem to be sensible titles: Google translations
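For readers who cannot open the playground link, here is a standalone sketch of the same kind of check. It is not a copy of the linked code; the names `prefix` and `complete` are made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// prefix is the zero-trimmed title, ending in the incomplete pair 235 163.
var prefix = []byte{51, 58, 51, 32, 235, 185, 168, 235, 172, 180, 32, 235, 176, 176,
	236, 134, 141, 32, 235, 147, 156, 235, 163}

// complete appends one byte to a copy of prefix and reports whether the
// result is valid UTF-8.
func complete(last byte) (string, bool) {
	candidate := append(append([]byte{}, prefix...), last)
	return string(candidate), utf8.Valid(candidate)
}

func main() {
	// 128..191 (0x80..0xBF) is the range of UTF-8 continuation bytes.
	for b := 128; b <= 191; b++ {
		if title, ok := complete(byte(b)); ok {
			fmt.Println(b, title)
		}
	}
}
```

All 64 values in the range yield valid UTF-8, each completing the last character to a different Hangul syllable; bytes just outside the range, such as 127 or 192, do not.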

Some examples:

130 3:3 빨무 배속 드룂    fast speed drone
131 3:3 빨무 배속 드룃    fast speed drone
141 3:3 빨무 배속 드룍    speed drop

What happens if you use some of these titles?

armoha commented 1 year ago

> But there are 28 bytes reserved for the game title in the header, and the "truncated" title doesn't even use all of them:

The 3 examples in my last comment showed game titles that used only 23, 21 and 22 bytes. (The real game titles were much longer.) I think SC uses at most 24 bytes (or 23 bytes + null terminator) for the game title, and the last 4 bytes are always 0 or reserved for other uses.

> I don't speak Korean, but many of them seem to be sensible titles: Google translations

Google Translate and Google Search perform very poorly in Korean. Most Koreans use naver.com for search, and the Papago translator is much better.

Those examples are not sensible candidates at all to any Korean speaker. Here are my guesses at possible titles:

168 3:3 빨무 배속 드루와
189 3:3 빨무 배속 드룽와

들어와 = 'Come in'

드루와 = pronounced similarly to 들어와, but sounds informal
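The byte-limit hypothesis and the first guessed title can be checked against each other. The sketch below assumes, as proposed in this comment and not confirmed anywhere in the thread, that the title is cut to 23 bytes without regard to character boundaries; `truncateBytes` is a hypothetical helper.

```go
package main

import "fmt"

// truncateBytes cuts s to at most limit bytes, ignoring rune boundaries.
func truncateBytes(s string, limit int) []byte {
	b := []byte(s)
	if len(b) > limit {
		b = b[:limit]
	}
	return b
}

func main() {
	// Guessed original title, 27 bytes in UTF-8; 23 is the assumed limit.
	fmt.Println(truncateBytes("3:3 빨무 배속 드루와", 23))
}
```

The printed bytes are exactly the 23 non-zero title bytes quoted earlier in the thread, ending in `235 163`. The second guess, 드룽와, produces the same 23 bytes, so this check cannot tell the two candidates apart.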

icza commented 1 year ago

OK, I think I have a solution here. SC:R always uses UTF-8 (outside of the map section which may come from an external source or from the "past"). So this can be decoded using forced UTF-8 since this was played using SC:R. Will update the parser.
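A sketch of what forced UTF-8 decoding could look like in Go. How screp handles the invalid tail is not stated in this thread; the helper name `forceUTF8` and the choice to substitute U+FFFD for invalid bytes are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// forceUTF8 decodes raw as UTF-8 unconditionally, replacing each run of
// invalid bytes with a single U+FFFD replacement character.
func forceUTF8(raw []byte) string {
	return strings.ToValidUTF8(string(raw), "\uFFFD")
}

func main() {
	raw := []byte("3:3 빨무 배속 드\xEB\xA3")
	fmt.Println(forceUTF8(raw))
}
```

With no fallback encoding involved, the title from this replay decodes to `3:3 빨무 배속 드�`, matching the UTF-8 result shown earlier in the thread.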

icza commented 1 year ago

Released screp v1.11.3, forcing UTF-8 when reading game title: https://github.com/icza/screp/releases/tag/v1.11.3

icza commented 1 year ago

Also updated the parser engine of repmastered.app, and the game title is now parsed correctly: https://repmastered.app/game/ZIgQt8StxDLueYfqv8-C92w2mwEWaOXRRPkCQuSUFDI