icza / screp

StarCraft - Brood War replay parser
Apache License 2.0

Players with non-Latin characters have all ????'s as names #27

Closed msikma closed 2 years ago

msikma commented 3 years ago

Hi there. I've been writing some code with Screp, which is a really awesome tool. I've run into some weirdness, though: Japanese or Korean player names (probably other languages too) show up as all ???'s in the tool.

For example, here's a replay I made just now playing against the computer in Japanese: 162215,(8)The Hunters.rep.zip

Here's my output with Screp.

The list of players:

$ screp ~/Games/StarCraft/Replays/_test_lang/jp/"162215,(8)The Hunters.rep" | jq -c ".Header.Players[] | [.Race.Name[0:1], .Name]"

["Z","??????????"]
["P","?????????"]
["T","??????"]
["T","?????"]
["Z","????????"]
["P","?????????"]
["Z","???????????"]
["P","Dada"]

This is what the expected values are (this is what you see looking at the file in the in-game replay browser):

Z グレンデル・ブルード
P アウリガ・トライブ
T アルファ部族
T アンティガ
Z スルト・ブルード
P アキレイ・トライブ
Z レヴィアタン・ブルード
P Dada

Would anyone have an idea of what's wrong? Is it a problem with the tool or did I miscompile it?

My version:

screp version: v1.5.0
Parser version: v1.6.1
EAPM algorithm version: v1.0.4
Platform: darwin amd64
Built with: go1.17
Author: Andras Belicza
Home page: https://github.com/icza/screp

icza commented 3 years ago

Hi. I get the same output.

The problem with replays is that the encoding used to encode certain strings (like player names, map name etc) is not recorded in the replay.

Most often it is UTF-8, which is what screp tries first. If a text is not valid UTF-8, it falls back to EUC-KR (Code Page 949). But that's the end of it. Other encodings are not handled by screp.
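The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not screp's actual code: the names `decoder`, `decodeName` and `latin1` are hypothetical, and Latin-1 only stands in for the EUC-KR step because Go's standard library has no EUC-KR decoder (a real fallback would more likely wrap something like `golang.org/x/text/encoding/korean`). The final "replace invalid bytes with ?" step is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// decoder is a hypothetical fallback decoder: it returns the decoded text and
// reports whether the input was valid in its encoding.
type decoder func(raw []byte) (string, bool)

// decodeName returns raw unchanged when it is valid UTF-8. Otherwise it tries
// each fallback in order. If none succeeds, every run of invalid bytes is
// replaced with "?" (an assumption, not necessarily what screp does).
func decodeName(raw []byte, fallbacks ...decoder) string {
	if utf8.Valid(raw) {
		return string(raw)
	}
	for _, dec := range fallbacks {
		if s, ok := dec(raw); ok {
			return s
		}
	}
	return strings.ToValidUTF8(string(raw), "?")
}

// latin1 is a stand-in fallback: ISO-8859-1 maps each byte directly to the
// Unicode code point with the same value, so it never fails.
func latin1(raw []byte) (string, bool) {
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b)
	}
	return string(runes), true
}

func main() {
	fmt.Println(decodeName([]byte("Dada"), latin1)) // valid UTF-8, returned as-is
	fmt.Println(decodeName([]byte{0xE9}, latin1))   // invalid UTF-8, decoded by the fallback
	fmt.Println(decodeName([]byte{0xE9}))           // no fallback available, prints "?"
}
```

Supporting another encoding in a scheme like this would just mean appending one more `decoder` to the list. The order matters, because a permissive decoder such as the Latin-1 stand-in accepts any input and hides everything after it.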

msikma commented 3 years ago

How weird! It seems very old-fashioned of Blizzard to store text in something other than UTF-8.

The decoding does not seem to work for Korean either, with 음역 or 완역 (although surely they'll use the same encoding), so the fallback encoding does not seem to cover it, at least for replays made with the modern version of the game.

Actually, there aren't that many languages in SC:R, and only 6 of them use non-Latin characters (2 of them are both Chinese, 2 of them Korean, which probably brings it to 4 encodings total), so maybe it's not such a big task to just find out what encodings they use and add them as fallbacks. I'd be willing to do the testing, if only I could find out what the raw bytes for the names are.

I'll have a look to see if I can get screp to output the raw bytes as an array of ints so I can do this testing. I'm not a Go user and I'm not familiar with the codebase, though, so any help in this regard would be appreciated 🙂

icza commented 3 years ago

I'm operating https://repmastered.app, which uses screp as the parsing engine.

The site holds well above a million replays, and the majority of them were saved with the modern version of the game. Most of them are Korean replays, along with some Chinese replays, and decoding Korean and Chinese names works just fine.

Btw, the raw Player names are also retained, it is just not outputted by the CLI tool. See https://github.com/icza/screp/blob/master/rep/header.go, Player.RawName field.

icza commented 3 years ago

This is not a question of old or modern client, but likely a per-client encoding setting.

msikma commented 3 years ago

That's interesting. I think this might be a problem with the Mac OS X client, which is what I use. I'm sure most players are on Windows, especially Korean players. I'll test this when I'm able. Edit: tested, and it's not a Mac issue; the same thing happens on Windows.

> Btw, the raw Player names are also retained, it is just not outputted by the CLI tool. See https://github.com/icza/screp/blob/master/rep/header.go, Player.RawName field.

Yes, I was hoping it might be possible to somehow output the raw bytes so I can have a look at what's going wrong, even if it's just in my own build of the CLI tool and only for debugging purposes.
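For that kind of debugging, a small helper along these lines could turn a raw name into plain numbers. It is only a sketch: `byteValues` is a hypothetical name, the hard-coded `rawName` stands in for a value read from the `Player.RawName` field mentioned above, and it assumes that value is available as a Go string (a byte slice could be converted with `string(b)`).

```go
package main

import "fmt"

// byteValues returns the bytes of s as a slice of ints, so they print (and
// JSON-encode) as plain decimal numbers rather than as text.
func byteValues(s string) []int {
	vals := make([]int, len(s))
	for i := 0; i < len(s); i++ {
		vals[i] = int(s[i]) // indexing a string yields bytes, not runes
	}
	return vals
}

func main() {
	// Stand-in for a raw player name; a real value would come from a parsed replay.
	rawName := "Dada"

	fmt.Println(byteValues(rawName)) // prints [68 97 100 97]
	fmt.Printf("% x\n", rawName)     // prints 44 61 64 61
}
```

The hex form is usually the easier one to compare against encoding tables, while the decimal form matches the "array of ints" idea and can be pasted into other tools.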

icza commented 2 years ago

Heads-up: This issue will be fixed in the next release.

icza commented 2 years ago

Happy to report that I released screp v1.5.1 which now properly parses these player names.

msikma commented 2 years ago

Thanks, awesome!