goto-bus-stop / recanalyst

Analyzes Age of Empires 2 recorded game files.
https://goto-bus-stop.github.io/recanalyst/doc/v4.2.0
GNU General Public License v3.0

Encoding with player names and chat messages #55

Open lichifeng opened 7 years ago

lichifeng commented 7 years ago

It seems that game records generated on different Windows versions use different character encodings. Character encoding is a headache, especially with game records from users of non-Latin scripts: player names and chat messages do not always display correctly.

I tried to resolve this with mb_detect_encoding() and mb_convert_encoding(), but failed. It is hard for mb_detect_encoding() to make a good guess (maybe a player name is too short?).
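To illustrate why detection struggles with short strings, here is a minimal sketch. It is written in Python rather than PHP to keep it short, and the player name and candidate list are made-up examples, not data from a real record. It shows that the same four bytes decode without error under more than one legacy encoding, so validity alone cannot tell a detector which one is right.

```python
# Bytes of a short, hypothetical player name ("中文") as a GBK-encoded
# record might store it: b'\xd6\xd0\xce\xc4'.
raw_name = "中文".encode("gbk")

# Illustrative candidate list; a real detector would consider many more.
CANDIDATES = ["utf-8", "gbk", "cp1252"]


def accepted_encodings(raw, candidates=CANDIDATES):
    """Return {encoding: decoded text} for every candidate that decodes without error."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate rejects the byte sequence outright.
            pass
    return results


for encoding, text in accepted_encodings(raw_name).items():
    print(f"{encoding}: {text}")
```

Only UTF-8 rejects these bytes. GBK and Windows-1252 both accept them, and with only a few bytes there is little statistical evidence to prefer one over the other.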

Since I mainly use recanalyst to analyze game records from Chinese users, I simply decode the strings extracted from records as GBK (a common encoding for Chinese characters) and then encode them as UTF-8. The result is acceptable for me, but it is clearly a hack and not elegant.
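The workaround described above can be sketched roughly as follows, again in Python for brevity (in PHP, mb_check_encoding() and mb_convert_encoding() would play the same roles). The function name `to_utf8` is hypothetical, and the up-front UTF-8 check is an added assumption that is not part of the workaround as described: it avoids re-decoding strings that are already valid UTF-8.

```python
def to_utf8(raw, fallback="gbk"):
    """Return `raw` as UTF-8 bytes.

    Assumption: every string is either already valid UTF-8 or encoded in
    `fallback`. Records from other locales will come out garbled.
    """
    try:
        raw.decode("utf-8")  # strict check; raises on invalid UTF-8
        return raw
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD instead of raising an error.
        return raw.decode(fallback, errors="replace").encode("utf-8")


gbk_name = "中文".encode("gbk")
print(to_utf8(gbk_name).decode("utf-8"))
```

The fixed fallback is what makes this approach fragile: it only works as long as all records come from users with the same code page.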

So here is my question: is there a way to know the encoding of the strings in a record explicitly?

Thanks.

goto-bus-stop commented 7 years ago

Thanks for bringing this up! Recorded games don't store the encoding, but we may be able to guess it, for example by comparing the text from the Objectives tab with some predefined language strings. RecAnalyst does that for map names. I think it'd be ideal to solve this in RecAnalyst, so that player names are always returned as UTF-8.
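A rough sketch of that guessing idea, in Python for brevity, might look like the following. Everything in it is an assumption for illustration: the reference phrases, the pairing of each game language with a legacy code page, and the function names are hypothetical and are not taken from RecAnalyst or from real recorded games.

```python
# Hypothetical reference table: a phrase expected in the Objectives text for
# a given game language, paired with the legacy encoding that language is
# assumed to imply. Not real RecAnalyst data.
KNOWN_PHRASES = [
    ("gbk", "标准游戏"),
    ("cp1251", "Стандартная игра"),
    ("cp1252", "Standard Game"),
]


def guess_encoding(objectives_raw, known=KNOWN_PHRASES):
    """Return the first encoding whose encoded reference phrase occurs in the raw bytes."""
    for encoding, phrase in known:
        if phrase.encode(encoding) in objectives_raw:
            return encoding
    return None


def decode_with_guess(raw, objectives_raw, default="cp1252"):
    """Decode `raw` using the encoding guessed from the Objectives text."""
    encoding = guess_encoding(objectives_raw) or default
    # Undecodable bytes become U+FFFD instead of raising an error.
    return raw.decode(encoding, errors="replace")


objectives = "标准游戏：征服".encode("gbk")
print(decode_with_guess("中文".encode("gbk"), objectives))
```

This relies on the further assumption that player names and chat messages use the same encoding as the Objectives text, which may not hold when players in one game use different system locales.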