otland / forgottenserver

A free and open-source MMORPG server emulator written in C++
https://otland.net
GNU General Public License v2.0

Encode Latin-1 to UTF-8 and vice versa #4628

Closed amatria closed 6 months ago

amatria commented 6 months ago

Pull Request Prelude

Changes Proposed

The issue at hand is that since #2403 (which addressed issue #2175), all text in the database has been stored as UTF-8, while text coming from the client is Latin-1 encoded. This is harmless as long as the client's characters stay within the ASCII range, because Latin-1 and UTF-8 encode those characters identically. However, once we receive a string from the client containing characters beyond the ASCII range (e.g., "ñññççç"), we end up storing bytes in the database that are not valid UTF-8. This is where data loss occurs, resulting in the situation described by the original poster in #4560.
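To make the mismatch concrete, here is a minimal sketch (not part of this PR) of a structural UTF-8 check. The function name `looksLikeUtf8` is hypothetical, and the check is deliberately simplified: it only verifies lead bytes and continuation bytes, which is enough to show that the raw Latin-1 bytes for "ñññççç" do not form valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only.
// Simplified structural UTF-8 check: it validates lead and continuation
// bytes but does not reject every overlong or surrogate encoding.
bool looksLikeUtf8(const std::string& text) {
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t continuationBytes = 0;
        if (lead < 0x80) {
            continuationBytes = 0; // ASCII: identical in Latin-1 and UTF-8
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            continuationBytes = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            continuationBytes = 2;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            continuationBytes = 3;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }
        // The sequence must not run past the end of the string.
        if (continuationBytes > text.size() - i - 1) {
            return false;
        }
        for (std::size_t k = 1; k <= continuationBytes; ++k) {
            const unsigned char next = static_cast<unsigned char>(text[i + k]);
            if ((next & 0xC0) != 0x80) {
                return false; // expected a 10xxxxxx continuation byte
            }
        }
        i += continuationBytes + 1;
    }
    return true;
}
```

In Latin-1, "ñ" is the single byte 0xF1 and "ç" is 0xE7; in UTF-8 they are 0xC3 0xB1 and 0xC3 0xA7. Passing the Latin-1 bytes to this check returns false, while the UTF-8 bytes and any pure ASCII string pass.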

This commit addresses the issue by explicitly converting text between Latin-1 and UTF-8 as it is transmitted between the client and the server.

The solution at hand may incur too much overhead and should be evaluated.
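For readers unfamiliar with what such a conversion involves, the following is a rough, dependency-free sketch. It is not the code in this PR, and the function names `latin1ToUtf8` and `utf8ToLatin1` are hypothetical. It relies on the fact that Latin-1 code points 0x00..0xFF coincide with Unicode U+0000..U+00FF, so every non-ASCII Latin-1 byte becomes a two-byte UTF-8 sequence. Replacing unrepresentable characters with '?' on the way back is an assumption made for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: Latin-1 -> UTF-8.
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);
    for (const char ch : latin1) {
        const unsigned char byte = static_cast<unsigned char>(ch);
        if (byte < 0x80) {
            utf8.push_back(static_cast<char>(byte)); // ASCII passes through
        } else {
            // U+0080..U+00FF encode as 110000xx 10xxxxxx.
            utf8.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return utf8;
}

// Hypothetical helper: UTF-8 -> Latin-1.
// Code points above U+00FF (and malformed input) become '?'.
std::string utf8ToLatin1(const std::string& utf8) {
    std::string latin1;
    latin1.reserve(utf8.size());
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {
            latin1.push_back(static_cast<char>(lead));
            ++i;
            continue;
        }
        const bool hasNext = i + 1 < utf8.size();
        const unsigned char next =
            hasNext ? static_cast<unsigned char>(utf8[i + 1]) : 0;
        if ((lead == 0xC2 || lead == 0xC3) && hasNext && (next & 0xC0) == 0x80) {
            // Two-byte sequence covering U+0080..U+00FF.
            latin1.push_back(static_cast<char>(((lead & 0x03) << 6) | (next & 0x3F)));
            i += 2;
        } else {
            latin1.push_back('?');
            ++i;
            // Skip the continuation bytes of the unrepresentable sequence.
            while (i < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return latin1;
}
```

Both functions make a single pass over the input, so the cost is linear in the length of each string. Whether that is acceptable on hot paths is the overhead question raised above.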

How to reproduce

  1. From server to client (character name is not displayed correctly):

https://github.com/otland/forgottenserver/assets/34030065/1432cb35-dda6-4bb7-97d9-6d29ea8dc8c0

  2. From client to server (can't log in because the password hash is Latin-1 encoded):

https://github.com/otland/forgottenserver/assets/34030065/0587773c-de43-4f8b-a363-629b364b1ca3

Issues addressed: #4560

ranisalt commented 6 months ago

I'd rather use Boost.Locale to do that with less code

Edit: more specifically, std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string, "Latin1") does the trick

amatria commented 6 months ago

> I'd rather use Boost.Locale to do that with less code
>
> Edit: more specifically, std::string utf8_string = boost::locale::conv::to_utf<char>(latin1_string, "Latin1") does the trick

Done. TYSM for the suggestion

nekiro commented 6 months ago

shouldn't this happen before query is pushed to db instead of message parse?

amatria commented 6 months ago

> shouldn't this happen before query is pushed to db instead of message parse?

Quoting my comment in #4560:

[…]

One possible solution is to use the CONVERT() function in SQL to convert text from Latin-1 to UTF-8 before storing it in the database:

INSERT INTO table (utf8_column) VALUES (CONVERT('latin1_text_ñññççç' USING utf8));

Similarly, we can use CONVERT() to convert the UTF-8 encoded text back to Latin-1 encoding:

SELECT CONVERT(utf8_column USING latin1) AS latin1_column FROM table;

However, I'm hesitant to consider this solution as it may introduce compatibility issues with existing code, which currently assumes that strings are encoded in UTF-8 format. Another approach could involve addressing the encoding mismatch at the game protocol level, implementing explicit conversion mechanisms to handle text encoding conversions seamlessly during data transmission between the client and server.

Thoughts?

ranisalt commented 6 months ago

It is better to handle everything as UTF-8 in the server and convert from Latin-1 as early as possible (or to Latin-1 as late as possible), since UTF-8 is the de facto standard. We don't want to handle Latin-1 everywhere.