File encoding issues causing corruption of Unicode characters

stevewgr commented 4 months ago

Description

This issue has been a thorn in my side for far too long. Let's dive into a little story:

Picture this: You're knee-deep in code, making a ton of changes that you haven't staged yet. You're on a roll, but then, out of nowhere, your text editor decides to save the file with a different encoding. Just as you're about to showcase your brilliant work to the world, you review the git diff and see a mess of random changes that have nothing to do with your actual edits. That's when you realize your text editor has messed up the encoding, and you end up yelling at your screen: "Nooooooooooooo... fml."

Here's the deal: Text editors like Visual Studio or Visual Studio Code try to guess the file encoding when opening files. But guessing isn't foolproof. Some libraries do a better job than others, but not all editors use the same libraries. For instance, Visual Studio Code relies on the jschardet library.
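To illustrate why guessing can go wrong, here is a minimal Python sketch (not how jschardet itself works). It shows that the same bytes decode without error under several encodings, so a detector can only rank candidates by likelihood. The sample text and the candidate list are assumptions chosen for the demo.

```python
# Sample Korean comment text (illustrative), stored as EUC-KR bytes.
korean = "한글 주석"
raw = korean.encode("euc_kr")

# Hypothetical list of encodings an editor might consider.
candidates = ["euc_kr", "cp1252", "latin_1", "utf_8"]


def valid_decodings(data: bytes, encodings: list[str]) -> dict[str, str]:
    """Return every encoding that decodes `data` without raising."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # this encoding can be ruled out for these bytes
    return results


decoded = valid_decodings(raw, candidates)
for name, text in decoded.items():
    # More than one candidate survives, but only one result is the real text.
    print(f"{name:8} -> {text!r}")
```

Only UTF-8 can be ruled out here by strict decoding. The other candidates all produce some string, and the editor has to pick one.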

When your text editor guesses the encoding incorrectly, saving the file can corrupt the bytes of the text. This is especially problematic for non-ASCII text, such as Korean or Chinese comments, because those bytes get interpreted as different characters.
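The corruption can be reproduced in a few lines of Python. This is a sketch under assumptions: the file is taken to be EUC-KR on disk, and the editor is taken to guess Windows-1252 and save as UTF-8. Real editors may pick other combinations.

```python
original = "// 한글 주석"

# Bytes as originally saved (assumed EUC-KR for this demo).
on_disk = original.encode("euc_kr")

# The editor guesses wrong and decodes the bytes as Windows-1252.
in_editor = on_disk.decode("cp1252")

# The editor then saves what it believes the text is, as UTF-8.
saved = in_editor.encode("utf_8")

# Reopening the saved file shows mojibake instead of the Korean comment.
reopened = saved.decode("utf_8")
print(repr(reopened))
print("bytes unchanged:", saved == on_disk)

# The damage can be undone only if both encodings involved are known.
repaired = reopened.encode("cp1252").decode("euc_kr")

# A lossy decode (for example ASCII with replacement) cannot be undone,
# because every non-ASCII byte collapses into U+FFFD.
lossy = on_disk.decode("ascii", errors="replace")
```

Every byte of the Korean text changes on disk, which is why the git diff fills up with unrelated changes.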

In our project, we use the autoGuessEncoding feature: https://github.com/ko4life-net/ko/blob/2.4.0/.vscode/settings.json#L2. But not everyone uses the same editor or system locale, so their editors might behave differently on different machines.

This has become a significant hassle. Many pull requests end up broken due to incorrect encoding because the author didn't double-check their changes before pushing. Plus, it would be great to see the original text with its special characters intact, even if it's in a different language, since we can always translate it.

Let's put an end to this encoding nightmare and ensure our code shines as it should!

Files

Example of broken commit: https://github.com/ko4life-net/ko/pull/161/commits/cafb59bd0c18805d9fbc427c24c633836b011ded

To Reproduce

Open a file containing Korean characters using an encoding other than the one it was saved in, such as ASCII, and save it. You'll see the characters are now corrupted.

Tasks

To fix this, we can transcode text-based files to a more portable and universal encoding, such as UTF-8 or UTF-16, depending on what each file type expects.

[x] Transcode Microsoft resource files to UTF-16 LE
[x] Transcode all text-based files to UTF-8
[x] Convert all Windows CRLF line endings to Unix LF
[x] Implement a script that does all of this in an automated fashion to make it less prone to human error
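A minimal Python sketch of what such a script could look like. This is not the script that was actually merged. The fallback source encoding (`cp949`), the `.rc` suffix mapping, and the choice to write a BOM for UTF-16 LE are all assumptions made for illustration.

```python
import codecs
from pathlib import Path

# Assumption: files without a BOM are either UTF-8 or a Korean legacy code page.
FALLBACK_ENCODINGS = ("utf-8-sig", "cp949")
# Assumption: Microsoft resource files are identified by this suffix.
RESOURCE_SUFFIXES = {".rc"}


def decode_bytes(data: bytes) -> str:
    """Decode using a BOM if present, else try the fallback encodings in order."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")  # the BOM selects the byte order
    for encoding in FALLBACK_ENCODINGS:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


def transcode(data: bytes, suffix: str) -> bytes:
    """Return `data` re-encoded for its file type, with CRLF converted to LF."""
    text = decode_bytes(data).replace("\r\n", "\n")
    if suffix.lower() in RESOURCE_SUFFIXES:
        # Assumption: resource files get UTF-16 LE with a leading BOM.
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return text.encode("utf-8")


def transcode_file(path: Path) -> bool:
    """Rewrite `path` in place; return True if its bytes changed."""
    original = path.read_bytes()
    converted = transcode(original, path.suffix)
    if converted != original:
        path.write_bytes(converted)
        return True
    return False
```

Calling `transcode_file(Path("some_file.cpp"))` on each tracked text file would apply all three tasks in one pass. Trying strict UTF-8 before the legacy code page is a deliberate ordering, since legacy Korean text almost never forms valid UTF-8, but a file in some other legacy encoding would still need its own entry in the fallback list.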

SeniourMarquies commented 4 months ago

Giving my RSCtable pull request as an example made me proud of myself. Thank you :)

stevewgr commented 4 months ago

Merged.
