Wrong Encoding Detection (UTF-8 file with German umlauts detected as Windows-949 Korean)

ThelloD commented 2 years ago

notepad3.exe incorrectly identifies a UTF-8 text file that contains German umlauts as Korean (Windows-949). All sensitive information was removed from the attached file (the original document contained more text), but the attached file still triggers the issue.

I'm using Notepad3 (x64) v5.21.1129.1. Attached file: wrong-encoding.txt

Tested with the default configuration as well as with UTF-8 specified as the fallback encoding: (screenshot)

In both cases the encoding is still incorrectly identified, even after renaming the file on disk (to make sure that the encoding information is not read from the history):

(screenshot)

hpwamr commented 2 years ago

So, is UTF-8 correctly detected. 🤔

(screenshot: 2022-02-09_160703)

even after renaming the file on disk (to make sure that the encoding information is not read from the history):

Tip: An easy way to avoid the "history encoding" is to uncheck Settings --> Remember --> Remember Recent Files.

hpwamr commented 2 years ago

My opinion is that in your example, you have too many "German Umlauts" characters in 3 small lines. 😉

This sample is correctly detected as UTF-8. (just change Für to Fur)... 🤔 :

Fur X ist X zuständig.
 verlässt.
- "Verlässt

ThelloD commented 2 years ago

My opinion is that in your example, you have too many "German Umlauts" characters in 3 small lines. 😉

As explained in my post, the original file contained more text (296 bytes instead of 52 bytes, so still not a large file but much longer than the attached one), and I mainly kept only the words with an umlaut. I was able to delete most other words while still triggering the incorrect detection, but the original file didn't contain any more umlauts.

So yeah, the attached file does contain an extraordinary number of umlauts, but the problem also occurs when many more regular ASCII characters are included. Actually, the attached file doesn't make a lot of sense (none at all tbh :D), but I couldn't attach the original one because it included sensitive data and personal information. That's why I reduced the file content to a minimum ;)

RaiKoHoff commented 2 years ago

Notepad3's encoding detection is based on UCHARDET (https://www.freedesktop.org/wiki/Software/uchardet/). There are lots of options and parameters (confidence level to accept a result, training of language-specific character sets, etc.) to tune this library. Some parameters are switchable via Notepad3.ini. Obviously, there are too few characters in this file for a reliable detection. Providing some more characters (e.g. duplicating the last line) increases the confidence value to 100% UTF-8.

If you are dealing with UTF-8 (no BOM/SIG) files only, you can switch off the encoding detector. Or you can leave a hint about the encoding in the file (e.g. encoding: UTF-8), which lets the encoding-file-tag parser override the UCHARDET detection result with the specified encoding.
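To illustrate the file-tag idea, here is a minimal Python sketch of how an in-file hint could override a detector's guess. It is not Notepad3's actual parser: the tag pattern, the 512-byte scan limit, and the function names `encoding_from_tag` and `choose_encoding` are assumptions made for this example.

```python
import codecs
import re

# Hypothetical tag pattern: "encoding: NAME" or "coding=NAME" near the top of the file.
_TAG = re.compile(rb"(?:en)?coding[:=]\s*([A-Za-z0-9_.\-]+)")


def encoding_from_tag(data: bytes, scan_limit: int = 512):
    """Return the codec named by an in-file tag, or None if absent or unknown."""
    match = _TAG.search(data[:scan_limit])
    if match is None:
        return None
    name = match.group(1).decode("ascii")
    try:
        # Normalise the name and reject encodings Python does not know.
        return codecs.lookup(name).name
    except LookupError:
        return None


def choose_encoding(data: bytes, detected: str) -> str:
    """Prefer an explicit in-file tag over the statistical detector's guess."""
    return encoding_from_tag(data) or detected


tagged = "encoding: UTF-8\nFür X ist X zuständig.\n".encode("utf-8")
print(choose_encoding(tagged, detected="cp949"))
```

With the tag present, the (wrong) detector guess passed in as `detected` is ignored; without a tag, the guess is returned unchanged.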

By the way: for your original snippet, UCHARDET has a confidence of 99% (which is above the 92% threshold for reliability) that it is "Korean (Windows-949)". Some Korean training data for the AI of UCHARDET evidently causes this obviously wrong suggestion for decoding the byte stream.
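The role of the threshold can be sketched in a few lines of Python. This is only an illustration of the accept-or-fall-back decision described above, not Notepad3's code; the function name `accept_detection` and the fallback parameter are assumptions, and the two numbers are the ones quoted in this comment.

```python
RELIABILITY_THRESHOLD = 0.92  # the reliability threshold quoted in this thread


def accept_detection(candidate: str, confidence: float, fallback: str = "utf-8") -> str:
    """Use the detector's candidate only if its confidence reaches the threshold."""
    if confidence >= RELIABILITY_THRESHOLD:
        return candidate
    return fallback


# A confidently wrong guess passes the check and is accepted:
print(accept_detection("cp949", 0.99))
# A low-confidence guess is discarded in favour of the fallback:
print(accept_detection("cp949", 0.50))
```

A threshold only filters out uncertain guesses; it cannot help when the detector is wrong but certain, which is the situation reported here.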

RaiKoHoff commented 2 years ago

While I was writing the comment above, you posted another one: ASCII characters are not taken into account by encoding detection, because they have the same encoding in all ANSI language encodings and in UTF-8. The non-ASCII characters are what matter for correct ANSI CP-XXXX / UTF-8 encoding detection.
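A small Python sketch shows how little evidence such a file offers. The sample text is adapted from the lines quoted earlier in this thread (it is not the attached file), and the helper names are made up for this example. It uses Python's built-in `cp949` codec, which corresponds to Windows-949.

```python
def non_ascii_bytes(data: bytes) -> bytes:
    """Keep only the bytes that carry encoding evidence (values >= 0x80)."""
    return bytes(b for b in data if b >= 0x80)


def decodes_as(data: bytes, encoding: str) -> bool:
    """True if the byte stream is formally valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


sample = "Für X ist X zuständig.\n verlässt.\n"
raw = sample.encode("utf-8")

print(len(raw), "bytes in total,", len(non_ascii_bytes(raw)), "of them non-ASCII")
print("valid UTF-8: ", decodes_as(raw, "utf-8"))
print("valid CP-949:", decodes_as(raw, "cp949"))
print(repr(raw.decode("cp949")))  # each umlaut's two bytes become one Hangul syllable
```

Both decodings succeed, so byte-level validity alone cannot separate the two encodings here. The decision rests on statistics over a handful of non-ASCII bytes.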

ThelloD commented 2 years ago

Not sure if it's actually helpful since besides umlauts the original file only contained ASCII characters (therefore might not help a lot to detect the correct encoding), but I just re-created the file with the encoding problem.

This time, it is very similar to the original file in terms of structure and length/size. However, instead of deleting sensitive data, I replaced as much as possible using a Lorem Ipsum generator:

Attached: wrong-encoding2.txt (screenshot)

Further, I even added 5 paragraphs (438 words) to the end of the file, and it is still incorrectly detected as Korean: wrong-encoding2b-long.txt

(screenshot)

Therefore, the issue really isn't that there were too many umlauts compared to other characters. 😉

ThelloD commented 2 years ago

ASCII characters are not taken into account by encoding detection, because they have the same encoding in all ANSI language encodings and in UTF-8. The non-ASCII characters are what matter for correct ANSI CP-XXXX / UTF-8 encoding detection.

So, what are our options to improve the configuration or detection mechanism, without requiring the user to change the configuration manually? I mean, of course you could change the Notepad3 configuration etc., but in my opinion it should work out of the box. For comparison, "good"-old notepad.exe, notepad2.exe and VS Code detected the file reliably.

I've tested my file again and randomly added a single "ä", and then it was detected fine. So having more special characters might help, but German text doesn't always contain many umlauts: many sentences and sometimes whole paragraphs don't include a single one.

And this file was not constructed as a technical proof-of-concept for an edge case that would break encoding detection: it was just a short note with regular text that I had written using notepad3 but that couldn't be reliably opened with notepad3.exe.

For somebody tech-savvy it's not a huge issue; I immediately knew that I had to select the encoding manually. But I've seen notepad2.exe as a notepad.exe replacement in a corporate setup (rolled out to all machines), and in such cases you can't expect everyone to know how this issue can be solved...

hpwamr commented 2 years ago

For comparison, "good"-old notepad.exe, notepad2.exe and VS Code detected the file reliably.

Hello @RaiKoHoff ,

I've tested the short text (wrong-encoding.txt) on 15 text editors (configuration: Out-of-the-Box):

EditPadLite, EditPlus, EmEditor, MS Notepad, Notepad++, Notepad2, Notepad2e, Notepad2-mod, Notepad2-zufuliu, Notepad3, SciTE, Sublime, TextEditorPro, UltraEdit, VSCode (13 on 15) 👍 😏

ONLY 2 text editors: "Notepad3 and SciTE" open this text with bad characters (2 on 15) 👎 🤔

craigo- commented 2 years ago

@hpwamr, I respect your scientific and rigorous methodology 👍

RaiKoHoff commented 2 years ago

@hpwamr : And how do they all perform on our non-UTF-8 text file test collection? I know Notepad++ uses UCHARDET encoding detection too (maybe with other training data for the detector). (If someone only knows UTF-8, he is 100% right on UTF-8 files, but 0% right on other encodings, except ASCII. In other words, the UCHARDET test tries to be both highly specific and highly sensitive (https://en.wikipedia.org/wiki/Sensitivity_and_specificity), but there is a trade-off between the two.)
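The trade-off can be made concrete with a small Python sketch of a "UTF-8 first" heuristic. This is an illustration only, not what any of the listed editors is confirmed to do; the function name, the `cp1252` fallback, and the three labelled samples are assumptions chosen for the example.

```python
def detect_utf8_first(data: bytes, fallback: str = "cp1252") -> str:
    """Report UTF-8 whenever the bytes are valid UTF-8, else a fixed fallback."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


# (bytes, true encoding) pairs: one UTF-8 file and two legacy-encoded files.
SAMPLES = [
    ("Für X ist X zuständig.".encode("utf-8"), "utf-8"),
    ("Für X ist X zuständig.".encode("cp1252"), "cp1252"),
    ("한국어 텍스트".encode("cp949"), "cp949"),
]

for data, truth in SAMPLES:
    guess = detect_utf8_first(data)
    print(f"truth={truth:7} guess={guess:7} {'ok' if guess == truth else 'WRONG'}")
```

The heuristic gets the UTF-8 file right and can only ever name one legacy encoding, so the Korean sample is mislabelled as the fallback. A statistical detector can tell legacy encodings apart, at the price of occasionally misjudging short UTF-8 files such as the one in this issue.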

UCHARDET is trained on natural language files - maybe this is like cracking a nut with a sledgehammer for a "Programmers-Editor" 🤔

hpwamr commented 2 years ago

ONLY 2 text editors: "Notepad3 and SciTE" open this text with bad characters (2 on 15) 👎 🤔

Now Notepad3 also correctly detects your small file: (wrong-encoding.txt) --> (14 on 15) 👍 😃

Hello @ThelloD , @craigo- ,

Feel free to test the "BETA/RC PortableApps", version "Notepad3Portable_5.22.211.1_beta.paf" or newer, see 1st list in issue #1129.

Notepad3Portable_5.22.211.1_beta.paf.exe.7z -s

"Notepad3Portable BETA/RC PortableApps" version can be used with or without ".7z" extension.

Also, feel free to test the "BETA/RC Setup", version "Notepad3_5.22.211.1_beta_Setup" or newer, see the 2nd list in issue #1129.

Comments and suggestions are welcome... 😃

ThelloD commented 2 years ago

Thanks @hpwamr for providing the updated version! The good news is that the issue is solved for the test files I've shared earlier.

However, the bad news is that another file that contains an umlaut is now incorrectly recognized as DOS-852.

Please see the attached screenshot: the latest stable version is shown on the left side, the provided beta (portable version) on the right side.

(screenshot)

RaiKoHoff commented 2 years ago

Not reproducible in the current beta version.

rizonesoft / Notepad3

Wrong Encoding Detection (UTF-8 file with German umlauts detected as Windows-949 Korean) #3934