Closed ThelloD closed 1 year ago
So, UTF-8 is correctly detected, even after renaming the file on disk (in order to make sure that the encoding information is not read from the history):
Tip: an easy way to avoid the "history encoding" is to uncheck Settings --> Remember --> Remember Recent Files.
My opinion is that in your example, you have too many German umlaut characters in 3 small lines.
This sample is correctly detected as UTF-8 (just change "Für" to "Fur"):
Fur X ist X zuständig.
verlässt.
- "Verlässt
> My opinion is that in your example, you have too many German umlaut characters in 3 small lines.
As explained in my post, the original file contained more text (296 bytes instead of 52 bytes - so still not a large file, but way longer than the attached one), and I mainly left only those words with an umlaut. I was able to delete most other words while still triggering the incorrect detection, but the original file didn't contain any more umlauts.
So yeah, the attached file indeed contains an extraordinary number of umlauts, but the problem also happens when many more regular ASCII characters are included. Admittedly, the attached file doesn't make a lot of sense (none at all, tbh :D), but I couldn't attach the original one because it included sensitive data and personal information. That's why I reduced the file content to a minimum ;)
Notepad3's encoding detection is based on UCHARDET (https://www.freedesktop.org/wiki/Software/uchardet/). There are lots of options and parameters (confidence level to accept a result, training of language-specific character sets, etc.) to tune this library. Some parameters are switchable via Notepad3.ini. Obviously, there are too few characters in this file for a reliable detection. Providing some more characters (e.g. duplicating the last line) increases the confidence value to 100% UTF-8:
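For reference, this is roughly how a caller drives the stock uchardet C API (a minimal sketch; the confidence threshold and the Notepad3.ini tuning parameters sit on top of this and are not part of the public interface shown here):

```c
/* detect.c - minimal charset detection with the stock uchardet C API.
 * Build: cc detect.c $(pkg-config --cflags --libs uchardet)
 */
#include <stdio.h>
#include <uchardet/uchardet.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    FILE *fp = fopen(argv[1], "rb");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    uchardet_t ud = uchardet_new();

    char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        /* Feed raw bytes; the detector accumulates statistics, so a
         * short file simply gives it little evidence to work with. */
        if (uchardet_handle_data(ud, buf, n) != 0) {
            break;
        }
    }
    uchardet_data_end(ud);  /* finalize and pick the best candidate */

    /* An empty string means "no reliable guess". */
    const char *charset = uchardet_get_charset(ud);
    printf("detected charset: %s\n", *charset ? charset : "(unknown)");

    uchardet_delete(ud);
    fclose(fp);
    return 0;
}
```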
If you are dealing with UTF-8 (no BOM/SIG) files only, you can switch off the encoding detector. Or you can leave a hint about the encoding in the file (e.g. encoding: UTF-8), which will let the encoding-file-tag parser override the UCHARDET detection result with the specified encoding:
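For illustration, a file carrying such a tag could start like this (whether the file-tag parser accepts any syntax beyond the literal encoding: UTF-8 hint quoted above is an assumption on my part):

```
encoding: UTF-8
Fur X ist X zuständig.
verlässt.
```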
By the way: for your provided original snippet, UCHARDET has a confidence of 99% (which is above the 92% threshold for reliability) that it is "Korean (Windows-949)" - whatever in the Korean training data of UCHARDET causes this obviously wrong suggestion for decoding the byte stream.
While I was writing the above comment, you posted another one: ASCII characters are not taken into account by the encoding detection, because they have the same encoding in all ANSI language encodings and in UTF-8. The non-ASCII characters are what matters for correct ANSI CP-XXXX / UTF-8 encoding detection.
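To illustrate why (a hand-written sketch, not Notepad3 code): every byte below 0x80 is plain ASCII and maps to the same character in all Windows ANSI code pages and in UTF-8, so only bytes at or above 0x80 carry evidence for the detector.

```c
#include <stddef.h>

/* Count the bytes that can distinguish ANSI code pages from UTF-8.
 * Bytes < 0x80 are plain ASCII and identical in all of them, so they
 * contribute nothing to the detection decision. (Illustrative only.) */
static size_t detection_relevant_bytes(const unsigned char *buf, size_t len)
{
    size_t relevant = 0;
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] >= 0x80) {
            /* e.g. UTF-8 "ä" contributes the two bytes 0xC3 0xA4 */
            ++relevant;
        }
    }
    return relevant;
}
```

For the 52-byte sample above, only the umlaut bytes end up in that count, which is why the file offers the detector so little evidence.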
Not sure if it's actually helpful, since besides umlauts the original file only contained ASCII characters (and therefore might not help a lot to detect the correct encoding), but I just re-created the file with the encoding problem.
This time it is very similar to the original file in terms of structure and length/size; however, instead of deleting sensitive data, I replaced as much as possible using a Lorem Ipsum generator:
Further, I even added 5 paragraphs (438 words) to the end of the file - it is still incorrectly detected as Korean: wrong-encoding2b-long.txt
Therefore, the issue really isn't that there were too many umlauts compared to other characters.
> ASCII characters are not taken into account by the encoding detection, because they have the same encoding in all ANSI language encodings and in UTF-8. The non-ASCII characters are what matters for correct ANSI CP-XXXX / UTF-8 encoding detection.
So, what are our options to improve the configuration or detection mechanism without requiring the user to change the configuration manually? I mean, of course you could change the Notepad3 configuration etc., but in my opinion it should work out of the box. For comparison, "good" old notepad.exe, notepad2.exe and VS Code detected the file reliably.
I've tested my file again and randomly added a single "ä", and then it was detected fine. So having more special characters might help, but it's not as if German always contains many umlauts - many sentences and sometimes whole paragraphs don't include even a single one.
And this file wasn't constructed as a technical proof of concept for an edge case that breaks the encoding detection: it was just a short note with regular text that I had written using Notepad3, but that couldn't be reliably opened with notepad3.exe.
For somebody tech-savvy it's not a huge issue; I immediately knew that I had to select the encoding manually. But I've seen notepad2.exe as a notepad.exe replacement in a corporate setup (rolled out to all machines), and in such cases you can't expect everyone to know how this issue can be solved...
> For comparison, "good" old notepad.exe, notepad2.exe and VS Code detected the file reliably.

Hello @RaiKoHoff,
I've tested the short text (wrong-encoding.txt) in 15 text editors (configuration: out of the box):
- EditPadLite, EditPlus, EmEditor, MS Notepad, Notepad++, Notepad2, Notepad2e, Notepad2-mod, Notepad2-zufuliu, Notepad3, SciTE, Sublime, TextEditorPro, UltraEdit, VSCode (13 of 15 open it correctly)
ONLY 2 text editors, "Notepad3 and SciTE", open this text with bad characters (2 of 15).
@hpwamr, I respect your scientific and rigorous methodology
> ONLY 2 text editors, "Notepad3 and SciTE", open this text with bad characters (2 of 15).
@hpwamr: And how do they all perform on our non-UTF-8 text file test collection? I know Notepad++ uses UCHARDET encoding detection too (maybe with different training data for the detector). (If a detector only knows UTF-8, it is 100% right on UTF-8 files but 0% right on other encodings, except ASCII. In other words, UCHARDET tries to be both highly specific and highly sensitive (https://en.wikipedia.org/wiki/Sensitivity_and_specificity), but there is a trade-off between the two.)
UCHARDET is trained on natural-language files - maybe this is like cracking a nut with a sledgehammer for a programmer's editor.
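To make the sensitivity/specificity trade-off concrete (a generic sketch with made-up counts, not the actual UCHARDET test harness), treating "file is UTF-8" as the positive class:

```c
#include <stdio.h>

/* Confusion-matrix counts for a detector benchmark; the numbers below
 * are invented for illustration. Positive class: "file is UTF-8". */
struct confusion {
    int tp;  /* UTF-8 file detected as UTF-8           */
    int fn;  /* UTF-8 file detected as something else  */
    int tn;  /* non-UTF-8 file detected as non-UTF-8   */
    int fp;  /* non-UTF-8 file detected as UTF-8       */
};

int main(void)
{
    struct confusion c = { .tp = 95, .fn = 5, .tn = 80, .fp = 20 };

    double sensitivity = (double)c.tp / (c.tp + c.fn); /* true positive rate */
    double specificity = (double)c.tn / (c.tn + c.fp); /* true negative rate */

    /* A detector that always answers "UTF-8" scores sensitivity 1.0 but
     * specificity 0.0 - the trade-off described above. */
    printf("sensitivity = %.2f, specificity = %.2f\n", sensitivity, specificity);
    return 0;
}
```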
> ONLY 2 text editors, "Notepad3 and SciTE", open this text with bad characters (2 of 15).
Now Notepad3 also correctly detects your small file (wrong-encoding.txt) --> (14 of 15)
Hello @ThelloD, @craigo-,
Feel free to test the "BETA/RC PortableApps" version, "Notepad3Portable_5.22.211.1_beta.paf" or newer; see the 1st list in issue #1129.
The "Notepad3Portable BETA/RC PortableApps" version can be used with or without the ".7z" extension.
Also, feel free to test the "BETA/RC Setup" version, "Notepad3_5.22.211.1_beta_Setup" or newer; see the 2nd list in issue #1129.
Comments and suggestions are welcome...
Thanks @hpwamr for providing the updated version! The good news is that the issue is solved for the test files I shared earlier.
However, the bad news is that another file that contains an umlaut is now incorrectly recognized as DOS-852.
Please see the attached screenshot: the latest stable version is shown on the left, the provided beta (portable version) on the right.
Not reproducible in the current beta version.
notepad3.exe incorrectly identifies a UTF-8 text file that contains German umlauts as Korean (Windows-949). All sensitive information was removed from the attached file (the original document contained more text), but the attached file still triggers the issue.
I'm using Notepad3 (x64) v5.21.1129.1. Attachment: wrong-encoding.txt
Tested without (=default configuration) as well as with specifying UTF-8 as fallback:
In both cases the encoding is still incorrectly identified, even after renaming the file on disk (in order to make sure that the encoding information is not read from the history):