Incorrect character set detection

levicki commented 2 years ago

Consider the following text:

Superscript[¹](#notes)

Blah...

...

#notes

 - This is an anchor.

The ¹ character in the text above is Unicode Superscript One (U+00B9).

If you save this as an UTF8 file using latest version of Notepad++ on the latest version of Windows 10, then close and reopen the file it will be detected as Thai TIS-620 character set and ¹ character will be garbled:

It seems that the presence of non-alpha character (i.e. symbols or digits) immediately before ¹ is triggering the problem.

ArkadiuszMichalski commented 2 years ago

Duplicate of several other https://github.com/notepad-plus-plus/notepad-plus-plus/search?q=TIS-620&type=issues.

levicki commented 2 years ago

@ArkadiuszMichalski The oldest of those issues was submitted in 2015. Given that it has obviously been reported many times since then and yet remains unresolved, does closing this issue as a duplicate of previous unresolved issues mean it will never get resolved?

How is it even possible that anything else over the last 6 years has been more important to work on when this issue causes silent data corruption? That most certainly fits into definition of a showstopper issue.

dail8859 commented 2 years ago

How is it even possible that anything else over the last 6 years has been more important to work on when this issue causes silent data corruption?

This is assuming there are paid developers actively fixing user issues. There is not. With any open source project, a helpful person has to care enough about an issue to want it fixed...and then do so.

levicki commented 2 years ago

@dail8859 I understand how open-source development works.

What I don't understand is how this data corruption issue doesn't get any priority for 6 years and counting while people obviously voluntarily work on other stuff?

It makes as much sense as putting a new coat of pearlescent paint on a car whose brakes occasionally don't work.

With any open source project, a helpful person has to care enough about an issue to want it fixed...and then do so.

They should care about users who are facing the issue, not about the issue itself.

Without those users who use the program and report the issues, they may as well stop working on it.

On the other hand, if those "helpful persons" are only doing it for themselves as a hobby project and cherry-picking what they like working on while avoiding fixing stuff like this, then for the approx. 90% of the users it doesn't really matter if the program is open or closed-source.

Those users can't fix issues like this themselves and they ultimately don't care if the code is open or not -- they just care whether their files are getting corrupted or not and whether problem will be fixed or not. If it won't, they will find and use another program which doesn't corrupt their files.

Therefore, those "helpful persons" working on this program should therefore gather on Discord (or whatever is preferred collaboration platform these days), and decide whether they want to fix this issue or not and let us know -- it's only fair and it shouldn't be too much to ask.

alankilborn commented 2 years ago

@levicki

Can you upload your problem test.txt file directly?

levicki commented 2 years ago

@alankilborn Here you go:

test.txt

Although you could have just copy-pasted the sample text and saved it as UTF8.

As I already noted above, symbols and digits (I tried braces, brackets and parens as well as 0-9) directly preceeding ¹ character seem to trigger wrong decoding detection when you open the file. From my quick test it seems that if ¹ is preceeded by any characters in [A-Za-z] set that doesn't happen (though I admit I haven't tried the whole range, just start and end characters).

alankilborn commented 2 years ago

@levicki

Although you could have just copy-pasted the sample text and saved it as UTF8.

Your complaint was with something resulting from a file load, so...the best place to start is with providing the file.

So I would assume there is something about this data that makes it appear as valid TIS-620 data (whatever that is). It must be a strong enough appearance that the auto-detection of encoding routines pick TIS-620 over UTF-8.

I notice that if I tell Notepad++ to NOT auto-detect encoding:

then the file will load as UTF-8.

Alternatively, changing the file format to UTF-8 with BOM also works, and auto-detect can be left enabled. The concept of BOM with UTF-8 is, admittedly, a bit repugnant.

So I would suggest to you doing one of those two things.

Whether or not there is a bug here or not is a subjective matter. Without more evidence, I would say "no bug"; maybe I'd just call it an "unfortunate circumstance".

levicki commented 2 years ago

@alankilborn

Your complaint was with something resulting from a file load, so...the best place to start is with providing the file.

Yes, and I provided it as a text because it was simple and short. When you asked for a file I just copied it off the page and saved it with Notepad++, then attached it here.

I notice that if I tell Notepad++ to NOT auto-detect encoding then the file will load as UTF-8.

If I am not mistaken, it will load with whatever encoding you had selected the last time you used it which doesn't necessarily mean it will load as UTF-8. That means you will explicitly have to check it every time which defeats the purpose of having auto-detect in the first place.

Alternatively, changing the file format to UTF-8 with BOM also works...

It works unless the file is input for other software which doesn't require or support BOM for UTF-8 files. Like pretty much anything else that supports UTF-8 including HTML documents where BOM before document node will trigger quirks mode in every browser out there.

So I would suggest to you doing one of those two things.

I could have come up with those unacceptable workarounds myself, including the ultimate "Don't use Notepad++" workaround which you forgot to mention.

Whether or not there is a bug here or not is a subjective matter.

So it is a feature then? I wonder who benefits from it? 🙄

Without more evidence...

You have ample evidence of broken character set auto-detection starting from 2015 until now reported in 16 different issues.

The case I submitted is fully reproducible, what more evidence do you need that auto-detection is broken?

Here, take this file and remove .txt extension, then open it in any modern browser and tell me how many do not detect UTF-8 encoding (despite <meta charset="UTF-8"> header not being present on purpose), and how many do not show the ¹ character properly?

test.html.txt

I'll wait.

alankilborn commented 2 years ago

it will load with whatever encoding you had selected the last time you used it which doesn't necessarily mean it will load as UTF-8

Hmm, well, that's not my experience, but experiences vary.

including the ultimate "Don't use Notepad++" workaround which you forgot to mention.

You seem like a smart person; maybe you'd better exercise that workaround.

I'll wait

Don't wait too long.

Afterthought: Some additional thoughts are found here: https://community.notepad-plus-plus.org/post/34948

levicki commented 2 years ago

@alankilborn

Hmm, well, that's not my experience, but experiences vary.

It will open existing documents with whatever encoding is selected for new documents, because there is no separate option to select default encoding for opening existing documents.

So depending on what the user has configured there (and I am sure a lot of non-English speaking people might still have to use their codepage for various compatibility reasons), disabling auto-detection may not work for them as a workaround.

Afterthought: Some additional thoughts are found here

Or you know, instead of asking users to:

Turn off a broken feature
Add BOM to UTF-8 files
Use a plugin for what should be working built-in functionality in this kind of application
Not use Notepad++

Maybe you could just replace buggy existing auto-detection code based on Mozilla's legacy, poor performing, chardet (history lesson on that here) with something that will certainly work better, like Google's CED, which is used by Chrome, Google Search, and GMail?

You seem like a smart person too; maybe you'd better exercise that workaround instead of alienating users who are trying to help you improve your application?

If you are interested in trying it out, here is a small patch for CED's CMakeLists.txt (Google Test Framework download link is broken):

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 074df0f..aa7a79a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -12,10 +12,10 @@ if (WIN32)

   if (NOT EXISTS "gtest")
     message(STATUS "Google Test not present.  Fetching from the web...")
-    file(DOWNLOAD "https://github.com/google/googletest/archive/master.zip" ${CMAKE_SOURCE_DIR}/master.zip)
+    file(DOWNLOAD "https://github.com/google/googletest/archive/refs/tags/release-1.11.0.zip" ${CMAKE_SOURCE_DIR}/master.zip)
     execute_process(COMMAND ${CMAKE_COMMAND} -E tar x master.zip)
     file(REMOVE master.zip)
-    file(RENAME googletest-master gtest)
+    file(RENAME googletest-release-1.11.0 gtest)
   endif()

   # Configure gtest.

Note that Google Test Framework is used only for unit-testing and is not a dependency for the library itself. I just built it successfully using Visual Studio 2022, Platform Toolset v143, and latest Windows SDK and it passed all unit tests. You can probably also include the source code directly instead of integrating the static library project into your code base.

Don't wait too long.

Replacing existing code shouldn't take long for people familiar with Notepad++ code base. I'd say an hour of work at most.

Of course, based on my past experience with fellow software developers volunteer and professional alike when confronted with issues like this, I won't be holding my breath.

alankilborn commented 2 years ago

...trying to help you improve your application?

Well, not MY application... I'm just another user with a degree of success in helping others with their problems by offering workarounds, or extensions to Notepad++ functionality. Of course, sometimes such suggestions are not welcome or they fall flat, but usually some good discussion results. :-)

A suggestion for a different auto-detect scheme might be reasonable. But the idea might be rejected by Notepad++ developers because it could break detection that users currently count on, even if that detection is not the greatest in certain cases (yours as an example). I've noticed that the devs don't like to take risks, even to improve functionality, if the change is large or possibly interferes with backward compatibility. I suppose the only way to know for sure is for someone to attempt integration of a different autodetect engine, see if it gets accepted, and then watch to see how well it works.

levicki commented 2 years ago

@alankilborn I didn't mean literally your of course.

But the idea might be rejected ... because it could break ...

Excuses, always excuses, and not even good ones. Same with paid professionals (Microsoft .Net Framework, Visual Studio, and Windows Terminal developers that I interacted with on GitHub) and volunteers alike.

I've noticed that the devs don't like to take risks...

On the contrary, they don't give a damn about risks when something they personally like / agree with should be implemented. TL;DR -- they are biased and have huge egos just like everyone else.

I suppose the only way to know for sure is for someone to attempt integration of a different autodetect engine...

There is already a historical comparison test for Mozilla's universal chardet and Google's CED -- Firefox started losing the battle with Chrome (and that admission comes straight from FF developers at the link I posted in my previous comment), because users didn't have to choose encodings manually in Chrome.

Google even removed Encoding menu from Chrome, that's how much better CED was than uchardet, especially on short text (think page titles). Mozilla has since discarded uchardet and wrote a new one (chardetng) from scratch which works at least as good as CED, possibly better, but it is written in Rust hence my CED recommendation as a drop-in replacement.

If you insist on testing, then just take the text samples from those 16 UTF-8 misdetection issues reported over the past 6 years and run CED on them -- you don't even have to integrate it for testing, you can just slap a console executable together and call the detection function from the library on those files.

There's even easier option -- create HTML files from those text samples like I did with mine above, open them in Chrome, and see if they get detected correctly or not.

alankilborn commented 2 years ago

@levicki

You offer the devs some good advice, I think. Let's see where they take it.

ArkadiuszMichalski commented 2 years ago

@levicki CED also has its own problems. There is no perfect detector, the only solution is to implement a few and allow users to choose.

It's worth analyzing the following two threads for Notepad2 and Notepad3:

https://github.com/zufuliu/notepad2/issues/334 << "We already used very fast UTF-8 validation codes (from https://github.com/zwegner/faster-utf8-validator and https://bjoern.hoehrmann.de/utf-8/decoder/dfa/)."
- https://github.com/rizonesoft/Notepad3/issues/973 << in source I see both CED and uchardet, But looks like here is more control over them:

I checked the above example in Notepad3 and it correctly detects UTF-8 (even if I switch to OFF all above checkboxes and change default to other than UTF-8).

@donho Maybe it's worth analyzing how Notepad3 works here. This problem with incorrect UTF-8 detection as TIS-620 keeps repeating from time to time. Incorrect detection of UTF-8 (currently the most popular encoding) in fact may lead to data corruption that we may not even notice.

donho commented 2 years ago

Funny thing is, when I Googled "CED vs uchardet", the top result is Change Encoding-Detector from CED to UCHARDET of Notepad3. I'll definitely check Notepad3 to learn how they manage it better than Notepad++.

levicki commented 2 years ago

@ArkadiuszMichalski @donho Sorry if I came across as claiming that CED is perfect, that wasn't my intention. I just wanted you to reconsider changing detection code to something better, whatever that turns out to be.

Unfortunately, the reality is that all those detectors were shaped by the web and its "standards". For example, new Mozilla's chardetng omits some encodings (skip to the sub-heading titled Included and Excluded Encodings).

It is highly unlikely that, short of rolling your own, you will find a detector that works properly for all encodings.

Now that I think about it, perhaps the best solution would be not to use external detectors which are geared for the web and its subset of most commonly used encodings, but instead use the operating system provided facilities for detection?

On Windows, that could be MLang COM interface, specifically method IMultiLanguage2::DetectInputCodepage.

As for the other operating systems, they should all provide a method to do the same. If they don't, the only remaining option I see is using ICU (particularily ICU4C).

Finally, I would like to suggest that if detection fails to generate reliable result (which assumes you are using a detector which returns detection confidence), the application should say that it is unable to detect the encoding reilabily and ask the user to choose the encoding (or at least have a reasonable fallback which can be configured) instead of silently corrupting the file.

alankilborn commented 2 years ago

This may be another manifestation: https://community.notepad-plus-plus.org/post/72367

levicki commented 2 years ago

@alankilborn Not surprising, most people know nothing about character encodings and expect things to "just work".

donho commented 2 years ago

It seems the encoding problem is only on wrong detection of UTF8 to TIS-620. Do you confirm it @ArkadiuszMichalski ?

libove commented 2 years ago

Thanks @alankilborn for referring me (in my thread on the NPP Community) to this Issue #10916 And @levicki thank you for the detailed analysis, encoding detection library comparisons, history, suggestions, etc that seem to have enabled the developers in seeking a solution.

Regarding my particular posting to which Alan responded in https://github.com/notepad-plus-plus/notepad-plus-plus/issues/10916#issuecomment-1001100362 above https://community.notepad-plus-plus.org/post/72367, I had Notepad++ set to defaults in terms of encoding for new files, automatic detection, etc. And it opened a text file that I created myself on a Windows 10 system with its locale set to English (Ireland) on which I used multiple keyboards (including Spanish and Catalan).

According to Cygwin "file" this txt file is "UTF-8 Unicode text, with very long lines, with CRLF line terminators". So, NPP itself can create a file using its default configuration for new file encoding and for encoding automatic detection which it itself then misinterprets as TIS-620.

Regarding levicki's comment about people for whom the defaults may not work, I'm surely one of them: I work in English, Español, and Català, and routinely see files (albeit mostly not in "text" formats) in various other languages.

Levicki commented (correctly) that most people know nothing about character encodings and expect things to "just work". (Alan is also right that getting things to "just work" is hard. But I'll add my vote to the column for ".. but necessary in order for relatively small things to make very big projects like Notepad++ less useful or at least daily frustrating to many people). I'm aware of encodings. At least, back when I was programming decades ago I was. I'd forgotten. Because mostly things had come to "just work".

I can't offer programming or comparative library help the way you folks can. But I'm very happy to beta test. (I have excellent backups (̿▀̿ ̿Ĺ̯̿̿▀̿ ̿)̄ )

regards, Jay Libove

donho commented 2 years ago

I have checked Notepad3 to learn how it deals with this issue, then I found out an unofficial repo of uchardet is used in Notepad3. "It seems" this modified uchardet has a more refine result, since after processing, a "confidence" (float) result can be grabbed, and different results HANDLE_DATA_RESULT_DETECTED, HANDLE_DATA_RESULT_NEED_MORE_DATA and HANDLE_DATA_RESULT_ERROR can be returned.

Here are results with some tests of this modify version:

With the provided sample in this issue (the text with [¹] ), uchardet detects it always as TIS-620, and HANDLE_DATA_RESULT_NEED_MORE_DATA is returned.
With a long pure Tai text copied from Wikipedia, pasted in the Notepad++ encoded in TIS-620, saved it and reopen it - uchardet detects it always as TIS-620, and HANDLE_DATA_RESULT_NEED_MORE_DATA is returned.
We have some regression due to the update of uchardet to 0.0.6 (revert after several regressions found), tested with issue #5299, the regression remains.

I have checked in the code of Notepad3, in case of HANDLE_DATA_RESULT_NEED_MORE_DATA, it does its own additional analysis with the value of "confidence". It's not easy to assimilate into Notepad++, since the logic of processing in both software is not the same.

The newest (v0.0.7) release of official uchardet has also been tested with this issue, #5299 and #5310 - with all of them uchardet 0.0.7 generates the wrong results.

To fix UTF-8 detected wrongly as TIS-620 issue (#10958, the PR above), I disable the detection of TIS-620, to prevent from the wrong detection for the moment, for the reason of audience (the amount of UTF-8 users > the amount of TIS-620 users). TIS-620 users can always set encoding to TIS-620 manually.

Changing the module of language/encoding detection will be a goal in the middle term, if there's a better one. I'll do some investigation with this subject.

libove commented 2 years ago

If you'd like me to test, just point me to a test build (Windows 10, 64-bit). Thanks. -Jay

I have checked Notepad3 to learn how it deals with this issue, then I found out an unofficial repo of uchardet is used in Notepad3. "It seems" this modified uchardet has a more refine result, since after processing, a "confidence" (float) result can be grabbed, and different results HANDLE_DATA_RESULT_DETECTED, HANDLE_DATA_RESULT_NEED_MORE_DATA and HANDLE_DATA_RESULT_ERROR can be returned.

Here are results with some tests of this modify version:

With the provided sample in this issue (the text with [¹] ), uchardet detects it always as TIS-620, and HANDLE_DATA_RESULT_NEED_MORE_DATA is returned.

With a long pure Tai text copied from Wikipedia, pasted in the Notepad++ encoded in TIS-620, saved it and reopen it - uchardet detects it always as TIS-620, and HANDLE_DATA_RESULT_NEED_MORE_DATA is returned.

We have some regression due to the update of uchardet to 0.0.6 (revert after several regressions found), tested with issue Cannot keep utf-8 and umlauts when saving new file #5299, the regression remains.

I have checked in the code of Notepad3, in case of HANDLE_DATA_RESULT_NEED_MORE_DATA, it does its own additional analysis with the value of "confidence". It's not easy to assimilate into Notepad++, since the logic of processing in both software is not the same.

The newest (v0.0.7) release of official uchardet has also been tested with this issue, #5299 and #5310 - with all of them uchardet 0.0.7 generates the wrong results.

To fix UTF-8 detected wrongly as TIS-620 issue (#10958, the PR above), I disable the detection of TIS-620, to prevent from the wrong detection for the moment, for the reason of audience (the amount of UTF-8 users > the amount of TIS-620 users). TIS-620 users can always set encoding to TIS-620 manually.

Changing the module of language/encoding detection will be a goal in the middle term, if there's a better one. I'll do some investigation with this subject.

alankilborn commented 2 years ago

@libove

If you'd like me to test, just point me to a test build

There isn't much to "test", but you can see how to test a PR build under that heading, here: https://github.com/notepad-plus-plus/notepad-plus-plus/wiki/Testing

libove commented 2 years ago

Tested, seems to work, thank you. I opened my (already very slightly corrupted) text file. (An accented é character appears in TIS-620 as "?"). I fixed the é character and saved. Closed NPP+ PR debug. Re-opened. Shows UTF-8, shows the é. Closed. Re-opened v8.1.9.3; still TIS-620 of course; now instead of showing the accented character as "?" it shows it as a pair of Thai characters "รฉ" 😹

@libove

If you'd like me to test, just point me to a test build

There isn't much to "test", but you can see how to test a PR build under that heading, here: https://github.com/notepad-plus-plus/notepad-plus-plus/wiki/Testing

superbonaci commented 2 years ago

Still happening in latest downloadable build Notepad++ v8.1.9.3.

alankilborn commented 2 years ago

@superbonaci See here: https://github.com/notepad-plus-plus/notepad-plus-plus/issues/10492#issuecomment-1001850587

superbonaci commented 2 years ago

Working on latest build: #10960

levicki commented 2 years ago

@donho

With a long pure Tai text copied from Wikipedia, pasted in the Notepad++ encoded in TIS-620, saved it and reopen it - uchardet detects it always as TIS-620, and HANDLE_DATA_RESULT_NEED_MORE_DATA is returned.

If even in that case it returns HANDLE_DATA_RESULT_NEED_MORE_DATA then it's broken. Providing pure TIS-620 encoded sample should give full confidence result, not ask for more data.

I'd say that the only option is to not use uchardet at all, or eventually when it fails use another detector and compare results.

donho commented 2 years ago

@levicki

If even in that case it returns HANDLE_DATA_RESULT_NEED_MORE_DATA then it's broken. Providing pure TIS-620 encoded sample should give full confidence result, not ask for more data.

Yes, that's why Notepad3 has its another detection function after wrong result of uchardet call.

I'd say that the only option is to not use uchardet at all, or eventually when it fails use another detector and compare results.

There's no 100% reliable encoding detection module which exists. So far uchardet of the old version that Notepad++ is using works quite good IMO (except some bad detection and this issue). Why not to add another detection module, if very little performance will lose.

levicki commented 2 years ago

Why not to add another detection module, if very little performance will lose.

I agree, I would implement 3 detectors and use the result of the two that agree -- for example if two say it is UTF-8 and one says it is TIS-620 then it's not TIS-620.

notepad-plus-plus / notepad-plus-plus

Incorrect character set detection #10916