UTF Default encoding problem with cyrillic ansi files

muhanov-apps commented 6 years ago

Hello!

Notepad3 has default ANSI encoding and works fine with cyrillic text content in ANSI 1251 files. But when we set Default Encoding to UTF8 these files opens with this

privet xEFxF0xE8xE2xE5xF2

instead of

privet привет

and shows UTF encoding format in bottom bar

Is it possible to fix detection of ansi encoding in files when UTF is set default for new ones? Or it's Scintilla problem?

RaiKoHoff commented 6 years ago

Hi, it is not a Scintilla problem. It is a problem of the UTF-8 detection method: ANSI files and UTF-8 (w/o Signature) are the same, if they are using the first 127 ASCII characters . If the document uses only the first 127 characters (ASCII), it does not matter if it is loaded as ANSI (independent from code page (CP)) or UTF-8, the look the same. On the other hand, if the file reader finds a byte sequence, which is not ASCII, it is difficult to decide wether it is an UTF-8 or ANSI sequence, and if it is ANSI, to which Code-Page does it fit? So the encoding classifier needs more help (reading more bytes from file stream, or compare it with predefined Code-Page (e.g. ANSI Cyrillic CP1251) or using Default-Settings (UTF-8), if it is not able to decide. I have to admit, that I didn't take a look at Notepad2/3's encoding classifier, I guess that the classifier if not perfect. But maybe an enhancement could help (w/o touching the classifier):

leave the Default Encoding the same as your Windows Codepage (ANSI Cyrillic CP1251)
(to be implemented !): Have a 2nd Default Encoding Setting (e.g. UTF-8) for creating new files.

Your workaround for the current being is:

Keep Default Encoding at UTF-8
Recode wrong detected files with File -> Encoding -> Recode to ANSI (Ctrl+Shift+A), which reloads current file with Windows Default ANSI Code Page immediately.

data-man commented 6 years ago

@RaiKoHoff Can I suggest the good library for an encoding classifying? :) Compact Encoding Detection by Google. Also tellenc can be helpful & improved.

muhanov-apps commented 6 years ago

@RaiKoHoff take a look at encoding classifier in the latest notepad2-mod 4.2.25.998 It works as expected. And changes encoding from default UTF-8 to ANSI when I open that example file

RaiKoHoff commented 6 years ago

Notepad2-mod prefers ANSI with current Windows CodePage over UTF-8 or Default Settings, if in doubt. On my machine NP2-mod opens your file with "ANSI CP-1252", which gives wrong results too. Notepad3 prefers "Default Settings" over "ANSI current CodePage", assuming that files were encoded in that "Default" encoding format. I am going to reconsider this decision ...

By the way: If you re-code your file as "ANSI Cyrillic CP-1251" and save it again, Notepad3 remembers that format, as long as it is available from History (Alt+N).

@data-man : Thank you for suggestion (Onigmo was a nice one), if I find some time, I can take a closer look on that.

muhanov-apps commented 6 years ago

ANSI with current Windows CodePage

I think this is what this setting for )

RaiKoHoff commented 6 years ago

@data-man : tellenc is a promising single C-source file containing a byte-sequence analyzer based on some country specific ANSI byte tuple occurrences. @muhanov-apps : But this database lacks data for Cyrillic (CP1251) - if you are able to provide typical character pairs and their ANSI "windows-1251" resp. "cp437" encodings, this detector can be enhanced:

static freq_analysis_data_t freq_analysis_data[] = { { 0x9a74, "windows-1250" }, // "št" (Czech) { 0xe865, "windows-1250" }, // "če" (Czech) { 0xf865, "windows-1250" }, // "ře" (Czech) { 0xe167, "windows-1250" }, // "ág" (Hungarian) { 0xe96c, "windows-1250" }, // "él" (Hungarian) { 0xb36f, "windows-1250" }, // "ło" (Polish) { 0xea7a, "windows-1250" }, // "ęz" (Polish) { 0xf377, "windows-1250" }, // "ów" (Polish) { 0x9d20, "windows-1250" }, // "ť " (Slovak) { 0xfa9d, "windows-1250" }, // "úť" (Slovak) { 0x9e69, "windows-1250" }, // "ži" (Slovenian) { 0xe869, "windows-1250" }, // "či" (Slovenian) { 0xe020, "windows-1252" }, // "à " (French) { 0xe920, "windows-1252" }, // "é " (French) { 0xe963, "windows-1252" }, // "éc" (French) { 0xe965, "windows-1252" }, // "ée" (French) { 0xe972, "windows-1252" }, // "ér" (French) { 0xe4e4, "windows-1252" }, // "ää" (Finnish) { 0xe474, "windows-1252" }, // "ät" (German) { 0xfc72, "windows-1252" }, // "ür" (German) { 0xed6e, "windows-1252" }, // "ín" (Spanish) { 0xf36e, "windows-1252" }, // "ón" (Spanish) { 0x8220, "cp437" }, // "é " (French) { 0x8263, "cp437" }, // "éc" (French) { 0x8265, "cp437" }, // "ée" (French) { 0x8272, "cp437" }, // "ér" (French) { 0x8520, "cp437" }, // "à " (French) { 0x8172, "cp437" }, // "ür" (German) { 0x8474, "cp437" }, // "ät" (German)

RaiKoHoff commented 6 years ago

After some debugging and code analysis: Notepad3 is correct, Notepad2-mod is wrong 😀 , cause:

The dialog Encoding -> Set Default... obviously specifies default settings for file opening:

This implies: "If you are not able to detect the file encoding, use this encoding setting". (A side-effect is, that this encoding is used on new file creation time too).

Both, Notepad2-mod and Notepad3, are not able to detect the correct encoding, so:

Notepad3: uses the Default Encoding defined by the user 👍
Notepad2-mod: ignores the user's setting and uses a fallback to Regional Windows Code Page 👎

So, regarding correct file encoding:

Notepad3 fails - the user gives a wrong advice. 👎
Notepad2-mod succeeded - by "accidentally" choosing the correct encoding for that file. 🤔
Notepad2-mod fails in every case (using this detection fallback), where the encoding is different to the Regional Windows Code Page.

Now an endless discussion may start about "Default Encoding should be UTF-8, not regional dependent Code Pages, etc., pp."... The point is: the solution for your problem is (besides a better code page detector (tellenc), you need a 2nd Default Encoding setting for creating files. (To smoothly handle both worlds (Unicode UTF-8 vs. Regional Code Page) concurrently - I think this is the reason, Microsoft extended the UTF-8 Spec. by UTF-8 Signature, to have a fail-safe encoding detection).

By the way: You may specify encoding tags, to help NP2/3 on encoding detection: ([Settings] NoEncodingTags=0 ; for using them ; and not switched OFF in Set Default... dialog)

encoding:windows-1251 привет

data-man commented 6 years ago

@RaiKoHoff I hope that within a few days I will improve the tellenc's results. The problem is that need to have a lot of files for different languages (Russian, Ukrainian, Belarusian and many others) and for each of them the tellenc will give different results.

Do you think the Compact Encoding Detection library is too large?

muhanov-apps commented 6 years ago

@RaiKoHoff I completely agree with you decision to do encoding detection in "right way". But... but... imho in real life it's not that bad to use current Regional Windows Code Page. As non-english speaking person I choose it especially because I "want" programs to use it in undetermined situations. I (and so does 99.9% others) never, never in my 15 years of programming life opened ansi files in code pages other than my native language 1251 (which is configured in windows regional settings) and english ones. So as the last possible option... I think it's acceptable

RaiKoHoff commented 6 years ago

First I have to apologize, cause I have been a little bit provocatively, just to attract more people to discuss this topic. Sometimes, I try to take a look from the diametrical point of view - the opposite of my opinion, just to get a bigger picture, maybe I change my opinion. My opinion is: if I am not able to detect the encoding (vote for better encoding detection), I have to use a fallback. In this case:

use ANSI encoding (open question: which Code Page - Preference: Regional Setting's Code Page)
use UTF-8 (no Signature, otherwise I would have detected it) encoding

So the decision made in:

Notepad2-mod is: use ANSI - Regional Setting's Code Page,
Notepad3 is: use user's settings for default encoding (for new files).

I will go to enhance the encoding detector, but the above decision is still open in case of detection failure. Which way shall we go?

@data-man : Yes I thing the "Compact Encoding Detection" by Google Inc. is a "jack of all trades device" (in German: "eierlegende Wollmilchsau - direct translation: oviparous sow, giving ham, wool and milk"). It's generated tables and case differentiations (~1M) will blow up the executable a lot. It seems like using a sledgehammer to crack a nut.

I like the lean and mean way of tellenc's heuristic, even if it will be not perfect. It will be getting better, if you are going to enhance the heuristic by missing languages.

RaiKoHoff commented 6 years ago

The beta version 3.18.301.913 is online. I reverted the behavior of the encoding detector and added a new option, to restore Notepad3 former behavior (this is a change of behavior, cause it is switched off by default):

Notepad3Portable_3.18.301.913.zip

Please test this version. @data-man : tellenc is not implemented (yet) in this version.

muhanov-apps commented 6 years ago

@RaiKoHoff beta works fine on opening 1251.

RaiKoHoff commented 6 years ago

@muhanov-apps : To answer your reply (https://github.com/rizonesoft/Notepad3/issues/387#issuecomment-369513002): The main problem is not to distinguish between different ANSI CP Encodings (in 99.9% it is the system's CP), but to detect if it is ANSI or UTF-8 (no Sig). There are good reasons to prefer UTF-8 over ANSI (see UTF-8 Everywhere manifesto), I trimmed Notepad3 more in the direction of UTF-8 - maybe a little bit too much 🤔 😄 Hope that @data-man 's enhanced tellenc will be a significant progress...

RaiKoHoff commented 6 years ago

@data-man : I integrated the "tellenc" code (migrated from C++ to plain C) , currently as an "assist" on suggestions for ANSI Code Page detection, if the traditional detection chain fails. @craigo- , @AlexIljin, @data-man , et al. : Since there has been a major impact on the code due to refactoring, a thoroughly test of the versions in next week will be highly appreciated :grin: . First to start with new beta : 3.18.302.915: Notepad3Portable_3.18.302.915.zip For testers, who try to test the enhanced Code Page detection feature: keep in mind, that the file history stores the encoding along with the file name - this encoding overrides possible detection results, so it has to be deleted to rely on the detection feature. @data-man : as a first draft, I used the KOI8-R Cyrillic master dblByte sequences (of tellenc), migrated them to CP-1251 and used them for CP-1251 detection. Nevertheless your future work on this topic will be integrated ...

muhanov-apps commented 6 years ago

@RaiKoHoff still works with 1251 ))

RaiKoHoff commented 6 years ago

@muhanov-apps :sweat_smile: puhh...

RaiKoHoff commented 6 years ago

@muhanov-apps : please check this issue against latest release (v3.18.311.928) and close this issue, if the problem has been solved.

muhanov-apps commented 6 years ago

@RaiKoHoff after several days... and in this release I've encountered multiple times when enc detection was wrong (1252 instead of 1251) especially in files with mixed content of English and Cyrillic words.

If I correctly understand, detection based on symbol sequences has it's limitations )) It works only when there are many Cyrillic words. And for small files and files with mostly english content and a few Cyrillic words it fails. For example "граната" -> 1251 "граната " -> 1252 (just "граната" + one space)

RaiKoHoff commented 6 years ago

Right, the encoding detection is limited to a heuristic (which is not advanced, yet), that will fail on these special cases. The solution will be a new option, that will skip the heuristic based detection using windows (locale) code page as default.

muhanov-apps commented 6 years ago

@RaiKoHoff as I wrote before... my personal vote for existence of option to skip heuristic and use windows locale )

RaiKoHoff commented 6 years ago

@muhanov-apps : Please try development version (_develop_3.18.312.930) [beta].

Notepad3Portable_develop_3.18.312.930.zip

muhanov-apps commented 6 years ago

what's the purpose of Skip Unicode Detection? with or without it 1251 ansi files are detected as UTF and that's another kind of problem )

RaiKoHoff commented 6 years ago

"Skip UNICODE detection" should forced skip the UNICODE detection, regardless of Byte Order Marks (BOM) or Signature (UTF-8 Sig). If it does not, it is a bug. Please provide an example text file (anonymized/synthetic) for debug purposes. Ensure, that the Unicode Encoding has not been written to File History (Alt-H), that forces NP3 to use this persisted encoding (Notepad3.ini: section [Recent Files]).

muhanov-apps commented 6 years ago

https://blog.muhanov.net/files/ansi1251test.txt

to be clear, we are talking about ansi<->UTF detection problem, not choosing between different types of UTF or different ansi codepages

muhanov-apps commented 6 years ago

@RaiKoHoff sorry, my bad! I forgot to unckeck upper checkbox "Use as fallback on detection failure" when UTF-8 selected as default. Without it 1251 ansi files are detected fine to windows locale

but you may use that file for testing "tellenc" code ))

RaiKoHoff commented 6 years ago

So there is no "forced skip of UNICODE detection (except fallback)" not working anymore? Ed.: (The given file is working fine for me, except is the wrong CP - my locale :wink:)

muhanov-apps commented 6 years ago

all seems fine for now... I will test beta today a bit more... then close issue

RaiKoHoff commented 6 years ago

@muhanov-apps : Thanks for spending effort on testing to enhance Notepad3 😄

RaiKoHoff commented 6 years ago

@data-man : your effort spend in enhancing tellenc's detection quality is welcome for future releases.

data-man commented 6 years ago

@RaiKoHoff Ok. I have destroyed my OS and will migrate to another (NixOS). I guess it will take a long time. :(

RaiKoHoff commented 6 years ago

@data-man : 😲 , I wish you great success 👍

muhanov-apps commented 6 years ago

@RaiKoHoff it seems to me @data-man decided to migrate to gentoo from scratch and practice in solving encoding hell there )))

data-man commented 6 years ago

@RaiKoHoff I have an idea: extract the encoding detect code into a separate repository. May be tellenc.h? (I like header-only libs). Then it will be easier for me to do a PR.

@muhanov-apps Not Gentoo. NixOS is the best! :) And I will install Windows as a second OS. (I used it only in VirtualBox).

RaiKoHoff commented 6 years ago

@ data-man : Separating encoding detection tables (the heart of tellenc) from code base (next PR will reflect this) Tables: EncTables.zip

RaiKoHoff commented 6 years ago

@data-man : I implement a trial, using Google's "Compact Encoding Detection". It increases the size of the binary by ~10%, which is acceptable I think, so I will give it a try. Just an Info to stop investigation for "tellenc" enhancement.

data-man commented 6 years ago

@RaiKoHoff Oh! Great! But I will continue anyway for personal needs. :)

See also uchardet. E.g., Far Manager use this library (very old version).

RaiKoHoff commented 6 years ago

@data-man : development version of Compact Encoding Detection NP3 available (_develop_3.18.313.932) Notepad3Portable_develop_3.18.313.932.zip

Switch OFF Detector Skip to activate detector: (NP3 does not support all detectable encodings (yet))

rizonesoft / Notepad3

UTF Default encoding problem with cyrillic ansi files #387