notepad-plus-plus / notepad-plus-plus

Notepad++ official repository
https://notepad-plus-plus.org/

[Feature request] Autodetect an encoding when saving untitled .html files #15899

Open pawelzwronek opened 1 day ago

pawelzwronek commented 1 day ago

Is there an existing issue for this?

Description of the Issue

  1. Paste the code below into an untitled tab and save it as .html. Note the charset=ISO-8859-1 declaration.
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
</head>
<body>
Ax  NBSP    ¡   ¢   £   ¤   ¥   ¦   §   ¨   ©   ª   «   ¬   SHY     ®   ¯</br>
Bx  °   ±   ²   ³   ´   µ   ¶   ·   ¸   ¹   º   »   ¼   ½   ¾   ¿</br>
Cx  À   Á   Â   Ã   Ä   Å   Æ   Ç   È   É   Ê   Ë   Ì   Í   Î   Ï</br>
Dx  Ð   Ñ   Ò   Ó   Ô   Õ   Ö   ×   Ø   Ù   Ú   Û   Ü   Ý   Þ   ß</br>
Ex  à   á   â   ã   ä   å   æ   ç   è   é   ê   ë   ì   í   î   ï</br>
Fx  ð   ñ   ò   ó   ô   õ   ö   ÷   ø   ù   ú   û   ü   ý   þ   ÿ</br>
</body>
  2. Open the saved file in a web browser. You'll see:

image

Describe the solution you'd like.

I propose autodetecting the encoding from the file content when an untitled buffer is saved as .html, and switching from the default UTF-8 to the detected encoding. The current version of N++ already performs this autodetection when opening an .html file.

With the proposed autodetection in place, the encoding of the saved file would switch to ISO-8859-1 and the browser would display the characters correctly.
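
To illustrate the idea, here is a minimal sketch of reading the declared charset from the buffer text before saving. The function name `detectHtmlCharset`, the 1024-character scan window, and the regular expression are assumptions made for this example; this is not Notepad++ code and not how `getHtmlXmlEncoding` is implemented.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (illustration only, not Notepad++ code).
// Returns the charset label declared in an HTML <meta> tag,
// or an empty string when no declaration is found.
std::string detectHtmlCharset(const std::string& html)
{
    // Assumption: only the start of the document is inspected,
    // since <meta> declarations normally appear in <head>.
    const std::string head = html.substr(0, 1024);

    // Matches <meta charset="x"> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=x">.
    static const std::regex re(
        R"(<meta\b[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9_.:-]+))",
        std::regex::icase);

    std::smatch m;
    if (std::regex_search(head, m, re))
        return m[1].str();
    return std::string();
}
```

On the sample from the report, this would yield the label `ISO-8859-1`, which the save path could then use instead of the default UTF-8.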

Debug Information

Notepad++ v8.7.1   (32-bit)
Build time : Oct 31 2024 - 00:41:42
Path : C:\Program Files (x86)\Notepad++\notepad++.exe
Command Line : 
Admin mode : OFF
Local Conf mode : OFF
Cloud Config : OFF
Periodic Backup : ON
OS Name : Windows 10 Enterprise (64-bit)
OS Version : 22H2
OS Build : 19045.4780
Current ANSI codepage : 1250
Plugins : 
    combine (1)
    ComparePlugin (2.0.2)
    JSMinNPP (1.2205)
    LanguageHelp (1.7.5)
    mimeTools (3.1)
    NppCCompletionPlugin (1.19)
    NppConverter (4.6)
    NppExport (0.4)
    NppTextFX (0.2.6)
    RunMe (1.4.1)
    XMLTools (3.1.1.13)

Anything else?

No response

softmgr commented 11 hours ago

This type of HTML file can only detect the keyword charset=ISO-8859-1 in its content to obtain the string ISO-8859-1, which is then converted to a codepage. This is because the encoding obtained through the uchardet component has a confidence level of confidence = 0.5f;, making it uncertain what the encoding actually is.
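
As a rough sketch of the "string to codepage" step mentioned above: the helper name `codepageFromCharset` and the small lookup table are assumptions for illustration, not Notepad++ code. The numeric values are the standard Windows code page identifiers for these labels.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical helper (illustration only): maps a charset label taken from
// the file content to a Windows code page identifier, or -1 if unknown.
int codepageFromCharset(std::string label)
{
    // Charset labels are case-insensitive, so normalize to lower case.
    std::transform(label.begin(), label.end(), label.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Deliberately tiny table; a real implementation would cover many more labels.
    static const std::unordered_map<std::string, int> table = {
        {"utf-8", 65001},
        {"iso-8859-1", 28591},
        {"iso-8859-2", 28592},
        {"windows-1250", 1250},
        {"windows-1252", 1252},
    };

    const auto it = table.find(label);
    return it == table.end() ? -1 : it->second;
}
```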

pawelzwronek commented 8 hours ago

> This type of HTML file can only detect the keyword charset=ISO-8859-1 in its content to obtain the string ISO-8859-1, which is then converted to a codepage.

Notepad_plus::getHtmlXmlEncoding already does that. The question is whether this is expected and desired behaviour to implement.

> This is because the encoding obtained through the uchardet component has a confidence level of confidence = 0.5f;, making it uncertain what the encoding actually is.

Actually, uchardet detection is only executed when opening or reloading a file, and only when the encoding is set to UTF-8 or is undefined. Moreover, it is not executed when an .html or .xml file is opened and an encoding is detected with getHtmlXmlEncoding.
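
The order described above can be modelled roughly as follows. This is a simplified sketch, not Notepad++ source: `Detector`, `OpenContext`, and `chooseDetectorOnOpen` are hypothetical names, and the assumption that a declaration found in an .html/.xml file takes priority regardless of the configured encoding is mine.

```cpp
// Hypothetical model of the detection order on open/reload (illustration only).
enum class Detector { None, HtmlXmlDeclaration, Uchardet };

struct OpenContext {
    bool isHtmlOrXml;            // file has an .html or .xml extension
    bool declarationFound;       // a charset/encoding declaration was parsed from the content
    bool encodingIsUtf8OrUnset;  // configured encoding is UTF-8 or undefined
};

Detector chooseDetectorOnOpen(const OpenContext& ctx)
{
    // A declaration found in an .html/.xml file wins, so uchardet is skipped.
    if (ctx.isHtmlOrXml && ctx.declarationFound)
        return Detector::HtmlXmlDeclaration;

    // Otherwise uchardet runs only when the encoding is UTF-8 or undefined.
    if (ctx.encodingIsUtf8OrUnset)
        return Detector::Uchardet;

    return Detector::None;
}
```

The feature request amounts to applying the first branch when an untitled buffer is saved as .html, not only when a file is opened.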