jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

html file is not reported as UTF8 after conversion #381

Open hrvoj3e opened 8 months ago

hrvoj3e commented 8 months ago

Provide the file: 110-original.zip

Verbose output: Using the CLI, run normalizer -v ./my-file.txt and paste the result in here.

❯ # rm+unzip

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:42:49,817 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:42:49,821 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:42:49,821 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:42:49,830 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:42:49,830 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250

❯ normalizer -rfnvvv 110-original.htm
2023-11-08 16:39:42,180 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:39:42,183 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.783000 %
2023-11-08 16:39:42,184 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:39:42,192 | Level 5 | We detected language [('English', 0.6666), ('Indonesian', 0.5465), ('Dutch', 0.5131), ('Czech', 0.5052), ('Croatian', 0.4924), ('Slovak', 0.4878), ('Spanish', 0.4826), ('Italian', 0.4811), ('Slovene', 0.4773), ('Norwegian', 0.4647), ('Lithuanian', 0.458), ('Finnish', 0.458), ('Swedish', 0.4576), ('Romanian', 0.4563), ('Hungarian', 0.456), ('French', 0.4541), ('Danish', 0.4393), ('German', 0.4236), ('Polish', 0.4056), ('Portuguese', 0.4047), ('Vietnamese', 0.3819), ('Estonian', 0.3776), ('Turkish', 0.3677)] using cp1250
2023-11-08 16:39:42,192 | DEBUG | Encoding detection: cp1250 is most likely the one.
{
    "path": "/home/adax/code/other/encoding/110-original.htm",
    "encoding": "cp1250",
    "encoding_aliases": [
        "1250",
        "windows_1250"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "General Punctuation",
        "Latin Extended-A",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.783,
    "coherence": 66.66,
    "unicode_path": "/home/adax/code/other/encoding/110-original.htm",
    "is_preferred": true
}

❯ normalizer -mvvv 110-original.htm
2023-11-08 16:41:07,958 | Level 5 | Detected declarative mark in sequence. Priority +1 given for cp1250.
2023-11-08 16:41:07,961 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 1.267000 %
2023-11-08 16:41:07,962 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2023-11-08 16:41:07,970 | Level 5 | We detected language [('English', 0.7029), ('Indonesian', 0.572), ('Dutch', 0.51), ('Italian', 0.4949), ('Czech', 0.4862), ('Spanish', 0.4806), ('Croatian', 0.4724), ('Norwegian', 0.4692), ('Slovene', 0.4669), ('Romanian', 0.4632), ('Hungarian', 0.4624), ('Slovak', 0.4605), ('Finnish', 0.4565), ('German', 0.4533), ('Swedish', 0.4453), ('French', 0.443), ('Danish', 0.4366), ('Portuguese', 0.4116), ('Polish', 0.4113), ('Lithuanian', 0.3931), ('Estonian', 0.3828), ('Turkish', 0.3828), ('Vietnamese', 0.3795)] using cp1250
2023-11-08 16:41:07,970 | DEBUG | Encoding detection: cp1250 is most likely the one.
cp1250

enca, however, does detect UTF-8 after the conversion, as it should:

❯ # rm+unzip

❯ enca -L hr 110-original.htm
Unrecognized encoding

❯ normalizer -rfnvvv 110-original.htm

❯ enca -L hr 110-original.htm
Universal transformation format 8 bits; UTF-8
  CRLF line terminators

Expected encoding: I expected normalizer to report UTF-8 once the file had been converted to UTF-8. Am I wrong here?

Additional context: I know that HTML is not the same as plain text, but I wanted to document this here.

I think that "declarative mark" should not take over like that. But I am new to this encoding world....
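
For illustration only, here is a minimal sketch of the situation described above. It does not use charset_normalizer at all. The HTML string is a hypothetical stand-in for the attached file (its real contents are not shown in this thread), and the helper names is_valid_utf8 and declared_charset are made up for this example. It shows that after conversion the bytes are valid UTF-8, yet the markup still declares windows-1250, and those same bytes also happen to decode as cp1250 without error.

```python
import re

# Hypothetical stand-in for 110-original.htm: a cp1250 declaration plus a few
# Croatian letters (c-caron, c-acute, z-caron, s-caron, d-stroke).
html = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=windows-1250"></head>'
    "<body>\u010d\u0107\u017e\u0161\u0111</body></html>"
)

original = html.encode("cp1250")                        # file as shipped
converted = original.decode("cp1250").encode("utf-8")   # file after conversion


def is_valid_utf8(payload: bytes) -> bool:
    """Return True if the payload decodes strictly as UTF-8."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def declared_charset(payload: bytes):
    """Return the lowercased charset named in the markup, or None."""
    match = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', payload, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


print("original is valid UTF-8: ", is_valid_utf8(original))
print("converted is valid UTF-8:", is_valid_utf8(converted))
print("declaration in converted:", declared_charset(converted))
# The converted bytes still decode as cp1250 without raising (as mojibake),
# so a detector that trusts the declaration has no hard error to rule it out.
converted.decode("cp1250")
```

The stale declaration is the only thing that differs between what enca sees and what a declaration-aware detector sees, which matches the "Priority +1 given for cp1250" lines in the logs.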

Ousret commented 8 months ago

Yes, you are correct. What you have is somewhat of an edge case, but problematic nonetheless.

I think that "declarative mark" should not take over like that. But I am new to this encoding world....

Not entirely true, it's more complicated than that.

Fortunately, I know how to fix this. I don't know exactly when it will land, but soon. The idea is to do a regex replacement (a "preg replace") within the normalizer CLI if there is a declarative mark.
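
A rough sketch of what such a regex replacement could look like, using only the standard re module. This is not the actual charset_normalizer implementation: the function name rewrite_declared_charset and the pattern are assumptions made for illustration, and the pattern only covers the charset=... form found in HTML meta tags.

```python
import re

# Assumed pattern: matches `charset=NAME`, with or without an opening quote.
_CHARSET_DECL = re.compile(
    r'(?P<prefix>charset\s*=\s*["\']?)(?P<name>[A-Za-z0-9_\-]+)',
    re.IGNORECASE,
)


def rewrite_declared_charset(text: str, new_charset: str = "utf-8") -> str:
    """Replace the first declared charset name in already-decoded text.

    Hypothetical helper: call it on the decoded content, before the
    content is written back out in the new encoding.
    """
    return _CHARSET_DECL.sub(
        lambda m: m.group("prefix") + new_charset,
        text,
        count=1,
    )


before = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">'
print(rewrite_declared_charset(before))
```

Rewriting only the first match keeps the change limited to the declaration near the top of the document. A real fix would also need to decide what to do with other forms, such as an XML encoding="..." prolog.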