jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[DETECTION] Finnish in UTF-8 detected as Latin-1 when mistaken html meta element present #537

Closed jkseppan closed 1 month ago

jkseppan commented 1 month ago

Notice I hereby announce that my raw input is not :

Provide the file

An accessible way of retrieving the file concerned. Host it somewhere with untouched encoding.

https://jouniseppanen.fi/tmp/finnish-utf-8-latin-1-confusion.html

(Note that the web server adds a content type of text/html; charset=utf-8 which is correct, so your browser will likely show the text correctly.)

Verbose output

2024-10-02 08:40:59,849 | Level 5 | Detected declarative mark in sequence. Priority +1 given for latin_1.
2024-10-02 08:40:59,852 | Level 5 | latin_1 passed initial chaos probing. Mean measured chaos is 0.533000 %
2024-10-02 08:40:59,852 | Level 5 | latin_1 should target any language(s) of ['Latin Based']
2024-10-02 08:40:59,857 | Level 5 | We detected language [('English', 0.656), ('Hungarian', 0.5849), ('French', 0.578), ('Spanish', 0.5486), ('Norwegian', 0.5294), ('Dutch', 0.5243), ('Finnish', 0.5221), ('Indonesian', 0.5191), ('Italian', 0.5174), ('Estonian', 0.5152), ('Danish', 0.5047), ('Swedish', 0.4706), ('Slovene', 0.4669), ('Croatian', 0.4662), ('Portuguese', 0.4648), ('Czech', 0.4546), ('Romanian', 0.4492), ('German', 0.4409), ('Slovak', 0.4296), ('Turkish', 0.4224), ('Polish', 0.3995), ('Lithuanian', 0.3933), ('Vietnamese', 0.3714)] using latin_1
2024-10-02 08:40:59,857 | DEBUG | Encoding detection: latin_1 is most likely the one.
{
    "path": "/tmp/finnish-utf-8-latin-1-confusion.html",
    "encoding": "latin_1",
    "encoding_aliases": [
        "8859",
        "cp819",
        "csisolatin1",
        "ibm819",
        "iso8859",
        "iso8859_1",
        "iso_8859_1",
        "iso_8859_1_1987",
        "iso_ir_100",
        "l1",
        "latin",
        "latin1"
    ],
    "alternative_encodings": [],
    "language": "English",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.533,
    "coherence": 65.6,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

This should be UTF-8. One clue is that the output includes the word PÃ¤Ã¤tÃ¶sehdotus, which is a mangled version of Päätösehdotus.

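Most nontrivial Finnish text will include several instances of the character ä and possibly ö. Upper-case versions Ä and Ö are possible but less common. When UTF-8 is interpreted as Latin-1 or Windows-1252, each of these becomes a two-character sequence starting with Ã: ä (bytes C3 A4) becomes Ã¤ and ö (bytes C3 B6) becomes Ã¶ in both encodings, while Ä (C3 84) and Ö (C3 96) become Ã followed by a C1 control character in Latin-1, or Ã„ and Ã– in Windows-1252.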

The characters Ã, ¤, ¶ and „ do not appear in normal Finnish text. Ã (A with tilde) could possibly appear in foreign names, but would even then seem to be very unlikely in the middle of a word. ¤ is an obscure "currency sign" character, whose codepoint Latin-9 aka ISO-8859-15 reassigned to the euro sign, which does occur in Finnish text but would still be very unlikely in the combination Ã€. (The pilcrow might appear in some typography text and the lowered quote might appear in old-fashioned literature. The en dash is normal.)
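
The mapping described above can be reproduced with the standard library alone. The following is a minimal sketch; the helper names `mojibake` and `repair` are made up for illustration and are not part of charset_normalizer.

```python
# Sketch: show what UTF-8 bytes look like when they are decoded with the
# wrong single-byte codec. Helper names are hypothetical, stdlib only.

def mojibake(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode those bytes with wrong_codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair(garbled: str, wrong_codec: str = "latin_1") -> str:
    """Undo mojibake() by reversing the wrong decoding step."""
    return garbled.encode(wrong_codec).decode("utf-8")


if __name__ == "__main__":
    for ch in "äöÄÖ":
        raw = ch.encode("utf-8")
        # ascii() makes control characters such as U+0084 visible.
        print(
            ch,
            raw.hex(),
            ascii(mojibake(ch, "latin_1")),
            ascii(mojibake(ch, "cp1252")),
        )
```

Calling `repair(mojibake("Päätösehdotus", "latin_1"))` gives back the original word, which is why a clean round trip through UTF-8 is a strong hint that the bytes were UTF-8 all along.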

Additional context

My guess is that this kind of thing happens when someone set up a CMS in the 1990s, back when Finnish text was commonly encoded in Latin-1 or Windows-1252, and the data store was later switched to UTF-8 while the meta tags were left unchanged.
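
To illustrate the kind of cross-check I have in mind, here is a rough sketch that distrusts a declared Latin-1 charset when the bytes are valid UTF-8 and contain non-ASCII characters. This is only an assumption about a reasonable heuristic, not how charset_normalizer works internally and not the fix in #538; the function names and the sample HTML are hypothetical.

```python
import re
from typing import Optional

# Hypothetical, simplified pattern for a charset declared in a <meta> element.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset=["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)


def declared_charset(raw: bytes) -> Optional[str]:
    """Return the charset named in a meta element near the start, if any."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def is_non_ascii_utf8(raw: bytes) -> bool:
    """True if raw decodes strictly as UTF-8 and is not plain ASCII."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return not raw.isascii()


def pick_encoding(raw: bytes) -> Optional[str]:
    """Prefer UTF-8 over the meta declaration when the bytes are valid UTF-8."""
    if is_non_ascii_utf8(raw):
        return "utf_8"
    return declared_charset(raw)
```

Genuine Latin-1 text with ä and ö almost never forms valid UTF-8 by accident, because bytes such as E4 and F6 would have to be followed by continuation bytes in the 80 to BF range, so the strict decode acts as a cheap sanity check on the declaration.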

Ousret commented 1 month ago

This case has been fixed in #538. It will be available in the next release.