This should be UTF-8. One clue is that the output includes the word PÃ¤Ã¤tÃ¶sehdotus, which is a mangled version of Päätösehdotus.
Most nontrivial Finnish text will include several instances of the character ä and possibly ö. The upper-case versions Ä and Ö are possible but less common. When UTF-8 is interpreted as Latin-1 or Windows-1252, these become:
ä → \xc3\xa4 → Ã¤
ö → \xc3\xb6 → Ã¶
Ä → \xc3\x84 → Ã and a control character (Latin-1), or Ã„ (Windows-1252)
Ö → \xc3\x96 → Ã and a control character (Latin-1), or Ã– (Windows-1252)
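The mapping above can be reproduced with the standard library alone. This is a minimal sketch; the helper name `mangle` is made up for illustration and is not part of any package.

```python
def mangle(text: str, wrong_codec: str) -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


# Print each character, its UTF-8 bytes, and both mis-decoded forms.
# repr() is used so that the Latin-1 control characters stay visible.
for ch in "äöÄÖ":
    raw = ch.encode("utf-8")
    print(ch, raw, repr(mangle(ch, "latin-1")), repr(mangle(ch, "cp1252")))
```

Latin-1 maps the bytes 0x84 and 0x96 to C1 control characters, while Windows-1252 maps them to the low double quote and the en dash, which is why the upper-case letters have two possible mangled forms.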
The characters Ã¤¶„ do not appear in normal Finnish text. Ã could possibly appear in foreign names, but even then it would be very unlikely in the middle of a word. ¤ is an obscure "currency sign" character, whose code point Latin-9 (ISO-8859-15) reassigned to the euro sign; the euro sign does occur in Finnish text, but would still be very unlikely in the combination Ã€. (The pilcrow might appear in some typography text and the lowered quote might appear in old-fashioned literature. The en dash is normal.)
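To illustrate the reasoning, here is a rough sketch of a check that looks for the telltale two-character sequences and tries to undo the mis-decoding. It is only a heuristic written for this report, not how charset-normalizer works internally; `MARKERS`, `looks_like_mangled_utf8` and `repair` are hypothetical names.

```python
# Mangled forms of ä, ö, Ä, Ö under both wrong codecs (hypothetical marker set).
MARKERS = {
    ch.encode("utf-8").decode(codec)
    for ch in "äöÄÖ"
    for codec in ("latin-1", "cp1252")
}


def looks_like_mangled_utf8(text: str) -> bool:
    """True if the text contains any sequence that is implausible in Finnish."""
    return any(marker in text for marker in MARKERS)


def repair(text: str) -> str:
    """Undo the mis-decoding when the round trip is clean, else return text as-is."""
    if not looks_like_mangled_utf8(text):
        return text
    for codec in ("cp1252", "latin-1"):
        try:
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the text, try the next one
    return text
```

Trying Windows-1252 first and Latin-1 second covers both columns of the mapping, since the control characters produced by Latin-1 cannot be encoded as Windows-1252.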
Desktop (please complete the following information):
OS: macOS 14.7
Python version: 3.12.6
Package version: 3.3.2
Additional context
My guess is that this kind of thing happens when someone set up a CMS in the 1990s when Finnish text was commonly encoded in Latin-1 or Windows-1252, and later the data store was changed to use UTF-8 but the meta tags were neglected.
Notice
I hereby announce that my raw input is not :
Provide the file
An accessible way of retrieving the file concerned. Host it somewhere with untouched encoding.
https://jouniseppanen.fi/tmp/finnish-utf-8-latin-1-confusion.html
(Note that the web server adds a content type of text/html; charset=utf-8 which is correct, so your browser will likely show the text correctly.)
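If the guess about a stale meta tag is right, the mismatch can be spotted from the raw bytes without any detector. The following is a sketch under that assumption; I have not shown what the linked file actually declares, and the regular expression and function names are made up for illustration.

```python
import re
from typing import Optional

# Hypothetical, simplified pattern: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def declared_charset(html: bytes) -> Optional[str]:
    """Return the charset named in a meta tag near the top of the page, if any."""
    match = META_CHARSET.search(html[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def declaration_is_suspect(html: bytes) -> bool:
    """True when a non-UTF-8 charset is declared but the bytes are valid non-ASCII UTF-8."""
    declared = declared_charset(html)
    if declared is None or declared in ("utf-8", "utf8"):
        return False
    return is_valid_utf8(html) and not html.isascii()
```

Text that really is Latin-1 and contains ä or ö almost never happens to be valid UTF-8, so a page that passes `is_valid_utf8` while declaring a legacy charset is a strong hint that the declaration is wrong.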