jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[DETECTION] latin1 text misdetected as mac_latin2 #477

Closed (nijel closed this issue 3 months ago)

nijel commented 3 months ago

Notice I hereby announce that my raw input is not :

Provide the file: test.zip

Verbose output: Using the CLI, run normalizer -v ./my-file.txt and paste the result in here.

$ normalizer -v test.txt 
2024-05-22 10:59:29,611 | Level 5 | override steps (5) and chunk_size (512) as content does not fit (42 byte(s) given) parameters.
2024-05-22 10:59:29,611 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xfc in position 18: ordinal not in range(128)
2024-05-22 10:59:29,611 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xfc in position 18: invalid start byte
2024-05-22 10:59:29,612 | Level 5 | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,612 | Level 5 | Code page big5hkscs is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2024-05-22 10:59:29,613 | Level 5 | big5hkscs passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,613 | Level 5 | big5hkscs should target any language(s) of ['Chinese']
2024-05-22 10:59:29,613 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 490.200000 %.
2024-05-22 10:59:29,614 | Level 5 | cp1006 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,617 | Level 5 | cp1006 should target any language(s) of ['Farsi', 'Arabic']
2024-05-22 10:59:29,617 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-05-22 10:59:29,618 | Level 5 | cp1125 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,618 | Level 5 | cp1125 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,619 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-05-22 10:59:29,619 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,619 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,622 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using cp1250
2024-05-22 10:59:29,622 | Level 5 | cp1251 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,623 | Level 5 | cp1251 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,623 | Level 5 | cp1252 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,623 | Level 5 | cp1252 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,624 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using cp1252
2024-05-22 10:59:29,624 | Level 5 | cp1253 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,624 | Level 5 | cp1253 should target any language(s) of ['Greek']
2024-05-22 10:59:29,624 | Level 5 | cp1254 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,625 | Level 5 | cp1254 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,625 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using cp1254
2024-05-22 10:59:29,625 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 18: character maps to <undefined>
2024-05-22 10:59:29,625 | Level 5 | cp1256 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,625 | Level 5 | cp1256 should target any language(s) of ['Farsi', 'Arabic']
2024-05-22 10:59:29,626 | Level 5 | cp1257 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,626 | Level 5 | cp1257 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,626 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using cp1257
2024-05-22 10:59:29,626 | Level 5 | cp1258 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,626 | Level 5 | cp1258 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,626 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using cp1258
2024-05-22 10:59:29,626 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-05-22 10:59:29,627 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x70 in position 1: character maps to <undefined>
2024-05-22 10:59:29,627 | Level 5 | cp437 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,628 | Level 5 | cp437 should target any language(s) of ['Greek']
2024-05-22 10:59:29,628 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-05-22 10:59:29,628 | Level 5 | cp720 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,629 | Level 5 | cp720 should target any language(s) of ['Farsi', 'Arabic']
2024-05-22 10:59:29,629 | Level 5 | cp737 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,630 | Level 5 | cp737 should target any language(s) of ['Greek']
2024-05-22 10:59:29,631 | Level 5 | cp775 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,631 | Level 5 | cp775 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,632 | Level 5 | We detected language [('Dutch', 0.625), ('Swedish', 0.625), ('Norwegian', 0.625), ('Polish', 0.625), ('Romanian', 0.625), ('English', 0.5625), ('Italian', 0.5625), ('Danish', 0.5625), ('Slovak', 0.5625), ('Indonesian', 0.5), ('Croatian', 0.5), ('French', 0.5), ('Turkish', 0.5), ('Czech', 0.5), ('Finnish', 0.4375), ('Slovene', 0.4375), ('Spanish', 0.4375), ('Portuguese', 0.4375), ('Lithuanian', 0.4375), ('German', 0.375), ('Estonian', 0.375), ('Hungarian', 0.3125), ('Vietnamese', 0.1875)] using cp775
2024-05-22 10:59:29,633 | Level 5 | cp850 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,633 | Level 5 | cp850 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,633 | Level 5 | We detected language [('Dutch', 0.625), ('Swedish', 0.625), ('Norwegian', 0.625), ('Polish', 0.625), ('Romanian', 0.625), ('English', 0.5625), ('Italian', 0.5625), ('Danish', 0.5625), ('Slovak', 0.5625), ('Indonesian', 0.5), ('Croatian', 0.5), ('French', 0.5), ('Turkish', 0.5), ('Czech', 0.5), ('Finnish', 0.4375), ('Slovene', 0.4375), ('Spanish', 0.4375), ('Portuguese', 0.4375), ('Lithuanian', 0.4375), ('German', 0.375), ('Estonian', 0.375), ('Hungarian', 0.3125), ('Vietnamese', 0.1875)] using cp850
2024-05-22 10:59:29,634 | Level 5 | cp852 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,634 | Level 5 | cp852 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,635 | Level 5 | We detected language [('English', 0.5882), ('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Slovene', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Indonesian', 0.4706), ('Czech', 0.4706), ('Turkish', 0.4706), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Lithuanian', 0.4118), ('German', 0.3529), ('Estonian', 0.3529), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using cp852
2024-05-22 10:59:29,636 | Level 5 | cp855 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,636 | Level 5 | cp855 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,636 | Level 5 | cp856 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,637 | Level 5 | cp856 should target any language(s) of ['Hebrew']
2024-05-22 10:59:29,637 | Level 5 | cp857 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,637 | Level 5 | cp857 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,637 | Level 5 | We detected language [('Dutch', 0.625), ('Swedish', 0.625), ('Norwegian', 0.625), ('Polish', 0.625), ('Romanian', 0.625), ('English', 0.5625), ('Italian', 0.5625), ('Danish', 0.5625), ('Slovak', 0.5625), ('Indonesian', 0.5), ('Croatian', 0.5), ('French', 0.5), ('Turkish', 0.5), ('Czech', 0.5), ('Finnish', 0.4375), ('Slovene', 0.4375), ('Spanish', 0.4375), ('Portuguese', 0.4375), ('Lithuanian', 0.4375), ('German', 0.375), ('Estonian', 0.375), ('Hungarian', 0.3125), ('Vietnamese', 0.1875)] using cp857
2024-05-22 10:59:29,638 | Level 5 | cp858 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,638 | Level 5 | cp858 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,638 | Level 5 | We detected language [('Dutch', 0.625), ('Swedish', 0.625), ('Norwegian', 0.625), ('Polish', 0.625), ('Romanian', 0.625), ('English', 0.5625), ('Italian', 0.5625), ('Danish', 0.5625), ('Slovak', 0.5625), ('Indonesian', 0.5), ('Croatian', 0.5), ('French', 0.5), ('Turkish', 0.5), ('Czech', 0.5), ('Finnish', 0.4375), ('Slovene', 0.4375), ('Spanish', 0.4375), ('Portuguese', 0.4375), ('Lithuanian', 0.4375), ('German', 0.375), ('Estonian', 0.375), ('Hungarian', 0.3125), ('Vietnamese', 0.1875)] using cp858
2024-05-22 10:59:29,638 | Level 5 | cp860 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,639 | Level 5 | cp860 should target any language(s) of ['Greek']
2024-05-22 10:59:29,639 | Level 5 | cp861 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,639 | Level 5 | cp861 should target any language(s) of ['Greek']
2024-05-22 10:59:29,640 | Level 5 | cp862 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,640 | Level 5 | cp862 should target any language(s) of ['Hebrew']
2024-05-22 10:59:29,640 | Level 5 | cp863 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,641 | Level 5 | cp863 should target any language(s) of ['Greek']
2024-05-22 10:59:29,641 | Level 5 | cp864 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,642 | Level 5 | cp864 should target any language(s) of ['Farsi', 'Arabic']
2024-05-22 10:59:29,642 | Level 5 | cp865 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,642 | Level 5 | cp865 should target any language(s) of ['Greek']
2024-05-22 10:59:29,643 | Level 5 | cp866 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,643 | Level 5 | cp866 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,643 | Level 5 | cp869 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,644 | Level 5 | cp869 should target any language(s) of ['Greek']
2024-05-22 10:59:29,644 | Level 5 | Code page cp874 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 18: character maps to <undefined>
2024-05-22 10:59:29,645 | Level 5 | cp875 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 128.900000 %.
2024-05-22 10:59:29,645 | Level 5 | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,646 | Level 5 | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,646 | Level 5 | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,646 | Level 5 | Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,647 | Level 5 | Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,647 | Level 5 | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,647 | Level 5 | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,648 | Level 5 | Code page gb18030 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2024-05-22 10:59:29,648 | Level 5 | gb18030 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,648 | Level 5 | gb18030 should target any language(s) of ['Chinese']
2024-05-22 10:59:29,648 | Level 5 | Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,649 | Level 5 | Code page gbk is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2024-05-22 10:59:29,649 | Level 5 | gbk passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,649 | Level 5 | gbk should target any language(s) of ['Chinese']
2024-05-22 10:59:29,649 | Level 5 | hp_roman8 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,649 | Level 5 | hp_roman8 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,650 | Level 5 | We detected language [('Dutch', 0.625), ('Swedish', 0.625), ('Norwegian', 0.625), ('Polish', 0.625), ('Romanian', 0.625), ('English', 0.5625), ('Italian', 0.5625), ('Danish', 0.5625), ('Slovak', 0.5625), ('Indonesian', 0.5), ('Croatian', 0.5), ('French', 0.5), ('Turkish', 0.5), ('Czech', 0.5), ('Finnish', 0.4375), ('Slovene', 0.4375), ('Spanish', 0.4375), ('Portuguese', 0.4375), ('Lithuanian', 0.4375), ('German', 0.375), ('Estonian', 0.375), ('Hungarian', 0.3125), ('Vietnamese', 0.1875)] using hp_roman8
2024-05-22 10:59:29,651 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,651 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,652 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,652 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,652 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,653 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,653 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,653 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,654 | Level 5 | iso8859_10 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,654 | Level 5 | iso8859_10 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,654 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using iso8859_10
2024-05-22 10:59:29,654 | Level 5 | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 18: character maps to <undefined>
2024-05-22 10:59:29,655 | Level 5 | iso8859_13 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,655 | Level 5 | iso8859_13 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,655 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using iso8859_13
2024-05-22 10:59:29,655 | Level 5 | iso8859_14 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,656 | Level 5 | iso8859_14 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,656 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using iso8859_14
2024-05-22 10:59:29,656 | Level 5 | iso8859_15 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,657 | Level 5 | iso8859_15 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,657 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using iso8859_15
2024-05-22 10:59:29,657 | Level 5 | iso8859_16 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,658 | Level 5 | iso8859_16 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,658 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using iso8859_16
2024-05-22 10:59:29,658 | Level 5 | iso8859_2 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,659 | Level 5 | iso8859_2 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,659 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using iso8859_2
2024-05-22 10:59:29,659 | Level 5 | iso8859_3 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,659 | Level 5 | iso8859_3 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,659 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using iso8859_3
2024-05-22 10:59:29,660 | Level 5 | iso8859_4 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,660 | Level 5 | iso8859_4 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,660 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using iso8859_4
2024-05-22 10:59:29,661 | Level 5 | iso8859_5 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,661 | Level 5 | iso8859_5 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,662 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 18: character maps to <undefined>
2024-05-22 10:59:29,662 | Level 5 | iso8859_7 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,662 | Level 5 | iso8859_7 should target any language(s) of ['Greek']
2024-05-22 10:59:29,663 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 18: character maps to <undefined>
2024-05-22 10:59:29,663 | Level 5 | iso8859_9 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,663 | Level 5 | iso8859_9 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,663 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using iso8859_9
2024-05-22 10:59:29,664 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,664 | Level 5 | koi8_r passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,665 | Level 5 | koi8_r should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,666 | Level 5 | koi8_t passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,666 | Level 5 | koi8_t should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,666 | Level 5 | koi8_u passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,667 | Level 5 | koi8_u should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,667 | Level 5 | kz1048 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,668 | Level 5 | kz1048 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,669 | Level 5 | latin_1 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,669 | Level 5 | latin_1 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,669 | Level 5 | We detected language [('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Polish', 0.5882), ('Romanian', 0.5882), ('Turkish', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Estonian', 0.4706), ('Czech', 0.4706), ('German', 0.4118), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using latin_1
2024-05-22 10:59:29,669 | Level 5 | mac_cyrillic passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,670 | Level 5 | mac_cyrillic should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,671 | Level 5 | mac_greek passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,671 | Level 5 | mac_greek should target any language(s) of ['Greek']
2024-05-22 10:59:29,672 | Level 5 | mac_iceland passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,672 | Level 5 | mac_iceland should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,674 | Level 5 | We detected language [('Dutch', 0.625), ('Swedish', 0.625), ('Norwegian', 0.625), ('Polish', 0.625), ('Romanian', 0.625), ('English', 0.5625), ('Italian', 0.5625), ('Danish', 0.5625), ('Slovak', 0.5625), ('Indonesian', 0.5), ('Croatian', 0.5), ('French', 0.5), ('Turkish', 0.5), ('Czech', 0.5), ('Finnish', 0.4375), ('Slovene', 0.4375), ('Spanish', 0.4375), ('Portuguese', 0.4375), ('Lithuanian', 0.4375), ('German', 0.375), ('Estonian', 0.375), ('Hungarian', 0.3125), ('Vietnamese', 0.1875)] using mac_iceland
2024-05-22 10:59:29,674 | Level 5 | mac_latin2 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,675 | Level 5 | mac_latin2 should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,676 | Level 5 | We detected language [('Polish', 0.6471), ('English', 0.5882), ('Dutch', 0.5882), ('Italian', 0.5882), ('Swedish', 0.5882), ('Norwegian', 0.5882), ('Danish', 0.5882), ('Romanian', 0.5882), ('Slovene', 0.5294), ('Croatian', 0.5294), ('Slovak', 0.5294), ('Indonesian', 0.4706), ('Turkish', 0.4706), ('Czech', 0.4706), ('Finnish', 0.4118), ('French', 0.4118), ('Portuguese', 0.4118), ('Lithuanian', 0.4118), ('German', 0.3529), ('Estonian', 0.3529), ('Spanish', 0.3529), ('Hungarian', 0.2941), ('Vietnamese', 0.1765)] using mac_latin2
2024-05-22 10:59:29,677 | Level 5 | mac_roman passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,677 | Level 5 | mac_roman should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,677 | Level 5 | We detected language [('Dutch', 0.625), ('Swedish', 0.625), ('Norwegian', 0.625), ('Polish', 0.625), ('Romanian', 0.625), ('English', 0.5625), ('Italian', 0.5625), ('Danish', 0.5625), ('Slovak', 0.5625), ('Indonesian', 0.5), ('Croatian', 0.5), ('French', 0.5), ('Turkish', 0.5), ('Czech', 0.5), ('Finnish', 0.4375), ('Slovene', 0.4375), ('Spanish', 0.4375), ('Portuguese', 0.4375), ('Lithuanian', 0.4375), ('German', 0.375), ('Estonian', 0.375), ('Hungarian', 0.3125), ('Vietnamese', 0.1875)] using mac_roman
2024-05-22 10:59:29,677 | Level 5 | mac_turkish passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,678 | Level 5 | mac_turkish should target any language(s) of ['Latin Based']
2024-05-22 10:59:29,678 | Level 5 | We detected language [('Dutch', 0.625), ('Swedish', 0.625), ('Norwegian', 0.625), ('Polish', 0.625), ('Romanian', 0.625), ('English', 0.5625), ('Italian', 0.5625), ('Danish', 0.5625), ('Slovak', 0.5625), ('Indonesian', 0.5), ('Croatian', 0.5), ('French', 0.5), ('Turkish', 0.5), ('Czech', 0.5), ('Finnish', 0.4375), ('Slovene', 0.4375), ('Spanish', 0.4375), ('Portuguese', 0.4375), ('Lithuanian', 0.4375), ('German', 0.375), ('Estonian', 0.375), ('Hungarian', 0.3125), ('Vietnamese', 0.1875)] using mac_turkish
2024-05-22 10:59:29,678 | Level 5 | ptcp154 passed initial chaos probing. Mean measured chaos is 9.500000 %
2024-05-22 10:59:29,678 | Level 5 | ptcp154 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-05-22 10:59:29,679 | Level 5 | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xfc in position 18: illegal multibyte sequence
2024-05-22 10:59:29,679 | Level 5 | Code page shift_jis_2004 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2024-05-22 10:59:29,679 | Level 5 | shift_jis_2004 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,679 | Level 5 | shift_jis_2004 should target any language(s) of ['Japanese']
2024-05-22 10:59:29,679 | Level 5 | Code page shift_jisx0213 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2024-05-22 10:59:29,679 | Level 5 | shift_jisx0213 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,679 | Level 5 | shift_jisx0213 should target any language(s) of ['Japanese']
2024-05-22 10:59:29,680 | Level 5 | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfc in position 18: character maps to <undefined>
2024-05-22 10:59:29,680 | Level 5 | Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2024-05-22 10:59:29,680 | Level 5 | Code page utf_16_be is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2024-05-22 10:59:29,680 | Level 5 | utf_16_be passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-05-22 10:59:29,680 | Level 5 | Code page utf_16_le is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2024-05-22 10:59:29,681 | Level 5 | utf_16_le was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 72.700000 %.
2024-05-22 10:59:29,681 | Level 5 | Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2024-05-22 10:59:29,681 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2024-05-22 10:59:29,681 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2024-05-22 10:59:29,681 | Level 5 | Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
2024-05-22 10:59:29,681 | DEBUG | Encoding detection: Found mac_latin2 as plausible (best-candidate) for content. With 20 alternatives.
{
    "path": "/home/nijel/work/python-debian/test.txt",
    "encoding": "mac_latin2",
    "encoding_aliases": [
        "maccentraleurope",
        "mac_centeuro",
        "maclatin2"
    ],
    "alternative_encodings": [],
    "language": "Polish",
    "alphabets": [
        "Basic Latin",
        "Latin Extended-A"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 64.71,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

latin1 would probably be the best fit; chardet reports this as ISO-8859-9, which works as well.

Additional context

Discovered when trying to port python-debian from chardet to charset_normalizer; the change fails the test suite: https://salsa.debian.org/nijel/python-debian/-/jobs/5758249

I know the text is short, but it's challenging to sell migration to a different library when it breaks existing tests.

Ousret commented 3 months ago

I'll see what I can do. Can't promise anything.

Ousret commented 3 months ago

Unfortunately, with the given content, I am unable to find anything that would help weigh in toward the right direction. If you happen to know of any "language" theory that would help in this case, I'll look into it.

As you are already aware, charset-normalizer handles 90+ encodings, and we absolutely want to avoid adding hardcoded logic like (latin1 > mac_latin).

But maybe we could apply this logic to tiny sequences only (e.g. to the order of presented results).

Looking at the CI, both decoded strings (mac_latin & latin) are perfectly valid, and both actually exist if you search for these terms.

regards,
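
To illustrate why both decodings look valid, here is a minimal sketch using only Python's built-in codecs. The byte string is a hypothetical stand-in for the ambiguous word (the actual test.txt is not reproduced in this thread); only the 0xFC byte is taken from the log above.

```python
# Hypothetical stand-in for the ambiguous word; only the 0xFC byte comes from the log.
raw = b"K\xfcster"

# The same bytes decode without error under all three single-byte code pages.
as_latin1 = raw.decode("latin-1")         # 0xFC -> U+00FC (u with diaeresis)
as_iso8859_9 = raw.decode("iso8859-9")    # same mapping for this byte
as_mac_latin2 = raw.decode("mac_latin2")  # 0xFC -> U+0141 (L with stroke)

for name, text in (
    ("latin-1", as_latin1),
    ("iso8859-9", as_iso8859_9),
    ("mac_latin2", as_mac_latin2),
):
    print(f"{name:<12} {text!r}")
```

None of the three calls raises, so decodability alone cannot separate the candidates; the detector has to fall back on its chaos and coherence scores.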

nijel commented 3 months ago

You most likely wouldn't write KŁster, but Kłster. I have no clue whether you can somehow weigh such things as expected upper-casing.
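
The casing idea could be sketched roughly as follows. This is a hypothetical heuristic for illustration only, not something charset_normalizer implements; the function name `has_unexpected_uppercase` is made up here.

```python
def has_unexpected_uppercase(word: str) -> bool:
    """Flag an uppercase letter after the first character of a word
    that is not written entirely in capitals (hypothetical heuristic)."""
    if len(word) < 2 or word.isupper():
        return False
    return any(ch.isupper() for ch in word[1:])


# The mac_latin2 reading is flagged, the latin1 reading is not.
for candidate in ("K\u0141ster", "K\u00fcster", "K\u00dcSTER"):
    print(repr(candidate), has_unexpected_uppercase(candidate))
```

A check like this would also flag legitimate mixed-case names such as "McDonald", so it could only serve as a weak tie-breaker on very short inputs.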

Ousret commented 3 months ago

OK, so we're most likely trapped on this one. The fastest way to convince your upstream project is to use the main API (https://charset-normalizer.readthedocs.io/en/latest/user/advanced_search.html) and restrict the supported encodings to those of chardet.

    cp_isolation=None,  # Finite list of encoding to use when searching for a match

This should fix the issue; if so, feel free to close it.

regards,
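
A minimal sketch of that suggestion, assuming charset_normalizer is installed. The payload is a made-up sample (not the file from this report), and the candidate list is the one quoted further down in this thread; which candidate wins for such a short input is not guaranteed.

```python
from charset_normalizer import from_bytes

# Made-up sample: one non-ASCII byte (0xFC) in an otherwise ASCII line.
payload = "Maintainer: Hans K\u00fcster\n".encode("latin-1")

# Only these code pages are considered during detection.
candidates = [f"iso-8859-{n}" for n in (1, 2, 7, 8, 9)]

best = from_bytes(payload, cp_isolation=candidates).best()

if best is None:
    print("no candidate fit the payload")
else:
    # str(best) yields the payload decoded with the detected encoding.
    print(best.encoding, repr(str(best)))
```

Because mac_latin2 is not in the list, it can no longer be reported, which is the point of the restriction.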

nijel commented 3 months ago

Thanks for the suggestion; I've given it a try at https://salsa.debian.org/python-debian-team/python-debian/-/merge_requests/135

Ousret commented 3 months ago

Great.

            cp_isolation=[f"iso-8859-{n}" for n in (1, 2, 7, 8, 9)]

Don't forget to add "ascii", "utf_8", "utf_16", ... the basic ones.

nijel commented 3 months ago

The text is first tried as utf-8, which covers both ascii and utf-8 before anything is passed to charset_normalizer. Mixing utf-16 into an ascii-like file seems unlikely to me.
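
The UTF-8-first flow described here could look roughly like the sketch below. It is not the python-debian code: `decode_text` and the `detect` callback are hypothetical names, and the final latin-1 fallback is an assumption added so that the function always returns a string.

```python
from typing import Callable, Optional


def decode_text(data: bytes, detect: Callable[[bytes], Optional[str]]) -> str:
    """Decode as UTF-8 when possible, otherwise ask `detect` for an encoding."""
    try:
        return data.decode("utf-8")  # also covers pure ASCII input
    except UnicodeDecodeError:
        pass
    # Assumption: fall back to latin-1 (which never fails) when detection gives up.
    encoding = detect(data) or "latin-1"
    return data.decode(encoding)


# Stub detector standing in for a real call into a detection library.
print(repr(decode_text(b"plain ascii", lambda _data: None)))
print(repr(decode_text(b"K\xfcster", lambda _data: "iso8859-9")))
```

The detector only ever sees input that failed UTF-8 decoding, so "ascii" and "utf_8" do not need to appear in its candidate list.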