jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[DETECTION] Mostly UTF-8 html page detects as gb18030 #516

Closed KaraKaraWitch closed 1 month ago

KaraKaraWitch commented 2 months ago

Notice I hereby announce that my raw input is not :

File

https://files.catbox.moe/h3bf02.html

Alternatively, it may be downloaded from archive.org since it's from CommonCrawl: https://web.archive.org/web/20240302062735im_/https://gaming.lenovo.com/emea/members/120723-Deminy?s=0e175b5c146036655dd127866a5a7999

Verbose output

<REDACTED>@<REDACTED>$ normalizer -v View\ Profile_\ -\ Legion\ Gaming\ Community.html 
2024-08-29 16:07:57,102 | Level 5 | Detected declarative mark in sequence. Priority +1 given for utf_8.
2024-08-29 16:07:57,103 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode bytes in position 107206-107207: invalid continuation byte
2024-08-29 16:07:57,103 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xc3 in position 25290: ordinal not in range(128)
2024-08-29 16:07:57,104 | Level 5 | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xc5 in position 25567: illegal multibyte sequence
2024-08-29 16:07:57,105 | Level 5 | Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0xc5 in position 25567: illegal multibyte sequence
2024-08-29 16:07:57,106 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 693.650000 %.
2024-08-29 16:07:57,109 | Level 5 | cp1006 passed initial chaos probing. Mean measured chaos is 2.333000 %
2024-08-29 16:07:57,113 | Level 5 | cp1006 should target any language(s) of ['Farsi', 'Arabic']
2024-08-29 16:07:57,115 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-08-29 16:07:57,117 | Level 5 | cp1125 passed initial chaos probing. Mean measured chaos is 1.667000 %
2024-08-29 16:07:57,117 | Level 5 | cp1125 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-08-29 16:07:57,120 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-08-29 16:07:57,120 | Level 5 | Code page cp1250 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x83 in position 25669: character maps to <undefined>
2024-08-29 16:07:57,121 | Level 5 | Code page cp1251 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 46325: character maps to <undefined>
2024-08-29 16:07:57,121 | Level 5 | Code page cp1252 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 25671: character maps to <undefined>
2024-08-29 16:07:57,122 | Level 5 | Code page cp1253 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 25671: character maps to <undefined>
2024-08-29 16:07:57,122 | Level 5 | Code page cp1254 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 25671: character maps to <undefined>
2024-08-29 16:07:57,122 | Level 5 | Code page cp1255 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 25671: character maps to <undefined>
2024-08-29 16:07:57,124 | Level 5 | cp1256 passed initial chaos probing. Mean measured chaos is 0.483000 %
2024-08-29 16:07:57,124 | Level 5 | cp1256 should target any language(s) of ['Farsi', 'Arabic']
2024-08-29 16:07:57,126 | Level 5 | Code page cp1257 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x83 in position 25669: character maps to <undefined>
2024-08-29 16:07:57,126 | Level 5 | Code page cp1258 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x81 in position 25671: character maps to <undefined>
2024-08-29 16:07:57,127 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-08-29 16:07:57,127 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x72 in position 25: character maps to <undefined>
2024-08-29 16:07:57,129 | Level 5 | cp437 passed initial chaos probing. Mean measured chaos is 1.683000 %
2024-08-29 16:07:57,129 | Level 5 | cp437 should target any language(s) of ['Greek']
2024-08-29 16:07:57,131 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-08-29 16:07:57,133 | Level 5 | cp720 passed initial chaos probing. Mean measured chaos is 2.217000 %
2024-08-29 16:07:57,134 | Level 5 | cp720 should target any language(s) of ['Farsi', 'Arabic']
2024-08-29 16:07:57,138 | Level 5 | cp737 passed initial chaos probing. Mean measured chaos is 1.600000 %
2024-08-29 16:07:57,138 | Level 5 | cp737 should target any language(s) of ['Greek']
2024-08-29 16:07:57,142 | Level 5 | cp775 passed initial chaos probing. Mean measured chaos is 2.083000 %
2024-08-29 16:07:57,142 | Level 5 | cp775 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,149 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5966), ('Danish', 0.5763), ('Swedish', 0.5521), ('Slovene', 0.5455), ('Italian', 0.5433), ('Estonian', 0.54), ('Finnish', 0.5363), ('Czech', 0.5361), ('Dutch', 0.5348), ('Hungarian', 0.5125), ('French', 0.5121), ('Indonesian', 0.5103), ('Spanish', 0.4981), ('German', 0.4702), ('Romanian', 0.4614), ('Portuguese', 0.4556), ('Slovak', 0.4544), ('Croatian', 0.4383), ('Polish', 0.4312), ('Turkish', 0.4249), ('Lithuanian', 0.4077), ('Vietnamese', 0.3714)] using cp775
2024-08-29 16:07:57,153 | Level 5 | cp850 passed initial chaos probing. Mean measured chaos is 2.067000 %
2024-08-29 16:07:57,154 | Level 5 | cp850 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,156 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5708), ('Danish', 0.5504), ('Slovene', 0.5455), ('Italian', 0.5318), ('Swedish', 0.5273), ('Finnish', 0.5262), ('Estonian', 0.5227), ('French', 0.5172), ('Dutch', 0.517), ('Indonesian', 0.5103), ('Hungarian', 0.5032), ('Czech', 0.5028), ('Spanish', 0.4798), ('Portuguese', 0.4592), ('German', 0.4528), ('Romanian', 0.4526), ('Polish', 0.4379), ('Slovak', 0.4287), ('Croatian', 0.4234), ('Turkish', 0.416), ('Lithuanian', 0.4077), ('Vietnamese', 0.3577)] using cp850
2024-08-29 16:07:57,160 | Level 5 | cp852 passed initial chaos probing. Mean measured chaos is 2.067000 %
2024-08-29 16:07:57,160 | Level 5 | cp852 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,163 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5708), ('Danish', 0.5504), ('Slovene', 0.5455), ('Italian', 0.5318), ('Swedish', 0.5273), ('Finnish', 0.5262), ('Estonian', 0.5227), ('French', 0.5172), ('Dutch', 0.517), ('Indonesian', 0.5103), ('Hungarian', 0.5032), ('Czech', 0.5028), ('Spanish', 0.4798), ('Portuguese', 0.4592), ('German', 0.4528), ('Romanian', 0.4526), ('Croatian', 0.4451), ('Slovak', 0.443), ('Polish', 0.4379), ('Turkish', 0.416), ('Lithuanian', 0.4077), ('Vietnamese', 0.3577)] using cp852
2024-08-29 16:07:57,168 | Level 5 | cp855 passed initial chaos probing. Mean measured chaos is 2.200000 %
2024-08-29 16:07:57,168 | Level 5 | cp855 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-08-29 16:07:57,171 | Level 5 | Code page cp856 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa7 in position 25408: character maps to <undefined>
2024-08-29 16:07:57,174 | Level 5 | cp857 passed initial chaos probing. Mean measured chaos is 2.133000 %
2024-08-29 16:07:57,174 | Level 5 | cp857 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,177 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5708), ('Danish', 0.5504), ('Slovene', 0.5455), ('Italian', 0.5318), ('Swedish', 0.5273), ('Finnish', 0.5262), ('Estonian', 0.5227), ('French', 0.5172), ('Dutch', 0.517), ('Indonesian', 0.5103), ('Hungarian', 0.5032), ('Czech', 0.5028), ('Spanish', 0.4798), ('Portuguese', 0.4592), ('German', 0.4528), ('Romanian', 0.4526), ('Polish', 0.4379), ('Slovak', 0.4287), ('Croatian', 0.4234), ('Turkish', 0.4234), ('Lithuanian', 0.4077), ('Vietnamese', 0.3577)] using cp857
2024-08-29 16:07:57,180 | Level 5 | cp858 passed initial chaos probing. Mean measured chaos is 2.067000 %
2024-08-29 16:07:57,181 | Level 5 | cp858 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,181 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5708), ('Danish', 0.5504), ('Slovene', 0.5455), ('Italian', 0.5318), ('Swedish', 0.5273), ('Finnish', 0.5262), ('Estonian', 0.5227), ('French', 0.5172), ('Dutch', 0.517), ('Indonesian', 0.5103), ('Hungarian', 0.5032), ('Czech', 0.5028), ('Spanish', 0.4798), ('Portuguese', 0.4592), ('German', 0.4528), ('Romanian', 0.4526), ('Polish', 0.4379), ('Slovak', 0.4287), ('Croatian', 0.4234), ('Turkish', 0.416), ('Lithuanian', 0.4077), ('Vietnamese', 0.3577)] using cp858
2024-08-29 16:07:57,184 | Level 5 | cp860 passed initial chaos probing. Mean measured chaos is 2.000000 %
2024-08-29 16:07:57,184 | Level 5 | cp860 should target any language(s) of ['Greek']
2024-08-29 16:07:57,189 | Level 5 | cp861 passed initial chaos probing. Mean measured chaos is 1.833000 %
2024-08-29 16:07:57,190 | Level 5 | cp861 should target any language(s) of ['Greek']
2024-08-29 16:07:57,196 | Level 5 | cp862 passed initial chaos probing. Mean measured chaos is 1.867000 %
2024-08-29 16:07:57,196 | Level 5 | cp862 should target any language(s) of ['Hebrew']
2024-08-29 16:07:57,202 | Level 5 | cp863 passed initial chaos probing. Mean measured chaos is 2.033000 %
2024-08-29 16:07:57,203 | Level 5 | cp863 should target any language(s) of ['Greek']
2024-08-29 16:07:57,207 | Level 5 | Code page cp864 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa7 in position 25408: character maps to <undefined>
2024-08-29 16:07:57,208 | Level 5 | cp865 passed initial chaos probing. Mean measured chaos is 1.683000 %
2024-08-29 16:07:57,209 | Level 5 | cp865 should target any language(s) of ['Greek']
2024-08-29 16:07:57,214 | Level 5 | cp866 passed initial chaos probing. Mean measured chaos is 1.667000 %
2024-08-29 16:07:57,214 | Level 5 | cp866 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-08-29 16:07:57,216 | Level 5 | Code page cp869 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x82 in position 25568: character maps to <undefined>
2024-08-29 16:07:57,217 | Level 5 | Code page cp874 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x82 in position 25568: character maps to <undefined>
2024-08-29 16:07:57,217 | Level 5 | cp875 was excluded because of initial chaos probing. Gave up 2 time(s). Computed mean chaos is 400.000000 %.
2024-08-29 16:07:57,218 | Level 5 | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0x87 in position 25577: illegal multibyte sequence
2024-08-29 16:07:57,219 | Level 5 | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xd0 in position 25666: illegal multibyte sequence
2024-08-29 16:07:57,219 | Level 5 | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xc5 in position 25567: illegal multibyte sequence
2024-08-29 16:07:57,220 | Level 5 | Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0xc5 in position 25567: illegal multibyte sequence
2024-08-29 16:07:57,220 | Level 5 | Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0xc5 in position 25567: illegal multibyte sequence
2024-08-29 16:07:57,220 | Level 5 | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xc5 in position 25567: illegal multibyte sequence
2024-08-29 16:07:57,221 | Level 5 | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xc5 in position 25567: illegal multibyte sequence
2024-08-29 16:07:57,222 | Level 5 | Code page gb18030 is a multi byte encoding table and it appear that at least one character was encoded using n-bytes.
2024-08-29 16:07:57,224 | Level 5 | gb18030 passed initial chaos probing. Mean measured chaos is 0.320000 %
2024-08-29 16:07:57,224 | Level 5 | gb18030 should target any language(s) of ['Chinese']
2024-08-29 16:07:57,229 | Level 5 | Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0xc5 in position 25567: illegal multibyte sequence
2024-08-29 16:07:57,229 | Level 5 | Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xa7 in position 40551: illegal multibyte sequence
2024-08-29 16:07:57,231 | Level 5 | hp_roman8 passed initial chaos probing. Mean measured chaos is 2.083000 %
2024-08-29 16:07:57,231 | Level 5 | hp_roman8 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,234 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5997), ('Danish', 0.5646), ('Italian', 0.5615), ('Czech', 0.5532), ('Spanish', 0.5474), ('Slovene', 0.5455), ('Estonian', 0.5423), ('Dutch', 0.5382), ('Finnish', 0.5315), ('French', 0.5302), ('Swedish', 0.5257), ('Indonesian', 0.5103), ('Hungarian', 0.5074), ('Slovak', 0.4872), ('Romanian', 0.4706), ('Portuguese', 0.4693), ('German', 0.4668), ('Croatian', 0.4429), ('Polish', 0.4421), ('Turkish', 0.4273), ('Lithuanian', 0.4077), ('Vietnamese', 0.395)] using hp_roman8
2024-08-29 16:07:57,239 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xc3 in position 25290: illegal multibyte sequence
2024-08-29 16:07:57,239 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xc3 in position 25290: illegal multibyte sequence
2024-08-29 16:07:57,240 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xc3 in position 25290: illegal multibyte sequence
2024-08-29 16:07:57,240 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xc3 in position 25290: illegal multibyte sequence
2024-08-29 16:07:57,241 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xc3 in position 25290: illegal multibyte sequence
2024-08-29 16:07:57,241 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xc3 in position 25290: illegal multibyte sequence
2024-08-29 16:07:57,241 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xc3 in position 25290: illegal multibyte sequence
2024-08-29 16:07:57,242 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xc3 in position 25290: illegal multibyte sequence
2024-08-29 16:07:57,244 | Level 5 | iso8859_10 passed initial chaos probing. Mean measured chaos is 2.417000 %
2024-08-29 16:07:57,244 | Level 5 | iso8859_10 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,247 | Level 5 | We detected language [('English', 0.6724), ('Norwegian', 0.6114), ('Danish', 0.5763), ('Italian', 0.5658), ('Estonian', 0.5537), ('Swedish', 0.5444), ('Spanish', 0.5437), ('Dutch', 0.5419), ('Czech', 0.5418), ('Finnish', 0.5352), ('French', 0.5335), ('Hungarian', 0.5108), ('Slovene', 0.4991), ('Indonesian', 0.4928), ('German', 0.4779), ('Slovak', 0.4756), ('Romanian', 0.474), ('Portuguese', 0.457), ('Croatian', 0.4454), ('Polish', 0.4449), ('Turkish', 0.4304), ('Lithuanian', 0.3957), ('Vietnamese', 0.3901)] using iso8859_10
2024-08-29 16:07:57,254 | Level 5 | iso8859_11 passed initial chaos probing. Mean measured chaos is 2.333000 %
2024-08-29 16:07:57,254 | Level 5 | iso8859_11 should target any language(s) of ['Thai']
2024-08-29 16:07:57,262 | Level 5 | iso8859_13 passed initial chaos probing. Mean measured chaos is 2.417000 %
2024-08-29 16:07:57,262 | Level 5 | iso8859_13 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,265 | Level 5 | We detected language [('English', 0.6724), ('Norwegian', 0.6114), ('Danish', 0.5763), ('Italian', 0.5658), ('Estonian', 0.5537), ('Swedish', 0.5444), ('Spanish', 0.5437), ('Dutch', 0.5419), ('Czech', 0.5418), ('Finnish', 0.5352), ('French', 0.5335), ('Hungarian', 0.5108), ('Slovene', 0.5091), ('Indonesian', 0.4928), ('German', 0.4779), ('Slovak', 0.4756), ('Romanian', 0.474), ('Croatian', 0.4611), ('Portuguese', 0.457), ('Polish', 0.4449), ('Turkish', 0.4304), ('Lithuanian', 0.4057), ('Vietnamese', 0.3901)] using iso8859_13
2024-08-29 16:07:57,272 | Level 5 | iso8859_14 passed initial chaos probing. Mean measured chaos is 2.417000 %
2024-08-29 16:07:57,272 | Level 5 | iso8859_14 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,275 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.6114), ('Danish', 0.5763), ('Italian', 0.5658), ('Estonian', 0.5537), ('Slovene', 0.5455), ('Swedish', 0.5444), ('Spanish', 0.5437), ('Dutch', 0.5419), ('Czech', 0.5418), ('Finnish', 0.5352), ('French', 0.5335), ('Hungarian', 0.5108), ('Indonesian', 0.5103), ('German', 0.4779), ('Slovak', 0.4756), ('Romanian', 0.474), ('Portuguese', 0.457), ('Croatian', 0.4454), ('Polish', 0.4449), ('Turkish', 0.4304), ('Lithuanian', 0.4077), ('Vietnamese', 0.3901)] using iso8859_14
2024-08-29 16:07:57,281 | Level 5 | iso8859_15 passed initial chaos probing. Mean measured chaos is 2.417000 %
2024-08-29 16:07:57,281 | Level 5 | iso8859_15 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,281 | Level 5 | We detected language [('English', 0.6724), ('Norwegian', 0.6114), ('Danish', 0.5763), ('Italian', 0.5658), ('Estonian', 0.5537), ('Swedish', 0.5444), ('Spanish', 0.5437), ('Dutch', 0.5419), ('Czech', 0.5418), ('Finnish', 0.5352), ('French', 0.5335), ('Hungarian', 0.5108), ('Slovene', 0.4991), ('Indonesian', 0.4928), ('German', 0.4779), ('Slovak', 0.4756), ('Romanian', 0.474), ('Portuguese', 0.457), ('Croatian', 0.4454), ('Polish', 0.4449), ('Turkish', 0.4304), ('Lithuanian', 0.3957), ('Vietnamese', 0.3901)] using iso8859_15
2024-08-29 16:07:57,289 | Level 5 | iso8859_16 passed initial chaos probing. Mean measured chaos is 2.417000 %
2024-08-29 16:07:57,289 | Level 5 | iso8859_16 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,292 | Level 5 | We detected language [('English', 0.6724), ('Norwegian', 0.6037), ('Danish', 0.5686), ('Italian', 0.5658), ('Estonian', 0.5537), ('Spanish', 0.5437), ('Dutch', 0.5419), ('Czech', 0.5418), ('Swedish', 0.5367), ('Finnish', 0.5352), ('French', 0.5335), ('Hungarian', 0.5108), ('Slovene', 0.4991), ('Indonesian', 0.4928), ('German', 0.4779), ('Slovak', 0.4756), ('Romanian', 0.474), ('Portuguese', 0.457), ('Croatian', 0.4531), ('Polish', 0.4449), ('Turkish', 0.4304), ('Lithuanian', 0.3957), ('Vietnamese', 0.3901)] using iso8859_16
2024-08-29 16:07:57,299 | Level 5 | iso8859_2 passed initial chaos probing. Mean measured chaos is 2.417000 %
2024-08-29 16:07:57,300 | Level 5 | iso8859_2 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,301 | Level 5 | We detected language [('English', 0.6724), ('Norwegian', 0.6037), ('Danish', 0.5686), ('Italian', 0.5658), ('Estonian', 0.5537), ('Spanish', 0.5437), ('Dutch', 0.5419), ('Czech', 0.5418), ('Swedish', 0.5367), ('Finnish', 0.5352), ('French', 0.5335), ('Hungarian', 0.5108), ('Slovene', 0.4991), ('Indonesian', 0.4928), ('German', 0.4779), ('Slovak', 0.4756), ('Romanian', 0.474), ('Portuguese', 0.457), ('Croatian', 0.4454), ('Polish', 0.4449), ('Turkish', 0.4304), ('Lithuanian', 0.3957), ('Vietnamese', 0.3901)] using iso8859_2
2024-08-29 16:07:57,307 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc3 in position 25290: character maps to <undefined>
2024-08-29 16:07:57,309 | Level 5 | iso8859_4 passed initial chaos probing. Mean measured chaos is 2.417000 %
2024-08-29 16:07:57,309 | Level 5 | iso8859_4 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,311 | Level 5 | We detected language [('English', 0.6724), ('Norwegian', 0.6114), ('Danish', 0.5763), ('Italian', 0.5658), ('Estonian', 0.5537), ('Swedish', 0.5444), ('Spanish', 0.5437), ('Dutch', 0.5419), ('Czech', 0.5418), ('Finnish', 0.5352), ('French', 0.5335), ('Hungarian', 0.5108), ('Slovene', 0.4991), ('Indonesian', 0.4928), ('German', 0.4779), ('Slovak', 0.4756), ('Romanian', 0.474), ('Portuguese', 0.457), ('Croatian', 0.4454), ('Polish', 0.4449), ('Turkish', 0.4304), ('Lithuanian', 0.3957), ('Vietnamese', 0.3901)] using iso8859_4
2024-08-29 16:07:57,319 | Level 5 | iso8859_5 passed initial chaos probing. Mean measured chaos is 2.400000 %
2024-08-29 16:07:57,319 | Level 5 | iso8859_5 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-08-29 16:07:57,327 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xb1 in position 25291: character maps to <undefined>
2024-08-29 16:07:57,329 | Level 5 | iso8859_7 passed initial chaos probing. Mean measured chaos is 2.333000 %
2024-08-29 16:07:57,329 | Level 5 | iso8859_7 should target any language(s) of ['Greek']
2024-08-29 16:07:57,336 | Level 5 | Code page iso8859_8 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xc3 in position 25290: character maps to <undefined>
2024-08-29 16:07:57,338 | Level 5 | iso8859_9 passed initial chaos probing. Mean measured chaos is 2.417000 %
2024-08-29 16:07:57,339 | Level 5 | iso8859_9 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,341 | Level 5 | We detected language [('English', 0.6724), ('Norwegian', 0.6114), ('Danish', 0.5763), ('Italian', 0.5658), ('Estonian', 0.5537), ('Swedish', 0.5444), ('Spanish', 0.5437), ('Dutch', 0.5419), ('Czech', 0.5418), ('Finnish', 0.5352), ('French', 0.5335), ('Hungarian', 0.5108), ('Slovene', 0.4991), ('Indonesian', 0.4928), ('German', 0.4779), ('Slovak', 0.4756), ('Romanian', 0.474), ('Portuguese', 0.457), ('Turkish', 0.4461), ('Croatian', 0.4454), ('Polish', 0.4449), ('Lithuanian', 0.3957), ('Vietnamese', 0.3901)] using iso8859_9
2024-08-29 16:07:57,349 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xd0 in position 25666: illegal multibyte sequence
2024-08-29 16:07:57,351 | Level 5 | koi8_r passed initial chaos probing. Mean measured chaos is 2.533000 %
2024-08-29 16:07:57,351 | Level 5 | koi8_r should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-08-29 16:07:57,359 | Level 5 | Code page koi8_t does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa9 in position 25380: character maps to <undefined>
2024-08-29 16:07:57,361 | Level 5 | koi8_u passed initial chaos probing. Mean measured chaos is 2.533000 %
2024-08-29 16:07:57,361 | Level 5 | koi8_u should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-08-29 16:07:57,369 | Level 5 | Code page kz1048 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0x98 in position 46325: character maps to <undefined>
2024-08-29 16:07:57,369 | Level 5 | latin_1 passed initial chaos probing. Mean measured chaos is 2.417000 %
2024-08-29 16:07:57,370 | Level 5 | latin_1 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,370 | Level 5 | We detected language [('English', 0.6724), ('Norwegian', 0.6114), ('Danish', 0.5763), ('Italian', 0.5658), ('Estonian', 0.5537), ('Swedish', 0.5444), ('Spanish', 0.5437), ('Dutch', 0.5419), ('Czech', 0.5418), ('Finnish', 0.5352), ('French', 0.5335), ('Hungarian', 0.5108), ('Slovene', 0.4991), ('Indonesian', 0.4928), ('German', 0.4779), ('Slovak', 0.4756), ('Romanian', 0.474), ('Portuguese', 0.457), ('Croatian', 0.4454), ('Polish', 0.4449), ('Turkish', 0.4304), ('Lithuanian', 0.3957), ('Vietnamese', 0.3901)] using latin_1
2024-08-29 16:07:57,379 | Level 5 | mac_cyrillic passed initial chaos probing. Mean measured chaos is 0.867000 %
2024-08-29 16:07:57,380 | Level 5 | mac_cyrillic should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-08-29 16:07:57,390 | Level 5 | mac_greek passed initial chaos probing. Mean measured chaos is 1.700000 %
2024-08-29 16:07:57,390 | Level 5 | mac_greek should target any language(s) of ['Greek']
2024-08-29 16:07:57,401 | Level 5 | mac_iceland passed initial chaos probing. Mean measured chaos is 1.200000 %
2024-08-29 16:07:57,401 | Level 5 | mac_iceland should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,403 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5708), ('Slovene', 0.5455), ('Danish', 0.5435), ('Italian', 0.5387), ('Estonian', 0.5301), ('Swedish', 0.5204), ('Finnish', 0.5193), ('French', 0.5172), ('Indonesian', 0.5103), ('Dutch', 0.5101), ('Czech', 0.5023), ('Hungarian', 0.4963), ('Spanish', 0.4867), ('German', 0.4602), ('Romanian', 0.4526), ('Portuguese', 0.4518), ('Polish', 0.4379), ('Croatian', 0.4234), ('Slovak', 0.4213), ('Turkish', 0.4091), ('Lithuanian', 0.4077), ('Vietnamese', 0.3646)] using mac_iceland
2024-08-29 16:07:57,414 | Level 5 | mac_latin2 passed initial chaos probing. Mean measured chaos is 1.083000 %
2024-08-29 16:07:57,414 | Level 5 | mac_latin2 should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,417 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5604), ('Slovene', 0.5455), ('Estonian', 0.527), ('Danish', 0.525), ('Swedish', 0.5159), ('Italian', 0.5134), ('Dutch', 0.5129), ('Indonesian', 0.5103), ('Finnish', 0.51), ('French', 0.5072), ('Czech', 0.4982), ('Hungarian', 0.487), ('German', 0.4617), ('Romanian', 0.4502), ('Spanish', 0.4414), ('Slovak', 0.4308), ('Polish', 0.4279), ('Croatian', 0.4269), ('Portuguese', 0.4114), ('Lithuanian', 0.4077), ('Turkish', 0.394), ('Vietnamese', 0.3637)] using mac_latin2
2024-08-29 16:07:57,426 | Level 5 | mac_roman passed initial chaos probing. Mean measured chaos is 1.200000 %
2024-08-29 16:07:57,426 | Level 5 | mac_roman should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,426 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5708), ('Slovene', 0.5455), ('Danish', 0.5435), ('Italian', 0.5387), ('Estonian', 0.5301), ('Swedish', 0.5204), ('Finnish', 0.5193), ('French', 0.5172), ('Indonesian', 0.5103), ('Dutch', 0.5101), ('Czech', 0.5023), ('Hungarian', 0.4963), ('Spanish', 0.4867), ('German', 0.4602), ('Romanian', 0.4526), ('Portuguese', 0.4518), ('Polish', 0.4379), ('Croatian', 0.4234), ('Slovak', 0.4213), ('Turkish', 0.4091), ('Lithuanian', 0.4077), ('Vietnamese', 0.3646)] using mac_roman
2024-08-29 16:07:57,436 | Level 5 | mac_turkish passed initial chaos probing. Mean measured chaos is 1.200000 %
2024-08-29 16:07:57,436 | Level 5 | mac_turkish should target any language(s) of ['Latin Based']
2024-08-29 16:07:57,436 | Level 5 | We detected language [('English', 0.6966), ('Norwegian', 0.5708), ('Slovene', 0.5455), ('Danish', 0.5435), ('Italian', 0.5387), ('Estonian', 0.5301), ('Swedish', 0.5204), ('Finnish', 0.5193), ('French', 0.5172), ('Indonesian', 0.5103), ('Dutch', 0.5101), ('Czech', 0.5023), ('Hungarian', 0.4963), ('Spanish', 0.4867), ('German', 0.4602), ('Romanian', 0.4526), ('Portuguese', 0.4518), ('Polish', 0.4379), ('Croatian', 0.4234), ('Slovak', 0.4213), ('Turkish', 0.4091), ('Lithuanian', 0.4077), ('Vietnamese', 0.3646)] using mac_turkish
2024-08-29 16:07:57,441 | Level 5 | ptcp154 passed initial chaos probing. Mean measured chaos is 0.333000 %
2024-08-29 16:07:57,441 | Level 5 | ptcp154 should target any language(s) of ['Russian', 'Ukrainian', 'Serbian', 'Bulgarian', 'Kazakh']
2024-08-29 16:07:57,452 | Level 5 | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0x87 in position 25577: illegal multibyte sequence
2024-08-29 16:07:57,452 | Level 5 | Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0x87 in position 25577: illegal multibyte sequence
2024-08-29 16:07:57,453 | Level 5 | Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0x87 in position 25577: illegal multibyte sequence
2024-08-29 16:07:57,453 | Level 5 | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xa0 in position 25667: character maps to <undefined>
2024-08-29 16:07:57,453 | Level 5 | Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2024-08-29 16:07:57,453 | Level 5 | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode bytes in position 25782-25783: illegal UTF-16 surrogate
2024-08-29 16:07:57,453 | Level 5 | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode bytes in position 25796-25797: illegal UTF-16 surrogate
2024-08-29 16:07:57,453 | Level 5 | Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2024-08-29 16:07:57,454 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2024-08-29 16:07:57,454 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2024-08-29 16:07:57,454 | Level 5 | Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
2024-08-29 16:07:57,454 | DEBUG | Encoding detection: Found gb18030 as plausible (best-candidate) for content. With 37 alternatives.
{
    "path": "/<REDACTED>/View Profile_ - Legion Gaming Community.html",
    "encoding": "gb18030",
    "encoding_aliases": [
        "gb18030_2000"
    ],
    "alternative_encodings": [],
    "language": "Chinese",
    "alphabets": [
        "Basic Latin",
        "CJK Unified Ideographs",
        "Control character",
        "Private Use Area"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.32,
    "coherence": 0.0,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding

Expected UTF-8. UTF-8 decoding appears to fail at a single position (bytes 107206-107207 in the verbose output above), while the rest of the content seems to be valid UTF-8. In VS Code, it looks like this:

image

Using UTF-8, the content is as expected...

image

Using gb18030 in VS Code, it looks... odd...

image
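
To check the claim that only one spot in the file is invalid UTF-8, the decode errors can be enumerated with the standard library alone. This is a minimal sketch, not part of charset-normalizer; the function name `find_utf8_errors` and the `limit` parameter are made up for illustration.

```python
from typing import List, Tuple


def find_utf8_errors(data: bytes, limit: int = 20) -> List[Tuple[int, int, str]]:
    """Return (start, end, reason) for each invalid UTF-8 span, up to `limit`."""
    errors: List[Tuple[int, int, str]] = []
    pos = 0
    while pos < len(data) and len(errors) < limit:
        try:
            # Strict decode of the remainder; success means no more errors.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are relative to the slice, so shift by pos.
            errors.append((pos + exc.start, pos + exc.end, exc.reason))
            # Resume right after the offending bytes.
            pos += exc.end
    return errors


if __name__ == "__main__":
    sample = "héllo".encode("utf-8") + b"\xc3(" + b"ok"
    for start, end, reason in find_utf8_errors(sample):
        print(f"bytes {start}-{end - 1}: {reason}")
```

Run against the attached file, a single reported span would confirm that the page is UTF-8 with one corrupted sequence rather than a different encoding.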

Additional context I've noticed this specific failure condition a couple of times while processing CommonCrawl. One workaround is to only allow gb18030 when no initial encoding is detected. However, I might as well report this detection issue. :)

Ousret commented 1 month ago

Unfortunately charset-normalizer cannot work around corrupted elements. Even for a single character. Same as https://github.com/jawah/charset_normalizer/issues/354

regards,

KaraKaraWitch commented 1 month ago

Thanks for the reply! Understood that it's a limitation in charset-normalizer.

I added a workaround to my code that checks what fraction of the content fails to decode with the original encoding. If it's below a certain percentage, the corruption is treated as ignorable and the original encoding is kept; otherwise I fall back to the guessed_encoding (from charset-normalizer).
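
A condensed, self-contained sketch of that idea follows. It is not the exact code from my pipeline: the function name `pick_encoding` and the `max_bad_ratio` parameter are hypothetical, and the 0.25% default only mirrors the threshold used in the snippet. It also assumes U+FFFD does not legitimately occur in the content.

```python
import codecs
from typing import Optional


def pick_encoding(
    data: bytes,
    declared: Optional[str],
    guessed: str,
    max_bad_ratio: float = 0.0025,  # 0.25% of the input length
) -> str:
    """Prefer the declared encoding when only a tiny fraction fails to decode."""
    if not declared:
        return guessed
    try:
        codecs.lookup(declared)
    except LookupError:
        # Declared encoding is unknown to Python; the guess is all we have.
        return guessed
    if not data:
        return declared
    # Each undecodable sequence becomes one U+FFFD replacement character.
    bad = data.decode(declared, errors="replace").count("\ufffd")
    return declared if bad / len(data) < max_bad_ratio else guessed


if __name__ == "__main__":
    mostly_utf8 = ("a" * 1000).encode("utf-8") + b"\xc3("
    print(pick_encoding(mostly_utf8, "utf-8", "gb18030"))
```

The ratio divides replacement characters by the byte length, as in the snippet, so it slightly understates corruption for multi-byte text; for a coarse threshold that seems acceptable.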

For those wondering here's a snippet of the code. It's a bit messy but the comments should be enough:

import codecs


def get_errored_decodable_counts(encoding: str, data: bytes) -> int:
    """Counts the replacement characters produced when decoding with a given encoding

    Args:
        encoding (str): The encoding to test
        data (bytes): The bytes data to try and decode from

    Returns:
        int: The number of U+FFFD replacement characters, i.e. undecodable sequences
    """

    return data.decode(encoding, errors="replace").count("\ufffd")

def is_codec_exists(encoding: str) -> bool:
    """Returns True if Python has a codec registered under this name"""
    try:
        codecs.lookup(encoding)
        return True
    except LookupError:
        return False

orig_encoding = ""
record_content = b"Bytes content"
failurecounts = None
if orig_encoding and is_codec_exists(orig_encoding):
    failurecounts = get_errored_decodable_counts(
        orig_encoding, record_content
    ) / len(record_content)

# ... Further down the code looks like this

# Use original encoding if the original encoding looks better overall.
elif guessed_encoding != orig_encoding and not is_codec_exists(orig_encoding):
    # Guess we have to use the guessed encoding
    filter_comments += (
        f'<[Original Encoding] "{orig_encoding}" does not exist. Using guessed.>'
    )
    correct_encoding = guessed_encoding
elif guessed_encoding != orig_encoding and failurecounts:
    filter_comments += f"<fc of [O] {orig_encoding} [G] {guessed_encoding} [%] {round(failurecounts*100,ndigits=2)}%>"
    FC_PERCENT = 0.25
    if failurecounts * 100 < FC_PERCENT:
        filter_comments += f"<[UseOrig] [EDecode] fc usage < {FC_PERCENT}%>"
        correct_encoding = orig_encoding
    else:
        filter_comments += f"<[UseGuess] [EDecode] fc usage >= {FC_PERCENT}%>"