jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[DETECTION] fails on short input #486

Closed milahu closed 2 months ago

milahu commented 2 months ago

Notice

I hereby announce that my raw input is not:

Input

>>> "ü".encode("latin1")
b'\xfc'

Output

>>> import charset_normalizer, magic, chardet

>>> list(map(lambda m: m.encoding, charset_normalizer.from_bytes("ü".encode("latin1"))._results))
['cp037', 'cp1006', 'cp1026', 'cp1250', 'cp1251', 'cp1253', 'cp273', 'cp437', 'cp775', 'cp852', 'cp855', 'cp864', 'cp869', 'cp875', 'iso8859_5', 'koi8_r', 'mac_greek', 'mac_latin2']

>>> chardet.detect("ü".encode("latin1"))["encoding"]
'ISO-8859-1'

>>> list(map(lambda m: m["encoding"], chardet.detect_all("ü".encode("latin1"))))
['ISO-8859-1']

>>> magic.detect_from_content("ü".encode("latin1")).encoding
'binary'

Expected

latin1 aka ISO-8859-1

for short inputs, charset_normalizer should use chardet
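A minimal sketch of the kind of fallback I mean, as a caller-side wrapper (the 32-byte cutoff and the function name are made up, not an existing charset_normalizer option):

import charset_normalizer
import chardet

# hypothetical cutoff below which charset_normalizer has too little data to work with
SHORT_INPUT_THRESHOLD = 32

def guess_encoding(raw: bytes):
    """Prefer charset_normalizer; fall back to chardet on tiny payloads."""
    if len(raw) >= SHORT_INPUT_THRESHOLD:
        best = charset_normalizer.from_bytes(raw).best()
        if best is not None:
            return best.encoding
    # very short input: take chardet's guess instead
    return chardet.detect(raw)["encoding"]

print(guess_encoding("ü".encode("latin1")))          # 'ISO-8859-1' via chardet
print(guess_encoding("ungekürzt".encode("latin1")))  # still below the cutoff, chardet again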

Env

charset_normalizer version 3.3.2

Context

the readme says

Known limitations

Every charset detector heavily depends on sufficient content. In common cases, do not bother run detection on very tiny content.

"every"? chardet just works on short inputs... at least with latin1 encoding : P

the Chrome browser seems to use chardet or magic to guess non-UTF-8 filename encodings of downloads

Longer Input

>>> "ungekürzt".encode("latin1")
b'ungek\xfcrzt'

>>> len("ungekürzt".encode("latin1"))
9

>>> import charset_normalizer, magic, chardet

>>> list(map(lambda m: m.encoding, charset_normalizer.from_bytes("ungekürzt".encode("latin1"))._results))
['big5hkscs', 'gb18030', 'shift_jis_2004', 'cp1006', 'cp1125', 'cp1250', 'cp1251', 'cp1253', 'cp437', 'cp775', 'cp864', 'cp869', 'cp875', 'hp_roman8', 'iso8859_5', 'mac_greek', 'mac_iceland']

>>> chardet.detect("ungekürzt".encode("latin1"))["encoding"]
'ISO-8859-1'

>>> list(map(lambda m: m["encoding"], chardet.detect_all("ungekürzt".encode("latin1"))))
['ISO-8859-1', 'ISO-8859-9']

>>> magic.detect_from_content("ungekürzt".encode("latin1")).encoding
'iso-8859-1'
Ousret commented 2 months ago

I understand the frustration around very small payloads. Nevertheless, no evidence has emerged that the "other" algorithms have a way of determining the "right" encoding for the presented case.

Any algorithm giving an answer for fewer than 30 characters is making a bold guess; below 8 characters it is often pure luck.

If you extended the "alternative" algorithms to the 99 charsets we can handle, they would output chaotic results (I tried that before writing this one).

My suggestion here is to tweak the detector as explained in https://github.com/jawah/charset_normalizer/issues/477

charset-normalizer isn't made to be aware of the context of the given input. So I can suggest the following:

if my_context == DetectUseCase.FILENAME:
    ...
elif my_context == DetectUseCase.CODE:
    ...
elif my_context == DetectUseCase.TEXT:
    ...

Each presented scenario is expected to have a limited range of charsets.
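A minimal sketch of what that could look like on the caller's side, assuming from_bytes's cp_isolation parameter; the DetectUseCase enum and the candidate lists are illustrative, not part of the library:

from enum import Enum, auto
from charset_normalizer import from_bytes

class DetectUseCase(Enum):  # hypothetical enum owned by the caller
    FILENAME = auto()
    CODE = auto()
    TEXT = auto()

# illustrative shortlists: each use case narrows the supported charsets
# down to the handful that are plausible in that context
CANDIDATES = {
    DetectUseCase.FILENAME: ["utf_8", "cp1252", "latin_1"],
    DetectUseCase.CODE: ["utf_8", "latin_1"],
    DetectUseCase.TEXT: ["utf_8", "cp1252", "cp1251", "latin_1"],
}

def detect_with_context(raw: bytes, ctx: DetectUseCase):
    # cp_isolation restricts detection to the given candidate encodings
    return from_bytes(raw, cp_isolation=CANDIDATES[ctx]).best()

match = detect_with_context("ü".encode("latin1"), DetectUseCase.FILENAME)
print(match.encoding if match is not None else None)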

regards,