jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[DETECTION] fails on short input #486

Closed milahu closed 2 months ago

milahu commented 2 months ago

Notice

I hereby announce that my raw input is not:

Input

>>> "ü".encode("latin1")
b'\xfc'

Output

>>> import charset_normalizer, magic, chardet

>>> list(map(lambda m: m.encoding, charset_normalizer.from_bytes("ü".encode("latin1"))._results))
['cp037', 'cp1006', 'cp1026', 'cp1250', 'cp1251', 'cp1253', 'cp273', 'cp437', 'cp775', 'cp852', 'cp855', 'cp864', 'cp869', 'cp875', 'iso8859_5', 'koi8_r', 'mac_greek', 'mac_latin2']

>>> chardet.detect("ü".encode("latin1"))["encoding"]
'ISO-8859-1'

>>> list(map(lambda m: m["encoding"], chardet.detect_all("ü".encode("latin1"))))
['ISO-8859-1']

>>> magic.detect_from_content("ü".encode("latin1")).encoding
'binary'

Expected

latin1 aka ISO-8859-1

for short inputs, charset_normalizer should use chardet
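A minimal sketch of the kind of fallback I mean, as a caller-side wrapper (the 32-byte cutoff and the function name are made up, not an existing charset_normalizer option):

import charset_normalizer
import chardet

# hypothetical cutoff below which charset_normalizer has too little data to work with
SHORT_INPUT_THRESHOLD = 32

def guess_encoding(raw: bytes):
    """Prefer charset_normalizer; fall back to chardet on tiny payloads."""
    if len(raw) >= SHORT_INPUT_THRESHOLD:
        best = charset_normalizer.from_bytes(raw).best()
        if best is not None:
            return best.encoding
    # very short input: take chardet's guess instead
    return chardet.detect(raw)["encoding"]

print(guess_encoding("ü".encode("latin1")))          # 'ISO-8859-1' via chardet
print(guess_encoding("ungekürzt".encode("latin1")))  # still below the cutoff, chardet again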

Env

charset_normalizer version 3.3.2

Context

the readme says

Known limitations

Every charset detector heavily depends on sufficient content. In common cases, do not bother run detection on very tiny content.

"every"? chardet just works on short inputs... at least with latin1 encoding : P

the Chrome browser seems to use chardet or magic to guess non-UTF-8 filename encodings of downloads

Longer Input

>>> "ungekürzt".encode("latin1")
b'ungek\xfcrzt'

>>> len("ungekürzt".encode("latin1"))
9

>>> import charset_normalizer, magic, chardet

>>> list(map(lambda m: m.encoding, charset_normalizer.from_bytes("ungekürzt".encode("latin1"))._results))
['big5hkscs', 'gb18030', 'shift_jis_2004', 'cp1006', 'cp1125', 'cp1250', 'cp1251', 'cp1253', 'cp437', 'cp775', 'cp864', 'cp869', 'cp875', 'hp_roman8', 'iso8859_5', 'mac_greek', 'mac_iceland']

>>> chardet.detect("ungekürzt".encode("latin1"))["encoding"]
'ISO-8859-1'

>>> list(map(lambda m: m["encoding"], chardet.detect_all("ungekürzt".encode("latin1"))))
['ISO-8859-1', 'ISO-8859-9']

>>> magic.detect_from_content("ungekürzt".encode("latin1")).encoding
'iso-8859-1'
Ousret commented 2 months ago

I understand the frustration around very small payloads. Nevertheless, no evidence has emerged that the "other" algorithms have a way of determining the "right" encoding for the presented case.

Any algorithm giving an answer for fewer than 30 characters is making a bold guess; below 8 characters it is often pure luck.

If you extended the "alternative" algorithms to the 99 charsets we can handle, they would output chaotic results (I tried that before writing this one).

My suggestion here is to tweak the detector as explained in https://github.com/jawah/charset_normalizer/issues/477

charset-normalizer isn't made to be aware of the context of the given input. So I can suggest the following:

if my_context == DetectUseCase.FILENAME:
    ...
elif my_context == DetectUseCase.CODE:
    ...
elif my_context == DetectUseCase.TEXT:
    ...

Each presented scenario is expected to have a limited range of charsets.
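A minimal sketch of what that could look like on the caller's side, assuming from_bytes's cp_isolation parameter; the DetectUseCase enum and the candidate lists are illustrative, not part of the library:

from enum import Enum, auto
from charset_normalizer import from_bytes

class DetectUseCase(Enum):  # hypothetical enum owned by the caller
    FILENAME = auto()
    CODE = auto()
    TEXT = auto()

# illustrative shortlists: each use case narrows the supported charsets
# down to the handful that are plausible in that context
CANDIDATES = {
    DetectUseCase.FILENAME: ["utf_8", "cp1252", "latin_1"],
    DetectUseCase.CODE: ["utf_8", "latin_1"],
    DetectUseCase.TEXT: ["utf_8", "cp1252", "cp1251", "latin_1"],
}

def detect_with_context(raw: bytes, ctx: DetectUseCase):
    # cp_isolation restricts detection to the given candidate encodings
    return from_bytes(raw, cp_isolation=CANDIDATES[ctx]).best()

match = detect_with_context("ü".encode("latin1"), DetectUseCase.FILENAME)
print(match.encoding if match is not None else None)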

regards,