Closed: milahu closed this issue 2 months ago
I understand the frustration around very small payloads. Nevertheless, no evidence has emerged that "other" algorithms have a way of determining the "right" encoding for the presented case.
Any algorithm giving an answer for fewer than 30 characters is making a bold guess; fewer than 8 is often pure luck.
If you extended the "alternative" algorithms to 99 charsets (as we can), they would output chaotic results (I tried before writing this one).
My suggestion here is to tweak the detector as explained in https://github.com/jawah/charset_normalizer/issues/477
charset-normalizer isn't made to be aware of the context of the given input. So I can suggest the following:
```python
if my_context == DetectUseCase.FILENAME:
    ...
elif my_context == DetectUseCase.CODE:
    ...
elif my_context == DetectUseCase.TEXT:
    ...
```
Each presented scenario is expected to have a limited range of charsets.
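One way to act on that idea is a small dispatcher that only tries the charsets plausible for each context. This is a minimal stdlib-only sketch: the `DetectUseCase` enum and the candidate lists are hypothetical illustrations, not part of charset_normalizer's API.

```python
from enum import Enum, auto

class DetectUseCase(Enum):
    # hypothetical contexts, mirroring the pseudocode above
    FILENAME = auto()
    CODE = auto()
    TEXT = auto()

# each scenario restricted to a limited range of charsets (illustrative choices)
CANDIDATES = {
    DetectUseCase.FILENAME: ["utf-8", "latin-1"],
    DetectUseCase.CODE: ["utf-8", "ascii"],
    DetectUseCase.TEXT: ["utf-8", "cp1252", "latin-1"],
}

def guess_decode(raw: bytes, context: DetectUseCase) -> tuple[str, str]:
    """Try only the charsets plausible for this context; return (encoding, text)."""
    for enc in CANDIDATES[context]:
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # latin-1 always decodes, so this is only reachable for contexts without it
    raise ValueError("no candidate charset could decode the input")
```

In practice, one could instead pass the per-context candidate list to charset_normalizer's `from_bytes` via its `cp_isolation` parameter rather than hand-rolling the decode loop.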
regards,
Notice
I hereby announce that my raw input is not:

Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content

Input
Output

Expected

`latin1` aka ISO-8859-1

for short inputs, charset_normalizer should use `chardet`
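The expectation above could be sketched as a thin router that hands very short payloads to a short-input specialist. This is an assumption-laden illustration, not an existing charset_normalizer feature: the detector callables are injected, and the 32-byte threshold is taken from the issue template's warning.

```python
from typing import Callable, Optional

# a detector takes raw bytes and returns an encoding name (or None)
Detector = Callable[[bytes], Optional[str]]

def detect_with_fallback(
    raw: bytes,
    short_detector: Detector,   # e.g. a chardet-based detector
    long_detector: Detector,    # e.g. a charset_normalizer-based detector
    short_threshold: int = 32,  # mirrors the template's <=32-character warning
) -> Optional[str]:
    """Route very short payloads to the short-input specialist."""
    if len(raw) <= short_threshold:
        return short_detector(raw)
    return long_detector(raw)
```

With the real libraries, the two callables might be `lambda b: chardet.detect(b)["encoding"]` and a small wrapper around `charset_normalizer.from_bytes(b).best()`.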
Env
charset_normalizer version 3.3.2
Context

the readme says

"every"? chardet just works on short inputs... at least with latin1 encoding :P

the chrome browser seems to use `chardet` or `magic` to guess non-utf8 filename encodings of downloads

Longer Input