jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[BUG] identifies UTF16LE for a pair of ascii punctuation characters #509

Closed GavinHuttley closed 1 month ago

GavinHuttley commented 2 months ago

Describe the bug

Passing a short string of ordinary ASCII punctuation to detect returns UTF-16LE as the encoding.

To Reproduce

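import chardet
import charset_normalizer

# charset_normalizer reports UTF-16LE for two ASCII punctuation bytes
charset_normalizer.detect(b");")  # error also happens with b"(;"
# returns {'encoding': 'utf_16_le', 'language': '', 'confidence': 1.0}

# chardet reports ASCII for the same input
chardet.detect(b");")
# returns {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}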

Expected behavior

These are standard ASCII characters, so I expect a UTF-8 encoding.

Additional context

Evaluating any of b"(", b")", b";" or b"()" produces the expected result. Other combinations of punctuation characters produce the same error, e.g. b".;".
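For context on why a two-byte input is ambiguous, here is a small standard-library sketch. It only shows that these byte pairs are also well-formed UTF-16LE; it makes no claim about how charset_normalizer scores candidates internally. The helper name is_valid_utf16le is made up for this illustration.

```python
def is_valid_utf16le(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-16LE without error (hypothetical helper)."""
    try:
        data.decode("utf_16_le")
        return True
    except UnicodeDecodeError:
        return False


# Two-byte inputs from this report: each pair forms one UTF-16LE code unit
# (low byte first), so all of them decode without error.
samples = [b");", b"(;", b".;", b"()"]
decoded = {raw: raw.decode("utf_16_le") for raw in samples}

for raw, text in decoded.items():
    print(raw, "->", f"U+{ord(text):04X}")
# b');' -> U+3B29
# b'(;' -> U+3B28
# b'.;' -> U+3B2E
# b'()' -> U+2928

# A single byte is a truncated UTF-16 code unit, so it cannot be UTF-16LE.
print(is_valid_utf16le(b"("))  # False
```

The pairs ending in b";" land in the CJK Unified Ideographs Extension A block, while b"()" lands in Supplemental Arrows-B. Whether that difference is what separates the failing inputs from b"()" is a guess, not something confirmed in this thread.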

I understand this is a very small string, but perhaps the detector could default to the minimal character set that can represent the input?
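The suggestion above could be approximated on the caller's side with a wrapper like the following. This is a minimal sketch, not the fix the maintainers shipped: detect_with_ascii_shortcut is a hypothetical helper, the returned dictionary only mimics the shape shown in the reproduction, and the NUL-byte check is an added assumption so that BOM-less UTF-16 text is still handed to the real detector.

```python
from typing import Callable


def detect_with_ascii_shortcut(data: bytes, detector: Callable[[bytes], dict]) -> dict:
    """Report ASCII for pure-ASCII input, otherwise defer to `detector`.

    Hypothetical helper; `detector` is any callable with the same shape as
    the detect functions used in the reproduction.
    """
    # bytes.isascii() is True when every byte is below 0x80. NUL bytes are
    # excluded because ASCII text encoded as UTF-16 consists of them.
    if data.isascii() and b"\x00" not in data:
        return {"encoding": "ascii", "language": "", "confidence": 1.0}
    return detector(data)
```

It could be called as detect_with_ascii_shortcut(b");", charset_normalizer.detect). The trade-off is that any 7-bit encoding that is not ASCII (ISO-2022-JP, for example) would also be reported as ASCII, so the shortcut only suits callers who accept that.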

Ousret commented 1 month ago

We fixed that case. It will be available in the next release.