Your observation seems right. Indeed, inference on UTF-8 input (what you labelled as English) will take a bit longer overall because our detector is more complex.
Take this for example: Charset Detection, for Everyone 👋
which encodes in UTF-8 to Charset Detection, for Everyone \xf0\x9f\x91\x8b
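(You can verify the byte sequence yourself; \xf0\x9f\x91\x8b is simply the UTF-8 encoding of the waving-hand emoji.)
>>> 'Charset Detection, for Everyone 👋'.encode('utf-8')
b'Charset Detection, for Everyone \xf0\x9f\x91\x8b'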
>>> chardet.detect(b'Charset Detection, for Everyone \xf0\x9f\x91\x8b')
{'encoding': 'Windows-1254', 'confidence': 0.4957960183590231, 'language': 'Turkish'}
>>> charset_normalizer.detect(b'Charset Detection, for Everyone \xf0\x9f\x91\x8b')
{'encoding': 'utf-8', 'language': '', 'confidence': 1.0}
Or Je suis pas d'accord avec Ahméd (French for "I don't agree with Ahméd"),
which encodes in UTF-8 to Je suis pas d'accord avec Ahm\xc3\xa9d.
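(Again verifiable in a REPL; \xc3\xa9 is the UTF-8 encoding of é.)
>>> "Je suis pas d'accord avec Ahméd".encode('utf-8')
b"Je suis pas d'accord avec Ahm\xc3\xa9d"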
>>> chardet.detect(b"Je suis pas d'accord avec Ahm\xc3\xa9d")
{'encoding': 'ISO-8859-9', 'confidence': 0.5648588804140238, 'language': 'Turkish'}
>>> charset_normalizer.detect(b"Je suis pas d'accord avec Ahm\xc3\xa9d")
{'encoding': 'utf-8', 'language': '', 'confidence': 1.0}
We couldn't be faster without making some small compromises on Unicode support.
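If you want to quantify the trade-off on your side, here is a minimal timing sketch (not an official benchmark: it assumes both chardet and charset-normalizer are installed, the sample strings are arbitrary, and absolute numbers will vary by machine and version):

import timeit

import chardet
import charset_normalizer

# Two payloads: plain UTF-8 text and a Chinese sample in GBK, a common legacy encoding.
payloads = {
    'utf-8': 'Charset Detection, for Everyone 👋'.encode('utf-8'),
    'gbk': '字符集检测，适用于所有人'.encode('gbk'),
}

for label, data in payloads.items():
    for name, detect in (('chardet', chardet.detect),
                         ('charset_normalizer', charset_normalizer.detect)):
        elapsed = timeit.timeit(lambda: detect(data), number=100)
        print(f'{label} / {name}: {elapsed:.3f}s for 100 runs')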
Hope that clarifies things. Enjoy charset-normalizer's capabilities.
Thanks for the open source project, it's really awesome!
Hi! When I ran the speed test, I found it seemed slower on English text, but the improvement on Chinese text was huge. Overall the speed is not slow. I was just raising my doubts; overall your project is very strong! I should take the time to replace the package I made with yours.