jawah / charset_normalizer

Truly universal encoding detector in pure Python
https://charset-normalizer.readthedocs.io/en/latest/
MIT License

[DETECTION] On the question of speed #428

Closed Byxs20 closed 6 months ago

Byxs20 commented 6 months ago
import time

import chardet
import charset_normalizer

def warmup(text):
    # Run each detector a few times first so one-time setup
    # (imports, caches) does not skew the timed runs.
    for _ in range(10):
        chardet.detect(text)

    for _ in range(10):
        charset_normalizer.detect(text)

def get_time(text, num):
    start = time.time()
    for _ in range(num):
        res1 = chardet.detect(text)
    spend_time = time.time() - start
    print(spend_time, spend_time / num)  # total and per-call seconds

    start = time.time()
    for _ in range(num):
        res2 = charset_normalizer.detect(text)
    spend_time = time.time() - start
    print(spend_time, spend_time / num)

    print(f"chardet: {res1}")
    print(f"charset_normalizer: {res2}", end='\n\n')

if __name__ == '__main__':
    text1 = ("Hello, World!" * 50).encode()
    text2 = ("你好,测试一下UTF-8" * 50).encode('utf-8')  # "Hello, testing UTF-8"

    warmup(text1)
    get_time(text=text1, num=1000)
    get_time(text=text2, num=1000)
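The manual start/stop timing above can also be written with the stdlib `timeit` module, which handles the loop and avoids some clock pitfalls. This is an illustrative sketch, not part of the original benchmark; `bench` is a hypothetical helper that accepts any detector callable.

```python
import timeit

def bench(detect, payload, number=1000):
    """Time `number` calls of detect(payload); return seconds per call."""
    total = timeit.timeit(lambda: detect(payload), number=number)
    return total / number

# Usage (assuming chardet / charset_normalizer are installed):
#   per_call = bench(chardet.detect, text1)
#   per_call = bench(charset_normalizer.detect, text2)
```

The same warm-up caveat applies: call each detector a few times before measuring.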

Hi! When I ran this speed test, detection seemed slower on the English input, but the improvement on the Chinese input was huge. Overall the speed is fine; I'm just raising the question. Your project is very strong! I should take the time to switch my own package over to yours.


Ousret commented 6 months ago

Your observation seems right. Indeed, inference on UTF-8 input (what you labelled as English) takes a bit longer overall because our detector is more complex.

Take this for example: Charset Detection, for Everyone 👋, which encodes to Charset Detection, for Everyone \xf0\x9f\x91\x8b

>>> chardet.detect(b'Charset Detection, for Everyone \xf0\x9f\x91\x8b')
{'encoding': 'Windows-1254', 'confidence': 0.4957960183590231, 'language': 'Turkish'}

>>> charset_normalizer.detect(b'Charset Detection, for Everyone \xf0\x9f\x91\x8b')
{'encoding': 'utf-8', 'language': '', 'confidence': 1.0}

Or Je suis pas d'accord avec Ahméd ("I don't agree with Ahméd"), which encodes to Je suis pas d'accord avec Ahm\xc3\xa9d.

>>> chardet.detect(b"Je suis pas d'accord avec Ahm\xc3\xa9d")
{'encoding': 'ISO-8859-9', 'confidence': 0.5648588804140238, 'language': 'Turkish'}

>>> charset_normalizer.detect(b"Je suis pas d'accord avec Ahm\xc3\xa9d")
{'encoding': 'utf-8', 'language': '', 'confidence': 1.0}

We couldn't be faster without making some small compromises on Unicode handling.
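The point can be illustrated with a stdlib-only sketch (this is not how either library works internally, just a demonstration): strict UTF-8 validation gives an unambiguous answer for both examples above, whereas a purely statistical single-byte model can only score them by character frequency, which is how chardet ends up guessing Windows-1254/ISO-8859-9.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` is strictly valid UTF-8."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# The UTF-8 bytes from the examples validate cleanly...
looks_like_utf8(b'Charset Detection, for Everyone \xf0\x9f\x91\x8b')
# ...while the same text mis-encoded as Latin-1 does not
# (0xe9 starts a multi-byte sequence but 'd' is not a continuation byte):
looks_like_utf8(b"Je suis pas d'accord avec Ahm\xe9d")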

Hope that clarifies things. Enjoy charset-normalizer's capabilities.

Byxs20 commented 6 months ago

Thanks for the open source project, it's really awesome!