javirandor / anthropic-tokenizer

Approximation of the Claude 3 tokenizer by inspecting generation stream
MIT License

Switch to Haiku to save 60x over Opus; no noticeable quality drop #1

Closed CLARKBENHAM closed 6 months ago

CLARKBENHAM commented 6 months ago

I tested using Haiku instead of Opus, since Haiku is 60x cheaper.

You can run the test script to compare all 3 models with `python -m scripts.test_tokenization`. In this small sample, all models returned the original string equally often.
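For context, this is roughly the shape of the round-trip check (a minimal sketch using the `anthropic` SDK; the prompt wording and model IDs here are my assumptions, not the repo's actual script):

```python
# Sketch: ask each Claude 3 model to echo a string back and check the
# reconstruction. Illustrative only; model IDs and prompt are assumptions.
import anthropic

MODELS = [
    "claude-3-haiku-20240307",
    "claude-3-sonnet-20240229",
    "claude-3-opus-20240229",
]

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def echo(model: str, text: str) -> str:
    """Ask `model` to repeat `text` verbatim and return its reply."""
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Repeat the following text exactly, with no commentary:\n{text}",
        }],
    )
    return resp.content[0].text

sample = "".join(chr(i) for i in [7496, 7387, 7020])
for model in MODELS:
    out = echo(model, sample)
    print(model, out == sample, [ord(c) for c in out])
```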

The different models disagree / get the wrong answer when there's random Unicode; I'm not sure why. E.g. on ᵈ᭬᳛ (`"".join(chr(i) for i in [7496, 7387, 7020])`), all 3 models get the order flipped, writing [7496, 7020, 7387] instead.
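To make the failure concrete: the swap is at the code-point level, and every character involved has a 3-byte UTF-8 encoding (quick check, runnable as-is):

```python
original = "".join(chr(i) for i in [7496, 7387, 7020])  # the test string above
flipped = "".join(chr(i) for i in [7496, 7020, 7387])   # what the models write

print([hex(ord(c)) for c in original])             # ['0x1d48', '0x1cdb', '0x1b6c']
print([len(c.encode("utf-8")) for c in original])  # [3, 3, 3]
```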

Generally, where the models disagree the Unicode has 2-3 byte UTF-8 encodings, but the bytes are often the same length, so it's not that a character occasionally gets split when the bytes stream back. E.g. on one run of s5 the models disagreed at only 1 location out of 100: 'Ⓡ' vs 'ⓘ' vs '⊙'; but these all have 3-byte UTF-8 encodings. (This is non-deterministic: on most runs only Sonnet writes ⓘ, while both Opus and Haiku write the correct ⊙.)
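Quick verification that those three characters really are the same byte length:

```python
for ch in "Ⓡⓘ⊙":
    print(f"U+{ord(ch):04X} {ch}: {len(ch.encode('utf-8'))} UTF-8 bytes")
# U+24C7 Ⓡ: 3 UTF-8 bytes
# U+24D8 ⓘ: 3 UTF-8 bytes
# U+2299 ⊙: 3 UTF-8 bytes
```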