karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License

Would using prompts that contain concatenated words to reduce token count negatively affect results? #61

Open hatgit opened 7 months ago

hatgit commented 7 months ago

Background: I developed a rudimentary way to reduce the token count of long prompts by concatenating words of a certain length. This has the potential to cut API token costs by a few percent, which can be significant for companies with high API costs from prompt token usage (their completion token costs are unaffected and can remain constant).

Question for tokenizing: I am wondering whether this approach has any negative effect. So far output seems unaffected, with completions returning normally.

See this thread with some of the pros/cons: https://community.openai.com/t/removing-spaces-from-prompts-to-maximize-character-limits-i-e-in-gpt-config/684125
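
To make the question concrete, here is a minimal sketch of how the two token counts could be compared. It is not the code I use and not minbpe's API: `concat_short_words` and its `max_len` rule are hypothetical stand-ins for "concatenating words of a certain length", and the tokenizer is a toy byte-level BPE trained on a made-up corpus, so the numbers say nothing about what a production tokenizer would do.

```python
from collections import Counter

def concat_short_words(text, max_len=3):
    """Hypothetical rule: drop the space between two adjacent words
    when both are at most `max_len` characters long."""
    words = text.split()
    if not words:
        return ""
    out = [words[0]]
    prev_short = len(words[0]) <= max_len
    for w in words[1:]:
        short = len(w) <= max_len
        if prev_short and short:
            out[-1] += w          # glue onto the previous word
        else:
            out.append(w)
        prev_short = short
    return " ".join(out)

def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, idx):
    """Replace each non-overlapping occurrence of `pair` with `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, num_merges):
    """Learn up to `num_merges` byte-pair merges from `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for n in range(num_merges):
        stats = get_stats(ids)
        if not stats:             # nothing left to merge
            break
        pair = max(stats, key=stats.get)
        ids = merge(ids, pair, 256 + n)
        merges[pair] = 256 + n
    return merges

def encode(text, merges):
    """Apply the learned merges, earliest-learned first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break
        ids = merge(ids, pair, merges[pair])
    return ids

def decode(ids, merges):
    """Map token ids back to text (dict order is training order)."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), idx in merges.items():
        vocab[idx] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# Toy corpus and merge budget, chosen only for illustration.
corpus = "the cat sat on the mat and the dog sat on the log " * 20
merges = train(corpus, 50)

prompt = "the cat sat on the mat"
squeezed = concat_short_words(prompt)
print("original:", len(encode(prompt, merges)), "tokens")
print("squeezed:", len(encode(squeezed, merges)), "tokens")
```

Note that the merges here are learned from normally spaced text, so pairs that include a space may no longer apply once the spaces are removed. Whether concatenation actually lowers the count therefore depends on the tokenizer, and a real comparison should be run against the tokenizer of the model being called.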