karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License
9.2k stars 866 forks

updating stats across merge to reduce computation #88

Open imdaredevil opened 2 months ago

imdaredevil commented 2 months ago

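Instead of recomputing the stats from scratch for every merge, we can compute them once and update them incrementally during each merge. This reduces computation, because we only touch the entries of the stats dictionary for pairs that are affected by the merge, i.e. the merged pair itself and the pairs formed with its immediate left and right neighbors.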
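A minimal sketch of the idea, not the actual code in this PR. The names `merge_and_update_stats` and `_bump` are hypothetical, and `get_stats` is re-implemented here only so the snippet is self-contained. It assumes `stats` maps each adjacent pair of token ids to its count in `ids`:

```python
def get_stats(ids):
    """Count adjacent pairs from scratch (baseline, used here for checking)."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def _bump(stats, pair, delta):
    """Hypothetical helper: adjust a pair count in place, dropping zero entries."""
    new = stats.get(pair, 0) + delta
    if new:
        stats[pair] = new
    else:
        stats.pop(pair, None)


def merge_and_update_stats(ids, pair, idx, stats):
    """Replace each occurrence of `pair` in `ids` with `idx`, updating `stats`
    in place only for the pairs touched by each replacement."""
    a, b = pair
    new_ids = []
    i, n = 0, len(ids)
    while i < n:
        if i + 1 < n and ids[i] == a and ids[i + 1] == b:
            # The merged pair itself disappears.
            _bump(stats, (a, b), -1)
            # Left neighbor: (left, a) becomes (left, idx).
            if new_ids:
                _bump(stats, (new_ids[-1], a), -1)
                _bump(stats, (new_ids[-1], idx), +1)
            # Right neighbor: (b, right) becomes (idx, right).
            if i + 2 < n:
                _bump(stats, (b, ids[i + 2]), -1)
                _bump(stats, (idx, ids[i + 2]), +1)
            new_ids.append(idx)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids


ids = list(b"aaabdaaabac")
stats = get_stats(ids)            # computed once
ids = merge_and_update_stats(ids, (97, 97), 256, stats)
assert stats == get_stats(ids)    # incremental result matches a full recount
```

Note that this sketch still scans `ids` linearly on every merge; it only avoids rebuilding the whole dictionary. Avoiding the scan as well would need an extra index from pairs to positions, which is outside what is described above.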