karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License
9.08k stars 839 forks

Count only non-overlapping occurrences of a pair #72

Open Majdoddin opened 4 months ago

Majdoddin commented 4 months ago

For example, the pair 'aa' should be counted 2 times (not 4) in the text 'aaaaa', because merge() can replace it only 2 times. In other words, the counted occurrences should not overlap.
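
To make the proposal concrete, here is a minimal sketch of such a count. It is not code from the repository: the function name `count_nonoverlapping_pairs` is hypothetical, and it assumes merge() replaces pairs greedily from left to right. A pair can only overlap with itself inside a run of identical tokens, so the sketch skips an occurrence when the same pair was just counted one position earlier.

```python
from collections import Counter

def count_nonoverlapping_pairs(ids):
    """Count consecutive pairs, skipping occurrences that overlap a
    previously counted occurrence of the same pair.

    Hypothetical helper for illustration; assumes a greedy
    left-to-right merge.
    """
    counts = Counter()
    prev_pair = None       # pair seen at the previous position
    prev_counted = False   # whether that previous pair was counted
    for pair in zip(ids, ids[1:]):
        # Two adjacent positions hold the same pair only inside a run of
        # identical tokens, which is exactly where overlaps can happen.
        if pair == prev_pair and prev_counted:
            prev_counted = False  # skip this one; the next may count again
        else:
            counts[pair] += 1
            prev_counted = True
        prev_pair = pair
    return counts

print(count_nonoverlapping_pairs(list("aaaaa")))  # Counter({('a', 'a'): 2})
```

Counting every adjacent pair with `zip(ids, ids[1:])` alone would give 4 for the same input, which is the discrepancy described above.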

See the discussion in #51. sentencepiece also implements it this way, per this paper (p. 3).