karpathy / minbpe

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
MIT License
8.75k stars 796 forks

Token-encoded sentence vectors from minBPE need to be padded #56

Open elevateclub opened 3 months ago

elevateclub commented 3 months ago

Without padding, the encoded sentences end up with different lengths, and we get stacking errors at data loading time when they are batched together.

elevateclub commented 3 months ago

This would probably require introducing a '' special token, which might make the code feel a bit more edge-casey, digging us deeper into the ugliness that is tokenization 😢
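
For what it's worth, here is a rough sketch of the kind of padding I mean, done outside the tokenizer at batching time. `pad_batch` and `PAD_ID` are hypothetical names, not part of minbpe, and the encoding step is a stand-in that uses raw UTF-8 bytes (roughly what BPE gives you with zero merges) so the snippet runs on its own. The assumption is that the pad id is some integer the tokenizer's vocabulary never produces.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the length of the longest one.

    Returns (padded, mask), where mask holds 1 for real tokens and 0 for padding.
    """
    max_len = max((len(s) for s in seqs), default=0)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask


# Stand-in for a real tokenizer's encode(): raw UTF-8 bytes, ids 0..255.
sentences = ["hello world", "hi"]
encoded = [list(s.encode("utf-8")) for s in sentences]

# Hypothetical pad id: must not collide with any id the tokenizer can emit.
# 256 is only safe here because the stand-in encoder never goes above 255.
PAD_ID = 256

padded, mask = pad_batch(encoded, PAD_ID)
# Every row in `padded` now has the same length, so the batch can be stacked.
```

With a trained tokenizer the pad id would have to sit above the full vocabulary (base bytes plus merges plus any other special tokens), and the mask is there so the loss and attention can ignore the padded positions.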