huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

fuzz: Add a BPE training fuzzer #1396

Closed by silvergasp 6 months ago

silvergasp commented 7 months ago

See #1397 for context. Also note that I've already found a couple of non-critical bugs with the fuzzers. I'll open some issues with further details shortly :)