huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Integration with google/oss-fuzz for continuous fuzzing #1397

Closed silvergasp closed 6 months ago

silvergasp commented 7 months ago

Hey Team,

I hope this message finds you well. I've been following the huggingface tooling for LLMs for some time now, and I really enjoy the open community offered by huggingface and its users. I'd like to suggest and champion an effort to set up some basic fuzz testing and combine it with google/oss-fuzz for continuous fuzzing. I'm fully aware that you are very busy people, and I don't want to overload your review/maintenance capacity by introducing too many new ideas. Is this a bad time to discuss potential security/reliability improvements?

If you're not familiar with fuzzing or google/oss-fuzz, I've included a few brief notes below.

Benefits of Fuzz-Testing

Google/oss-fuzz for Continuous Fuzzing

I’d be more than happy to lead the effort to integrate fuzz testing into huggingface/tokenizers and to assist in any way required.

As a proof of concept, I created a fuzz harness for the BPE tokenizer in #1396.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.