I hope this message finds you well. I've been following the Hugging Face tooling for LLMs for some time now, and I really enjoy the open community offered by Hugging Face and its users. I'd like to suggest and champion an effort to set up some basic fuzz-testing and combine it with google/oss-fuzz for continuous fuzzing. I'm fully aware that you are very busy people, and I don't want to overload your review/maintenance capacity by introducing too many new ideas. Is this a bad time to discuss potential security/reliability improvements?
If you're not familiar with fuzzing or google/oss-fuzz, I've included a few brief notes below.
Benefits of Fuzz-Testing
Dynamic Code Testing: Fuzz-testing feeds a program large volumes of unexpected or malformed input, aiming to trigger crashes and expose vulnerabilities. It’s akin to an automated stress-test for the code.
Detecting Hidden Vulnerabilities: It can uncover potential weaknesses that may not be evident in routine tests.
Continuous and Automated Testing: With tools like Google’s OSS-Fuzz, fuzz-testing can be automated, running continuously on distributed systems, ensuring daily resilience checks.
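To make "unexpected data" a bit more concrete, here is a toy sketch of the idea in plain Python. It is only an illustration: `toy_tokenize` and `fuzz` are hypothetical names I made up for this note, not part of tokenizers or OSS-Fuzz, and the bug is planted on purpose. Real harnesses use coverage-guided engines such as libFuzzer rather than blind random input.

```python
import random


def toy_tokenize(text: str) -> list:
    """Hypothetical stand-in target: split on spaces and capitalize each token."""
    tokens = text.split(" ")
    # Planted bug: assumes every token is non-empty, which fails on empty
    # input and on leading, trailing, or repeated spaces.
    return [tok[0].upper() + tok[1:] for tok in tokens]


def fuzz(target, iterations: int = 1000, seed: int = 0):
    """Call `target` with short random strings; return the first crashing input."""
    rng = random.Random(seed)  # fixed seed so a crash can be reproduced
    alphabet = "ab \n"
    for _ in range(iterations):
        length = rng.randint(0, 8)
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            target(candidate)
        except Exception as exc:  # any uncaught exception counts as a finding
            return candidate, exc
    return None


crash = fuzz(toy_tokenize)
if crash is not None:
    bad_input, error = crash
    print(f"crashing input: {bad_input!r} ({type(error).__name__})")
```

A hand-written test such as `toy_tokenize("ab a")` passes, while the random loop quickly stumbles on an input the author did not think of. That gap is what fuzzing is meant to close.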
Google/oss-fuzz for Continuous Fuzzing
Automated Fuzzing: OSS-Fuzz runs a project's fuzzers daily on a distributed cluster, with no manual effort once the integration is in place.
Security Boost: The service is free of cost for accepted open source projects, thanks to Google’s backing.
Detailed Reporting: When a fuzzer finds a crash, OSS-Fuzz files a report that includes the stack trace and an input that reproduces it, enabling effective action.
I’d be more than happy to lead the effort of integrating fuzz testing into huggingface/tokenizers and to assist in any way required.
As a proof of concept, I created a fuzz harness for the BPE tokenizer in #1396.