vitoplantamura / OnnxStream

Lightweight inference library for ONNX files, written in C++. It can run Stable Diffusion XL 1.0 on a RPI Zero 2 (or in 298MB of RAM) but also Mistral 7B on desktops and servers. ARM, x86, WASM, RISC-V supported. Accelerated by XNNPACK.
https://yolo.vitoplantamura.com/

Add Byte Pair Encoding for better text tokenization #49

Closed: AeroX2 closed this 10 months ago

AeroX2 commented 10 months ago

This adds Byte Pair Encoding to fix the issue with "racoon" vs "raccoon" during tokenization.

It'll require the extra merges.txt file, but I think that is a small price to pay for much better tokenization. I've added a version of it here: https://huggingface.co/AeroX2/stable-diffusion-xl-turbo-1.0-onnxstream/blob/main/sdxl_tokenizer/merges.txt, but it is essentially https://huggingface.co/stabilityai/sdxl-turbo/blob/main/tokenizer/merges.txt without the version line at the top of the file.

The bpe() algorithm is a direct translation of https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/tokenization_clip.py#L437 and could probably be improved a lot, but it seems to do the job nicely.
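For reference, the merge loop looks roughly like this in C++ (a simplified sketch of the CLIP-style algorithm, not the exact code in the PR; identifiers are illustrative and multi-byte characters are ignored for brevity):

```cpp
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Merge ranks loaded from merges.txt: the pair on line N gets rank N,
// so pairs that appear earlier in the file are merged first.
using PairRanks = std::map<std::pair<std::string, std::string>, int>;

// Simplified sketch of the CLIP-style bpe() merge loop.
std::vector<std::string> bpe(const std::string& token, const PairRanks& ranks)
{
    // Split the token into single characters, marking the end of the word.
    std::vector<std::string> word;
    for (char c : token)
        word.emplace_back(1, c);
    if (!word.empty())
        word.back() += "</w>";

    while (word.size() > 1)
    {
        // Find the adjacent pair with the lowest (i.e. best) merge rank.
        int best_rank = std::numeric_limits<int>::max();
        size_t best_pos = 0;
        for (size_t i = 0; i + 1 < word.size(); i++)
        {
            auto it = ranks.find({ word[i], word[i + 1] });
            if (it != ranks.end() && it->second < best_rank)
            {
                best_rank = it->second;
                best_pos = i;
            }
        }
        if (best_rank == std::numeric_limits<int>::max())
            break; // no mergeable pair left

        // Merge every occurrence of the chosen pair in a single pass.
        const std::string first = word[best_pos];
        const std::string second = word[best_pos + 1];
        std::vector<std::string> merged;
        for (size_t i = 0; i < word.size();)
        {
            if (i + 1 < word.size() && word[i] == first && word[i + 1] == second)
            {
                merged.push_back(first + second);
                i += 2;
            }
            else
            {
                merged.push_back(word[i]);
                i += 1;
            }
        }
        word = std::move(merged);
    }
    return word; // the remaining pieces are the subword tokens for this word
}
```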

vitoplantamura commented 10 months ago

Hi James,

I'm thinking about how to properly handle the problem, since running without your byte pair encoder results in subpar tokenization.

We should probably remove the "--bpe" flag and check at runtime whether the "(sdxl_)tokenizer/merges.txt" file exists or not.

If the file does not exist, we may display a warning like this:

"WARNING: The merges.txt file is missing from the tokenizer folder. Running without byte pair encoding results in subpar tokenization. The file can be downloaded here: https://huggingface.co/AeroX2/stable-diffusion-xl- turbo-1.0-onnxstream/blob/main/sdxl_tokenizer/merges.txt"

what do you think about it?

Thanks, Vito

vitoplantamura commented 10 months ago

Merged!!

Thank you, Vito