AeroX2 closed 10 months ago
Hi James,
I'm thinking about how to properly handle the problem, since running without your byte pair encoder results in subpar tokenization.
We should probably remove the "--bpe" flag and check at runtime whether the "(sdxl_)tokenizer/merges.txt" file exists or not.
If the file does not exist, we may display a warning like this:
"WARNING: The merges.txt file is missing from the tokenizer folder. Running without byte pair encoding results in subpar tokenization. The file can be downloaded here: https://huggingface.co/AeroX2/stable-diffusion-xl-turbo-1.0-onnxstream/blob/main/sdxl_tokenizer/merges.txt"
What do you think?
Thanks, Vito
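A rough sketch of the runtime check proposed above, written in Python only for illustration (the project itself is not Python). The function name `find_merges_file` and its `models_path` and `is_sdxl` parameters are hypothetical; only the folder names, the file name, and the warning text come from this thread.

```python
import os
import sys

# URL taken from this PR; everything else below is an illustrative assumption.
MERGES_URL = (
    "https://huggingface.co/AeroX2/stable-diffusion-xl-turbo-1.0-onnxstream"
    "/blob/main/sdxl_tokenizer/merges.txt"
)


def find_merges_file(models_path, is_sdxl):
    """Return the path to merges.txt, or None (after warning) if it is missing."""
    # "(sdxl_)tokenizer": the folder name depends on which model is in use.
    folder = "sdxl_tokenizer" if is_sdxl else "tokenizer"
    path = os.path.join(models_path, folder, "merges.txt")
    if os.path.isfile(path):
        return path
    print(
        "WARNING: The merges.txt file is missing from the tokenizer folder. "
        "Running without byte pair encoding results in subpar tokenization. "
        "The file can be downloaded here: " + MERGES_URL,
        file=sys.stderr,
    )
    return None
```

With a check like this, the caller can fall back to the old tokenization path when `None` is returned, so no "--bpe" flag is needed.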
Merged!!
Thank you, Vito
This is adding Byte Pair Encoding to fix that issue with `racoon` vs `raccoon` during tokenization.

It'll require the extra `merges.txt` file but I think it is a small price to pay for much better tokenization. I've added a version of it here: https://huggingface.co/AeroX2/stable-diffusion-xl-turbo-1.0-onnxstream/blob/main/sdxl_tokenizer/merges.txt, but it is essentially https://huggingface.co/stabilityai/sdxl-turbo/blob/main/tokenizer/merges.txt without the version line at the top of the file.

The `bpe()` algorithm is a direct translation of https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/tokenization_clip.py#L437 and could probably be improved a lot but it seems to do the job nicely
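For readers unfamiliar with the technique, here is a minimal Python sketch of greedy byte pair merging in the CLIP style. It is not the code from this PR. The helper names `load_merges` and `get_pairs`, the `has_version_line` parameter, and the toy merge rules in the usage note are assumptions made for illustration.

```python
def load_merges(lines, has_version_line=False):
    """Build a rank table from merges.txt lines; earlier lines merge first."""
    if has_version_line:
        # Skip a leading version header if the file still has one.
        lines = lines[1:]
    ranks = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            ranks[(parts[0], parts[1])] = len(ranks)
    return ranks


def get_pairs(word):
    """Return the set of adjacent symbol pairs in a list of symbols."""
    return {(a, b) for a, b in zip(word, word[1:])}


def bpe(token, ranks):
    """Repeatedly merge the lowest-ranked adjacent pair until none applies."""
    if not token:
        return []
    # CLIP-style: the last character carries an end-of-word marker.
    word = list(token[:-1]) + [token[-1] + "</w>"]
    while len(word) > 1:
        best = min(get_pairs(word), key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge rule applies any more
        first, second = best
        merged = []
        i = 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = merged
    return word
```

As a usage example with made-up merge rules (not taken from the real `merges.txt`), `load_merges(["r a", "c c", "o o", "oo n</w>"])` followed by `bpe("raccoon", ranks)` gives `["ra", "cc", "oon</w>"]`, while `bpe("racoon", ranks)` gives `["ra", "c", "oon</w>"]`. The resulting pieces would then be looked up in the vocabulary to get token ids.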