Closed komninoschatzipapas closed 2 months ago
Hey! Thanks, a fix can be derived from #1357 and https://github.com/huggingface/transformers/pull/26678.
Everything you describe is mentioned there. TL;DR: use Metaspace with prepend_scheme="first" and no normalizer, and that will be the end of your problems.
I have not had time to change the default Llama fast tokenizer yet; I will try to do so ASAP.
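To make the prepend_scheme advice concrete, here is a minimal pure-Python sketch of what that option controls. This is a simplified model written for this thread, not the real implementation (which lives in the Rust tokenizers library): "always" mimics the legacy behavior of prepending the metaspace marker to every text segment, which surfaces as an extra space after special tokens, while "first" only prepends it to the very first segment.

```python
def metaspace(segments, prepend_scheme="always"):
    """Simplified model of the Metaspace pre-tokenizer.

    `segments` are the pieces of text between special tokens.
    'always' (legacy behavior) prepends '▁' to every segment, which
    shows up as an extra space after special tokens like <unk>;
    'first' only prepends '▁' to the very first segment.
    """
    out = []
    for i, seg in enumerate(segments):
        piece = seg.replace(" ", "▁")  # spaces become the metaspace marker
        if prepend_scheme == "always" or (prepend_scheme == "first" and i == 0):
            piece = "▁" + piece
        out.append(piece)
    return out

# "Hello" <unk> "world": legacy injects a spurious '▁' after <unk>
print(metaspace(["Hello", "world"], "always"))  # ['▁Hello', '▁world']
print(metaspace(["Hello", "world"], "first"))   # ['▁Hello', 'world']
```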
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I think this is still relevant
This was fixed in transformers; you need to set legacy=False.
🤗
I have downloaded the Mistral 7B tokenizer locally and tried to compare different combinations of the legacy and use_fast options:
Which yields:
You can find the full code here.
There seem to be inconsistencies in how legacy=False, use_fast=False tokenizes input compared to the other options. If either option is set to True, an extra space is added after tokens like <unk> or other special tokens. It seems to me that only legacy=False, use_fast=False tokenizes this input correctly.
We have a production app that extends Mistral with other special tokens besides <unk>, and extra spaces are added after those too. So right now, we have switched over to legacy=False, use_fast=False, and are not getting any of the speed advantages of the Rust implementation.
Would appreciate any insight into what we are missing! And thank you for the enormous amount of work you guys have put into this library 🙏
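The four-way comparison described above can be sketched as a small harness. Here load_fn is a hypothetical loader introduced for illustration (in practice it could wrap AutoTokenizer.from_pretrained, which accepts both legacy and use_fast keyword arguments); the original post's full script is linked rather than shown, so this is an assumption about its shape, not a copy of it.

```python
from itertools import product

def compare_tokenizations(load_fn, text):
    """Tokenize `text` under all four (legacy, use_fast) combinations.

    load_fn(legacy, use_fast) is a hypothetical callable that returns a
    tokenizer exposing .tokenize(text); returns a dict mapping each
    (legacy, use_fast) pair to the resulting token list, so the four
    outputs can be compared side by side.
    """
    results = {}
    for legacy, use_fast in product([False, True], repeat=2):
        tokenizer = load_fn(legacy, use_fast)
        results[(legacy, use_fast)] = tokenizer.tokenize(text)
    return results
```

With a real checkpoint one could pass, e.g., load_fn=lambda legacy, use_fast: AutoTokenizer.from_pretrained(path, legacy=legacy, use_fast=use_fast) and diff the four token lists around a special token such as <unk>.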