Closed: scruel closed this 6 months ago
Hey, I took a peek under the hood and it looks like setting `add_prefix_space` only changes `kwargs["from_slow"] = True` (in `tokenization_llama_fast.py`). The `super().__init__()` method should receive this parameter if it is set.
Passing this in seems to work in preliminary tests
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_prefix_space=False)
>>> tokenizer.tokenize('overheard')
['over', 'he', 'ard']
Mind if I take this up @ArthurZucker & @scruel?
Edit: For completeness, showing that behavior is unchanged when add_prefix_space=True
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", add_prefix_space=True)
>>> tokenizer.tokenize('overheard')
['▁over', 'he', 'ard']
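For intuition, the prefix-space behavior comes down to SentencePiece-style metaspace pre-tokenization; here is a minimal pure-Python sketch of the idea (not the actual transformers code):

```python
SPIECE_UNDERLINE = "\u2581"  # the ▁ marker SentencePiece uses for spaces

def metaspace(text: str, add_prefix_space: bool = True) -> str:
    # Replace spaces with the metaspace marker and, when requested,
    # prepend one so the first word tokenizes like a mid-sentence word.
    out = text.replace(" ", SPIECE_UNDERLINE)
    if add_prefix_space:
        out = SPIECE_UNDERLINE + out
    return out

print(metaspace("overheard"))         # ▁overheard
print(metaspace("overheard", False))  # overheard
```

With the marker prepended, the model sees "▁overheard" and splits it as ['▁over', 'he', 'ard']; without it, it sees the bare string and can segment differently, as in the example above.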
You can always take it up by creating a PR.
Thank you, made a pull request. This was happening in T5TokenizerFast
as well.
Thanks I'll review asap!
closing as #28881 fixed it!
@ArthurZucker are you sure this is fixed? I am still experiencing this in 4.41.0:
I can also still not see it being used here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/tokenization_llama_fast.py#L153
You need to set `from_slow=True` to trigger the conversion. It is used in `convert_slow` 😉
This is very confusing and not transparent to the user at all. If I just use the `AutoTokenizer` class with default settings, I would expect this to work and not silently do nothing. It should at least give a warning, or rather set `from_slow` automatically in that case.
I agree with you, on main there is this:
if add_prefix_space is not None:
    logger.warning_once(
        "You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers"
    )
    kwargs["from_slow"] = True
which should give you a warning and automatically convert it.
But it does not seem to be taken into account. @itazap would be nice if you could investigate and open a PR to make sure it forces `from_slow`:
In [1]: from transformers import AutoTokenizer
In [2]: tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf",add_prefix_space=False)
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
In [3]: tokenizer.encode("Hey")
Out[3]: [1, 18637]
In [4]: tokenizer.tokenize("Hey")
Out[4]: ['▁Hey']
In [5]: tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf",add_prefix_space=False, from_slow=True)
In [6]: tokenizer.tokenize("Hey")
Out[6]: ['H', 'ey']
In [7]: tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf",add_prefix_space=False)
In [8]: tokenizer.tokenize("Hey")
Out[8]: ['▁Hey']
^^ Thanks
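For reference, the guard quoted above can be sketched in isolation (a minimal stub mirroring the snippet, not the actual transformers code; `build_kwargs` is a hypothetical name):

```python
import logging

logger = logging.getLogger(__name__)

def build_kwargs(add_prefix_space=None, **kwargs):
    # Mirror of the guard on main: an explicit add_prefix_space should
    # warn and force conversion from the slow tokenizer.
    if add_prefix_space is not None:
        logger.warning(
            "You set `add_prefix_space`. The tokenizer needs to be "
            "converted from the slow tokenizers"
        )
        kwargs["from_slow"] = True
        kwargs["add_prefix_space"] = add_prefix_space
    return kwargs

print(build_kwargs(add_prefix_space=False))
# {'from_slow': True, 'add_prefix_space': False}
```

The bug reported above is that, in released versions, the warning fires but the forced conversion never actually takes effect.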
Another thing I noticed: if I specify `from_slow` in `tokenizer_config.json`, it is ignored. Is this expected behavior?
I think it should be taken into account!
Apparently not, need to manually set it.
When I manually add it to tokenizer_config.json on main it works
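For anyone following along, a `tokenizer_config.json` carrying the flag would look roughly like this (an assumed minimal fragment; real configs contain many more keys):

```json
{
  "tokenizer_class": "LlamaTokenizer",
  "from_slow": true,
  "add_prefix_space": false
}
```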
I am still struggling to understand how exactly this works with a combination of all the different settings for the tokenizer. I believe a tutorial / docs description would be very helpful here.
For example, this breaks:
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-base",
    add_prefix_space=False,
)
I assume it's because it forces `from_slow=True` but cannot find any SentencePiece model file. But why can't I use `add_prefix_space` with just a fast tokenizer?
@ArthurZucker
This will be supported very soon, @itazap is working on making all of this a lot simpler and clearer! And yes, that is what's happening, but it should not; we should not need the sentencepiece dependency to update the prefix space. That's a mistake on my part, sorry about it 😢
Looking forward to it :) @itazap
Actually, it would be great if one could pass `add_prefix_space` to the tokenize function itself, instead of needing to pass it when creating the tokenizer. Currently it is really inflexible if one wants to tokenize separate parts and then concatenate them afterwards without such a functionality.
Please tag me on any prs, happy to give feedback on this in general.
I kind of agree with you and will see what I can do on the tokenizers side, but this might need a lot of changes (supporting a new argument). It can already be done today, but only via the attribute, which needs to be set before each call — not super optimal.
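Until that lands, one stop-gap is a small per-call wrapper (a hypothetical helper, not a transformers API) that strips the metaspace marker from the first token. Note this is only an approximation: without the prefix space the model may segment the word differently (the thread's own example yields ['over', 'he', 'ard'] rather than a stripped '▁over'):

```python
SPIECE_UNDERLINE = "\u2581"

def tokenize_part(tokenizer, text, add_prefix_space=True):
    # Hypothetical per-call helper: tokenize normally, then drop the
    # leading metaspace marker when no prefix space is wanted.
    tokens = tokenizer.tokenize(text)
    if not add_prefix_space and tokens and tokens[0].startswith(SPIECE_UNDERLINE):
        first = tokens[0][len(SPIECE_UNDERLINE):]
        tokens = ([first] if first else []) + tokens[1:]
    return tokens
```

This at least lets separate parts be tokenized and concatenated without the spurious leading space on each part.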
System Info

transformers version: 4.38.2
With sentencepiece==0.2.0 and protobuf==4.25.3 installed

Who can help?

@ArthurZucker

Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction

Also tried add_dummy_prefix_space=False; the output is still the same.

Expected behavior

The tokenize result should not add a prefix space (SPIECE_UNDERLINE).