You have nicely demonstrated and answered your question here by training a new tokenizer on a new vocab. 👏
Train a new tokenizer using am_train and the old tokenizer object.
Hey @savanth14 you should set normalized=False when you add the token. I also recommend setting legacy=False to make sure you don't run into these issues:
from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("mistralai/mistral-7b-v0.1", legacy=False, from_slow=True)
# normalized=False keeps the normalizer from touching the added token
tokenizer.add_tokens([AddedToken("<mytoken>", normalized=False)])
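A quick round trip (the sample text here is just illustrative) should then keep the spacing intact:

text = "Hello <mytoken> world"
ids = tokenizer(text)["input_ids"]
print(tokenizer.decode(ids, skip_special_tokens=True))
# expected to print something like "Hello <mytoken> world", with the spaces preserved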
Using main with https://github.com/huggingface/transformers/pull/28881 merged will also help. The issue with spaces is basically that if you normalize the token, the normalizer is applied to the added token as well (adding a prefix space, for example), which changes how the text around it gets encoded and decoded.
FYI @itazap
@ArthurZucker @younesbelkada @Narsil @n1t0 I tried to add new vocab to the existing Mistral tokenizer vocab using the add_tokens() method. Everything went fine until I used the extended-vocab tokenizer to decode the encoded text: in the decoded text, the spaces are completely missing and all the decoded tokens are merged into a single string. Can you please help me resolve this issue? Here's the sample code:
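A minimal sketch of those steps (the second tokenizer path and the sample text below are illustrative, not the original snippet):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("mistralai/mistral-7b-v0.1")
# illustrative: any tokenizer whose vocab you want to merge in
other_tokenizer = AutoTokenizer.from_pretrained("path/to/other/tokenizer")

# add only the tokens the Mistral tokenizer does not already have
old_vocab = set(old_tokenizer.get_vocab())
new_tokens = [t for t in other_tokenizer.get_vocab() if t not in old_vocab]
old_tokenizer.add_tokens(new_tokens)

sample = "some text from the new domain"
ids = old_tokenizer(sample)["input_ids"]
print(old_tokenizer.decode(ids, skip_special_tokens=True))
# behaviour described above: spaces are missing and the tokens are merged together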
To dig further into the problem, I re-initialised the Mistral tokenizer from its original checkpoint "mistralai/mistral-7b-v0.1". Then I added 3 manually defined random tokens to the tokenizer using the same add_tokens method. When I used this extended-vocab tokenizer to encode and decode some text, it worked fine: the decoded text retained the spacing of the original random text. Here's the code for this experiment:
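A sketch of that second experiment (the three token strings here are made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/mistral-7b-v0.1")

# three manually defined, made-up tokens
tokenizer.add_tokens(["<newtok1>", "<newtok2>", "<newtok3>"])

sample = "random text with <newtok1> and <newtok2> in the middle"
ids = tokenizer(sample)["input_ids"]
print(tokenizer.decode(ids, skip_special_tokens=True))
# here the decoded text keeps its spacing, as described above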
Where is the problem? Why is the extended-vocab tokenizer unable to decode properly when the vocab comes from a different tokenizer, yet decodes properly when the new tokens are added manually?
In addition, I used the train_new_from_iterator method to train a new tokenizer based on the Mistral tokenizer. Then I used the same approach as above to extend the vocab of the old tokenizer. When I used this extended-vocab tokenizer for decoding, I observed that some spaces are missing while some of the tokens are merged. Can you please suggest how to fix this issue? A sketch of this route is below.
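For reference, a sketch of that train_new_from_iterator route (am_train stands in for whatever iterable of training texts was actually used):

from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("mistralai/mistral-7b-v0.1")

# am_train stands in for the real training corpus (an iterable of strings)
am_train = ["example sentence one", "example sentence two"]
new_tokenizer = old_tokenizer.train_new_from_iterator(am_train, vocab_size=1000)

# extend the old tokenizer with the tokens the new one learned
old_vocab = set(old_tokenizer.get_vocab())
extra_tokens = [t for t in new_tokenizer.get_vocab() if t not in old_vocab]
old_tokenizer.add_tokens(extra_tokens)

ids = old_tokenizer("a held-out sentence from the new corpus")["input_ids"]
print(old_tokenizer.decode(ids, skip_special_tokens=True))
# reported result: some spaces missing, some tokens merged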