>>> print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
<s> I foo you < br > hello world
>>> print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))
<s> I bar you
hello world
The same process above doesn't work for "mistralai/Mistral-7B-v0.3".
I'm not sure whether this is a bug or a feature: modifying the normalizer of a pretrained tokenizer sometimes works and sometimes doesn't.
For example, it works for "mistralai/Mistral-7B-v0.1" but not for "mistralai/Mistral-7B-v0.3".
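One way to narrow this down is to compare what each saved tokenizer actually serializes as its normalizer. Below is a minimal sketch, assuming both tokenizers were written out with `save_pretrained` and that the output directory contains a `tokenizer.json` with a top-level `"normalizer"` entry (the layout fast tokenizers use, as far as I know). The helper names and the two sample dicts are made up for illustration; they are not the real Mistral files, and this only inspects the config rather than fixing anything.

```python
import json
from pathlib import Path


def describe_normalizer(tokenizer_config: dict) -> str:
    """Summarize the top-level "normalizer" entry of a parsed tokenizer.json."""
    normalizer = tokenizer_config.get("normalizer")
    if normalizer is None:
        # Key missing or serialized as null: no normalizer is configured.
        return "none"
    if normalizer.get("type") == "Sequence":
        # A Sequence wraps a list of normalizers that are applied in order.
        steps = [step.get("type", "?") for step in normalizer.get("normalizers", [])]
        return "Sequence[" + ", ".join(steps) + "]"
    return normalizer.get("type", "?")


def describe_tokenizer_dir(save_dir: str) -> str:
    """Read <save_dir>/tokenizer.json (hypothetical location) and summarize it."""
    with open(Path(save_dir) / "tokenizer.json", encoding="utf-8") as f:
        return describe_normalizer(json.load(f))


# Hand-written stand-ins for two configs (assumptions, not real model files).
with_rules = {"normalizer": {"type": "Sequence", "normalizers": [
    {"type": "Replace", "pattern": {"String": "foo"}, "content": "bar"},
    {"type": "Replace", "pattern": {"String": "<br>"}, "content": "\n"},
]}}
without_rules = {"normalizer": None}

print(describe_normalizer(with_rules))     # Sequence[Replace, Replace]
print(describe_normalizer(without_rules))  # none
```

Calling `describe_tokenizer_dir` on the directories saved for v0.1 and v0.3 should show whether the modified normalizer made it to disk in each case, which at least separates "the change was not saved" from "the change was saved but is not applied".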