Status: Open. tomateb opened this issue 3 years ago.
I think it has to do with the add_prefix_space=True / False parameter that Hugging Face has; the default behaviour is probably different. Could you please try adding a blingfire.change_settings_dummy_prefix(h, False) call right after you load the model, as shown here:
https://github.com/microsoft/BlingFire/issues/82#issuecomment-834665049
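For reference, a minimal sketch of where that call would go, assuming the BlingFire Python functions `load_model`, `change_settings_dummy_prefix`, `text_to_ids` and `free_model`. The helper name `encode_without_dummy_prefix`, the `model_path` argument and the default `unk_id` are illustrative assumptions, not taken from the linked comment:

```python
def encode_without_dummy_prefix(model_path, text, max_len=128, unk_id=3):
    """Tokenize `text` with the dummy prefix (leading space) switched off."""
    # Imported lazily so this file can be loaded without BlingFire installed.
    import blingfire

    # model_path is a hypothetical path to a tokenizer model such as roberta.bin.
    h = blingfire.load_model(model_path)
    try:
        # The suggested call: do not prepend a dummy space to the input.
        blingfire.change_settings_dummy_prefix(h, False)
        # Positional arguments: handle, text, max length, unknown-token ID
        # (3 is assumed here), and True to skip padding.
        ids = blingfire.text_to_ids(h, text, max_len, unk_id, True)
        return [int(i) for i in ids]
    finally:
        # Release the native model handle even if tokenization fails.
        blingfire.free_model(h)
```

If the guess about the default is right, calling this helper on "test" and on a sentence containing " test" should yield different IDs for the word, as Hugging Face does.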
In the original RoBERTa tokenizer, words are treated differently if they appear at the beginning of a sentence, i.e. when they don't have a space before them.
For example, the following code:
Produces the following output:
In the Hugging Face output, 0 and 2 are the start and end tokens, so they can be ignored. As you can see, the word "test" received the same ID in both cases in BlingFire, whereas in Hugging Face the two IDs differ.
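To make the difference concrete without depending on either library, here is a toy sketch of how a byte-level BPE vocabulary distinguishes "test" from " test" (the latter is conventionally written "Ġtest"). The vocabulary and IDs are made up for illustration and are not real RoBERTa IDs; `toy_encode` is a hypothetical helper, and its `add_prefix_space` flag only mimics the idea behind the Hugging Face parameter of the same name:

```python
import re

# Toy vocabulary: "Ġ" marks a token that was preceded by a space.
# These IDs are invented for this example.
TOY_VOCAB = {"test": 10, "Ġtest": 11, "a": 12, "Ġa": 13}


def toy_encode(text, add_prefix_space=False):
    """Map each word to a toy ID, keeping track of the space before it."""
    # With the prefix enabled, the first word is treated like any other
    # word in the middle of a sentence.
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    # Each piece is a word together with the single space before it, if any.
    pieces = re.findall(r" ?\S+", text)
    return [TOY_VOCAB[piece.replace(" ", "Ġ")] for piece in pieces]


print(toy_encode("test"))          # sentence-initial "test"
print(toy_encode("a test"))        # "test" after a space
print(toy_encode("test", True))    # prefix space added
```

Without the prefix, "test" gets one ID at the start of a sentence and another after a space; with the prefix, both positions map to the same ID, which matches the BlingFire behaviour described above.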