microsoft / BlingFire

A lightning fast Finite State machine and Regular expression manipulation library.
MIT License

Roberta tokenizer - first word in sentence doesn't match huggingface tokenizer #113

Open tomateb opened 3 years ago

tomateb commented 3 years ago

In the original RoBERTa tokenizer, words are treated differently if they appear at the beginning of a sentence, i.e. if they don't have a space before them.

For example, the following code:

import os

import blingfire
from transformers import RobertaTokenizer

tok_hugging_face = RobertaTokenizer.from_pretrained('roberta-base')
tok_blingfire = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "roberta.bin"))

sentence = "test"
print(f'Sentence - {sentence}')  
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')  
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 1, 100)}')  
print()
sentence = "something test"
print(f'Sentence - {sentence}')
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 2, 100)}')

Produces the following output:

Sentence - test
Hugging Face - [0, 21959, 2]
BlingFire - [1296]

Sentence - something test
Hugging Face - [0, 18891, 1296, 2]
BlingFire - [ 402 1296]

In Hugging Face, 0 and 2 are the start and end tokens, so they can be ignored. As you can see, the word "test" received the same ID in both cases in BlingFire, whereas in Hugging Face it's different.
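To make the mismatch concrete, here is a toy sketch of the convention behind it: RoBERTa's byte-level BPE encodes a preceding space into the token itself (the "Ġ" marker), so "test" and " test" map to different IDs, and a tokenizer that always prepends a dummy space gives the first word the " test" ID. The IDs below are taken from the outputs quoted above; the split-on-spaces "tokenizer" and the tiny vocabulary are only an illustration, not the real BPE algorithm.

```python
# Toy vocabulary built from the IDs observed in the thread above.
# "\u0120" is "Ġ", the byte-level-BPE marker for a preceding space.
TOY_VOCAB = {
    "test": 21959,             # "test" with no space before it
    "\u0120test": 1296,        # " test"
    "something": 18891,        # "something" with no space before it
    "\u0120something": 402,    # " something"
}

def toy_tokenize(text, add_prefix_space=False):
    """Whitespace-split sketch of the prefix-space behaviour.

    Every word after the first gets the "Ġ" space marker. With
    add_prefix_space=True the first word is also treated as if a space
    preceded it -- the behaviour BlingFire shows by default here.
    """
    ids = []
    for i, word in enumerate(text.split(" ")):
        if i > 0 or add_prefix_space:
            word = "\u0120" + word
        ids.append(TOY_VOCAB[word])
    return ids
```

With this sketch, `toy_tokenize("something test")` reproduces the Hugging Face IDs (minus the 0/2 special tokens), while `toy_tokenize("something test", add_prefix_space=True)` reproduces the BlingFire IDs, which is exactly the discrepancy reported.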

SergeiAlonichau commented 3 years ago

I think it has to do with the add_prefix_space=True / False parameter that Hugging Face has; probably the default behaviour is different. Could you please try adding a blingfire.change_settings_dummy_prefix(h, False) call after you have loaded the model, as shown here:

https://github.com/microsoft/BlingFire/issues/82#issuecomment-834665049
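Putting the suggestion together with the reporter's snippet, the workaround might look like the sketch below. It assumes blingfire and its bundled roberta.bin are installed locally, and uses the change_settings_dummy_prefix call named in the comment above; the helper function name is my own.

```python
import os

def load_roberta_without_dummy_prefix():
    """Load BlingFire's RoBERTa model with the implicit leading space disabled.

    Sketch only: assumes the blingfire package and its bundled roberta.bin
    are available. Disabling the dummy prefix is the workaround suggested
    above so that the first word of a sentence gets the same ID that
    Hugging Face produces by default.
    """
    import blingfire  # imported lazily so the sketch stands alone

    h = blingfire.load_model(
        os.path.join(os.path.dirname(blingfire.__file__), "roberta.bin"))
    # The call suggested in the comment above: turn off the dummy prefix space.
    blingfire.change_settings_dummy_prefix(h, False)
    return h
```

After loading the model this way, blingfire.text_to_ids(h, "test", 1, 100) would be expected to return the same ID Hugging Face gives for a sentence-initial "test".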