microsoft / BlingFire

A lightning fast Finite State machine and Regular expression manipulation library.
MIT License

Roberta tokenizer - first word in sentence doesn't match huggingface tokenizer #113

Open tomateb opened 3 years ago

tomateb commented 3 years ago

In the original RoBERTa tokenizer, words are treated differently if they appear at the beginning of a sentence, i.e. if they don't have a space before them.

For example, the following code:

import os

import blingfire
from transformers import RobertaTokenizer

tok_hugging_face = RobertaTokenizer.from_pretrained('roberta-base')
tok_blingfire = blingfire.load_model(os.path.join(os.path.dirname(blingfire.__file__), "roberta.bin"))

sentence = "test"
print(f'Sentence - {sentence}')  
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')  
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 1, 100)}')  
print()
sentence = "something test"
print(f'Sentence - {sentence}')
print(f'Hugging Face - {tok_hugging_face(sentence)["input_ids"]}')
print(f'BlingFire - {blingfire.text_to_ids(tok_blingfire, sentence, 2, 100)}')

Produces the following output:

Sentence - test
Hugging Face - [0, 21959, 2]
BlingFire - [1296]

Sentence - something test
Hugging Face - [0, 18891, 1296, 2]
BlingFire - [ 402 1296]

In Hugging Face, 0 and 2 are the start and end tokens, so they can be ignored. As you can see, the word "test" received the same ID in both cases in BlingFire, whereas in Hugging Face it's different.
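To make the mismatch concrete, here is a toy sketch of the convention behind it: RoBERTa's byte-level BPE encodes a preceding space into the token itself (the "Ġ" marker), so "test" and " test" map to different IDs, and a tokenizer that always prepends a dummy space gives the first word the " test" ID. The IDs below are taken from the outputs quoted above; the split-on-spaces "tokenizer" and the tiny vocabulary are only an illustration, not the real BPE algorithm.

```python
# Toy vocabulary built from the IDs observed in the thread above.
# "\u0120" is "Ġ", the byte-level-BPE marker for a preceding space.
TOY_VOCAB = {
    "test": 21959,             # "test" with no space before it
    "\u0120test": 1296,        # " test"
    "something": 18891,        # "something" with no space before it
    "\u0120something": 402,    # " something"
}

def toy_tokenize(text, add_prefix_space=False):
    """Whitespace-split sketch of the prefix-space behaviour.

    Every word after the first gets the "Ġ" space marker. With
    add_prefix_space=True the first word is also treated as if a space
    preceded it -- the behaviour BlingFire shows by default here.
    """
    ids = []
    for i, word in enumerate(text.split(" ")):
        if i > 0 or add_prefix_space:
            word = "\u0120" + word
        ids.append(TOY_VOCAB[word])
    return ids
```

With this sketch, `toy_tokenize("something test")` reproduces the Hugging Face IDs (minus the 0/2 special tokens), while `toy_tokenize("something test", add_prefix_space=True)` reproduces the BlingFire IDs, which is exactly the discrepancy reported.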

SergeiAlonichau commented 3 years ago

I think it has to do with the add_prefix_space=True / False parameter that Hugging Face has; probably the default behaviour is different. Could you please try adding a blingfire.change_settings_dummy_prefix(h, False) call after you have loaded the model, as shown here:

https://github.com/microsoft/BlingFire/issues/82#issuecomment-834665049
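Putting the suggestion together with the reporter's snippet, the workaround might look like the sketch below. It assumes blingfire and its bundled roberta.bin are installed locally, and uses the change_settings_dummy_prefix call named in the comment above; the helper function name is my own.

```python
import os

def load_roberta_without_dummy_prefix():
    """Load BlingFire's RoBERTa model with the implicit leading space disabled.

    Sketch only: assumes the blingfire package and its bundled roberta.bin
    are available. Disabling the dummy prefix is the workaround suggested
    above so that the first word of a sentence gets the same ID that
    Hugging Face produces by default.
    """
    import blingfire  # imported lazily so the sketch stands alone

    h = blingfire.load_model(
        os.path.join(os.path.dirname(blingfire.__file__), "roberta.bin"))
    # The call suggested in the comment above: turn off the dummy prefix space.
    blingfire.change_settings_dummy_prefix(h, False)
    return h
```

After loading the model this way, blingfire.text_to_ids(h, "test", 1, 100) would be expected to return the same ID Hugging Face gives for a sentence-initial "test".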