google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Issue with XLM-RoBERTa tokenizer #858

Closed ozanarmagan closed 1 year ago

ozanarmagan commented 1 year ago

Hi,

I have a problem with encoding using the XLM-RoBERTa sentencepiece tokenizer. Why is each ID in the Hugging Face encoding 1 greater than the corresponding ID in the Google sentencepiece encoding?

Example

## Hugging Face:
tokenizer_xlmroberta.encode("I don't understand why", add_special_tokens=False)

Output: [87, 2301, 25, 18, 28219, 15400]

## Sentencepiece:
tokenizer_xlmroberta_.encode_as_ids("I don't understand why")

Output: [86, 2300, 24, 17, 28218, 15399]
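
Comparing the two outputs above, every Hugging Face ID is exactly 1 greater than the corresponding sentencepiece ID. A minimal sketch of that element-wise check, assuming the two tokenizers are already loaded under the same names used in the snippets above (tokenizer_xlmroberta from Hugging Face, tokenizer_xlmroberta_ as a raw sentencepiece processor):

```python
# Sketch: compare the two encodings position by position.
# Assumes tokenizer_xlmroberta is a Hugging Face XLM-RoBERTa tokenizer and
# tokenizer_xlmroberta_ is a sentencepiece.SentencePieceProcessor, loaded as above.
text = "I don't understand why"

hf_ids = tokenizer_xlmroberta.encode(text, add_special_tokens=False)
spm_ids = tokenizer_xlmroberta_.encode_as_ids(text)

# For this sentence the difference is a constant +1 at every position.
print([h - s for h, s in zip(hf_ids, spm_ids)])  # -> [1, 1, 1, 1, 1, 1]
```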

taku910 commented 1 year ago

Sorry, we can't accept questions about libraries that use sentencepiece, as we are not their developers; please ask the developers of XLM-RoBERTa.