google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Why is the Hugging Face encoding greater by 1 than the Google SentencePiece encoding when using the XLM-RoBERTa SentencePiece tokenizer? #1042

Closed RaoufiTech closed 3 weeks ago

RaoufiTech commented 3 weeks ago

Hi,

I have a problem with encoding using the XLM-RoBERTa SentencePiece tokenizer. Why is every Hugging Face token ID greater by 1 than the corresponding Google SentencePiece token ID?

Example

Hugging Face:

tokenizer_xlmroberta.encode("I don't understand why",add_special_tokens=False) Output: [87, 2301, 25, 18, 28219, 15400]

Sentencepiece:

tokenizer_xlmroberta.encode_as_ids("I don't understand why")
Output: [86, 2300, 24, 17, 28218, 15399]
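
For reference, here is the minimal comparison I am running. This is only a sketch: it assumes the "xlm-roberta-base" Hugging Face checkpoint and a local copy of its sentencepiece.bpe.model file, and the variable names are mine.

from transformers import AutoTokenizer
import sentencepiece as spm

# Hugging Face tokenizer for XLM-RoBERTa (checkpoint name assumed).
hf_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
# Raw SentencePiece model shipped with that checkpoint (local path assumed).
sp_tok = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")

text = "I don't understand why"
hf_ids = hf_tok.encode(text, add_special_tokens=False)
sp_ids = sp_tok.encode_as_ids(text)

print(hf_ids)                                   # [87, 2301, 25, 18, 28219, 15400]
print(sp_ids)                                   # [86, 2300, 24, 17, 28218, 15399]
print([h - s for h, s in zip(hf_ids, sp_ids)])  # every difference is 1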

taku910 commented 3 weeks ago

Questions about third-party products using sentencepiece should be directed to the developer of that product. We are not the developer, so we don't really know why.

RaoufiTech commented 3 weeks ago

Hi,

The discrepancy in token IDs between the Hugging Face tokenizer and the SentencePiece tokenizer is due to the indexing scheme. It appears that the Hugging Face tokenizer starts its token IDs from 1, while the SentencePiece tokenizer starts from 0. This results in a consistent offset of 1 between the two sets of token IDs.

To align the SentencePiece tokenizer with the Hugging Face tokenizer, you can adjust the indexing so that the SentencePiece tokenizer starts its token IDs from 1 instead of 0. This adjustment would effectively match the token IDs generated by both tokenizers without needing to add or modify special tokens.
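
For now the only way I see to align them is to shift the SentencePiece IDs after encoding. Here is a minimal sketch of that workaround, assuming the offset really is a constant +1 for ordinary tokens and reusing the sp_tok object from the sketch above:

OFFSET = 1  # constant difference observed between the two ID sets

sp_ids = sp_tok.encode_as_ids("I don't understand why")
aligned_ids = [i + OFFSET for i in sp_ids]
# aligned_ids == [87, 2301, 25, 18, 28219, 15400], matching the Hugging Face
# output for this sentence (special tokens are not handled here).

This is just post-processing on the output, though, not a change to the tokenizer itself.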

Could you provide guidance on how to modify the SentencePiece tokenizer's indexing scheme to start from 1, ensuring consistency with the Hugging Face tokenizer?

Thank you!