google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0
10.07k stars 1.16k forks

decode token one by one #1044

Closed nigelzzz closed 3 weeks ago

nigelzzz commented 3 weeks ago

Hi @taku910, based on this case https://github.com/google/sentencepiece/issues/1043 I got it, thanks. I see that other LLM applications decode tokens one by one; if I need to implement that, do you have any suggestions?

taku910 commented 3 weeks ago

Probably we could decode each id directly with id_to_piece, though the result is not always the same as that of the decode method.

>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor(model_file='test_model.model')
>>> ids = sp.encode('hello world. sentencepiece is a language independent tokenizer.')
>>> ids
[39, 88, 21, 887, 6, 331, 15, 256, 29, 25, 16, 135, 47, 11, 960, 20, 981, 109, 10, 46, 98, 25, 997, 40, 6]
>>> s = ''
>>> for id in ids:
...     s += sp.id_to_piece(id)
...     print(s.replace('▁', ' ').lstrip(' '))
... 
he
hell
hello
hello world
hello world.
hello world. sen
hello world. sent
hello world. sentence
hello world. sentencep
hello world. sentencepi
hello world. sentencepie
hello world. sentencepiece
hello world. sentencepiece is
hello world. sentencepiece is a
hello world. sentencepiece is a language
hello world. sentencepiece is a language in
hello world. sentencepiece is a language independ
hello world. sentencepiece is a language independent
hello world. sentencepiece is a language independent to
hello world. sentencepiece is a language independent tok
hello world. sentencepiece is a language independent token
hello world. sentencepiece is a language independent tokeni
hello world. sentencepiece is a language independent tokeniz
hello world. sentencepiece is a language independent tokenizer
hello world. sentencepiece is a language independent tokenizer.
>>>
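For streaming use cases like the LLM applications mentioned above, applications usually want only the newly produced text at each step rather than the whole prefix. A minimal sketch of that pattern in plain Python: it detokenizes the accumulated pieces each step and yields the delta, which handles pieces that start mid-word. The helper name stream_text is hypothetical, and the piece list stands in for successive sp.id_to_piece(id) results from a real model.

```python
def stream_text(pieces):
    """Yield only the newly produced text as each piece arrives.

    Assumes SentencePiece-style pieces, where U+2581 ('▁') marks a
    word boundary (the same convention as in the example above).
    """
    raw = ''   # concatenation of raw pieces seen so far
    prev = ''  # detokenized text already emitted
    for piece in pieces:
        raw += piece
        # Detokenize the full prefix, then emit only the new suffix.
        text = raw.replace('▁', ' ').lstrip(' ')
        yield text[len(prev):]
        prev = text


# Example with hand-written pieces (not from a real model):
for delta in stream_text(['▁he', 'll', 'o', '▁world', '.']):
    print(delta)
```

Decoding the whole prefix and diffing, rather than detokenizing each id in isolation, avoids dropping or duplicating the boundary space when a '▁'-prefixed piece arrives.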