google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

decode one by one can't show space #1043

Closed nigelzzz closed 3 weeks ago

nigelzzz commented 3 weeks ago

Hi, when I decode tokens one by one, the spaces are missing, but when I decode a vector of token IDs, the spaces are shown correctly.

std::string output_str;
if (sp_processor->Decode({next_token}, &output_str).ok())
{
    std::cout << output_str << std::flush;
}

The output looks like this: Dear[Name],Ihopethisemailfindsyouwell.

But when I append each token ID to a vector and then decode the whole vector once, e.g.,

(sp_processor->Decode(output_tokens, &output_text).ok());

the output is: Dear [Name], I hope this email finds you well.

taku910 commented 3 weeks ago

This is expected. SentencePiece has no way of knowing that next_token is a single word while output_tokens is a whole sentence. The whitespace between words is preserved when the tokens are decoded together as one sequence.

nigelzzz commented 3 weeks ago

Hi @taku910, I got it, thanks. I asked because I see other LLM applications decode tokens one by one. If I need to implement that, do you have any suggestions?
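
For the streaming case asked about above, one possible approach (a minimal sketch, not an official SentencePiece recipe) is to keep every ID generated so far, decode the whole sequence at each step, and print only the newly added suffix. `StreamingDecoder`, `DecodeFn`, and `ToyDecode` are hypothetical names introduced here. `ToyDecode` is a toy stand-in that imitates how the word-boundary marker "▁" becomes a space and the leading space is dropped, so the sketch runs without a model. The sketch also assumes that decoding a longer sequence only appends to the previously decoded text.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Batch decoder type: maps a full ID sequence to text.
// With SentencePiece this would wrap sp_processor->Decode(ids, &text).
using DecodeFn = std::function<std::string(const std::vector<int>&)>;

// Hypothetical helper (not part of SentencePiece): stores all IDs seen so far,
// re-decodes the whole sequence on each step, and returns only the new suffix.
class StreamingDecoder {
 public:
  explicit StreamingDecoder(DecodeFn decode) : decode_(std::move(decode)) {
    assert(decode_ && "a decode callback is required");
  }

  // Append one token ID and return the text it contributed.
  std::string Push(int token_id) {
    ids_.push_back(token_id);
    const std::string text = decode_(ids_);
    // Assumption: the new decoding extends the old one, so everything past
    // the already-emitted length is new.
    if (text.size() <= emitted_) return "";
    std::string delta = text.substr(emitted_);
    emitted_ = text.size();
    return delta;
  }

 private:
  DecodeFn decode_;
  std::vector<int> ids_;
  std::size_t emitted_ = 0;  // number of bytes already returned to the caller
};

// Toy stand-in for a real model, for illustration only: joins pieces,
// turns the marker into a space, and drops one leading space.
std::string ToyDecode(const std::vector<int>& ids) {
  const std::string marker = "\xE2\x96\x81";  // UTF-8 bytes of U+2581
  const std::vector<std::string> pieces = {
      marker + "Dear", marker + "[", "Name", "],", marker + "I", marker + "hope"};
  std::string joined;
  for (int id : ids) joined += pieces.at(static_cast<std::size_t>(id));
  std::string text;
  std::size_t pos = 0;
  while (pos < joined.size()) {
    if (joined.compare(pos, marker.size(), marker) == 0) {
      text += ' ';
      pos += marker.size();
    } else {
      text += joined[pos];
      ++pos;
    }
  }
  if (!text.empty() && text.front() == ' ') text.erase(0, 1);
  return text;
}
```

With a real model, the callback passed to `StreamingDecoder` would call `sp_processor->Decode(ids, &text)` as in the original post and return `text`, and each generated `next_token` would go through `Push` before printing. Re-decoding the whole sequence gets more expensive as the output grows, which is a trade-off made here for simplicity.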