google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Inconsistent result between py and cpp #992

Closed. Lewis-Lu closed this issue 7 months ago

Lewis-Lu commented 7 months ago

Hi,

I'm using sentencepiece==0.1.99. I found that the encoding of the newline character '\n' is inconsistent between the Python and C++ APIs, even with the same trained .model file. Could you please look into it? The results are as follows, with no extra options added:

Python:

$ tokenizer.tokenize("\n")
[29871, 13]

$ tokenizer.tokenize("\nHow many cars are in the picture, except the ego car?\nAnswer the question with precise number. ASSISTANT:")
[29871, 13, 5328, 1784, 18647, 526, 297, 278, 7623, 29892, 5174, 278, 321, 1484, 1559, 29973, 13, 22550, 278, 1139, 411, 18378, 1353, 29889, 319, 1799, 9047, 13566, 29901]

CPP:

$ echo "\n" | spm_encode --model=tokenizer.model --output_format='id'
320 29876

$ echo "\nHow many cars are in the picture, except the ego car?\nAnswer the question with precise number. ASSISTANT:" | spm_encode --model=tokenizer.model --output_format='id'
320 29876 5328 1784 18647 526 297 278 7623 29892 5174 278 321 1484 1559 29973 29905 29876 22550 278 1139 411 18378 1353 29889 319 1799 9047 13566 29901

taku910 commented 7 months ago

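This is expected: spm_encode is a convenience command-line wrapper around the C++ library and assumes one input per line, so "\n" is treated as the delimiter between inputs rather than as text to encode. To encode text that contains newlines, use the C++ API directly.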

https://github.com/google/sentencepiece/blob/master/doc/api.md#tokenize-text-preprocessing
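
The one-input-per-line behavior described in the reply can be illustrated without a model file. The following is a minimal sketch, not spm_encode's actual implementation: `lines_seen_by_line_tool` is a hypothetical helper that mimics any tool reading one input per line. It also assumes a bash-like shell in which `echo "\n"` writes a literal backslash followed by `n` instead of a newline character, which would be consistent with the two IDs per "\n" in the command-line output above.

```python
import io


def lines_seen_by_line_tool(stdin_text):
    """Hypothetical stand-in for a one-input-per-line CLI.

    Returns each line of the input with its trailing newline removed,
    i.e. the strings such a tool would hand to the encoder.
    """
    return [line.rstrip("\n") for line in io.StringIO(stdin_text)]


# Python call from the report: the argument is a single newline character.
python_input = "\n"

# Assumption: bash's builtin echo (without -e) emits a backslash and 'n',
# then appends its own terminating newline.
shell_bytes = "\\n" + "\n"

print(len(python_input))                     # number of characters passed in Python
print(lines_seen_by_line_tool(shell_bytes))  # what the line-based tool would encode
print(lines_seen_by_line_tool("a\nb\n"))     # a real newline splits the input in two
```

Under these assumptions, a line-oriented tool never passes a newline character to the encoder: a real newline ends the current input, and a literal backslash plus `n` is encoded as two ordinary characters. Calling the library API with the full string avoids both effects.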