Lewis-Lu closed this issue 7 months ago
This is expected: spm_encode is a convenience command-line wrapper around the C++ API and assumes one input per line, with "\n" as the delimiter. To encode text that contains newlines, use the C++ API directly.
https://github.com/google/sentencepiece/blob/master/doc/api.md#tokenize-text-preprocessing
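To illustrate the difference between one-input-per-line processing and encoding a whole string in one call, here is a minimal sketch. The helper names `encode_per_line`, `encode_whole`, and `toy_encode` are hypothetical and only for illustration; `toy_encode` is a stand-in that maps each character to its code point and is not how SentencePiece tokenizes.

```python
from typing import Callable, List

Encoder = Callable[[str], List[int]]


def encode_per_line(encode: Encoder, text: str) -> List[List[int]]:
    # Mimics a line-oriented tool: every line is a separate input,
    # so the newline delimiter itself is never passed to the encoder.
    return [encode(line) for line in text.split("\n")]


def encode_whole(encode: Encoder, text: str) -> List[int]:
    # Passes the full string in one call, so embedded newlines reach the encoder.
    return encode(text)


def toy_encode(s: str) -> List[int]:
    # Hypothetical stand-in encoder: one ID per character (its code point).
    return [ord(ch) for ch in s]


text = "a\nb"
print(encode_per_line(toy_encode, text))  # [[97], [98]]
print(encode_whole(toy_encode, text))     # [97, 10, 98]

# With the real library, assuming a local model file named tokenizer.model,
# the encoder could be supplied like this:
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
#   ids = encode_whole(lambda s: sp.encode(s, out_type=int), text)
```

In the per-line variant the code point 10 (line feed) never appears in the output, which is the behavior described above for line-delimited input.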
Hi,
I am using sentencepiece==0.1.99 and found that the encoding of the newline character '\n' is inconsistent between the Python and C++ APIs, even with the same trained .model. Could you please look into it?
No extra options were added. The results are as follows:
Python:
$ tokenizer.tokenize("\n")
[29871, 13]
$ tokenizer.tokenize("\nHow many cars are in the picture, except the ego car?\nAnswer the question with precise number. ASSISTANT:")
[29871, 13, 5328, 1784, 18647, 526, 297, 278, 7623, 29892, 5174, 278, 321, 1484, 1559, 29973, 13, 22550, 278, 1139, 411, 18378, 1353, 29889, 319, 1799, 9047, 13566, 29901]
CPP:
$ echo "\n" | spm_encode --model=tokenizer.model --output_format='id'
320 29876
$ echo "\nHow many cars are in the picture, except the ego car?\nAnswer the question with precise number. ASSISTANT:" | spm_encode --model=tokenizer.model --output_format='id'
320 29876 5328 1784 18647 526 297 278 7623 29892 5174 278 321 1484 1559 29973 29905 29876 22550 278 1139 411 18378 1353 29889 319 1799 9047 13566 29901
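One plausible reading of the CLI output, offered as an assumption rather than a confirmed diagnosis: if the shell's `echo` does not interpret backslash escapes (the default for the bash builtin), then `spm_encode` receives the two characters backslash and `n` instead of a line feed. That would be consistent with the CLI output showing two IDs (`320 29876` and `29905 29876`) in the positions where the Python output shows `13`. The sketch below only shows how the two inputs differ at the character level; the `describe` helper is hypothetical.

```python
from typing import List

literal = r"\n"  # two characters: a backslash followed by the letter n
newline = "\n"   # one character: a line feed, as in the Python string "\n"


def describe(s: str) -> List[str]:
    # List the Unicode code point of every character in the string.
    return [f"U+{ord(ch):04X}" for ch in s]


print(describe(literal))  # ['U+005C', 'U+006E']
print(describe(newline))  # ['U+000A']
```

Sending a real line feed through the shell (for example with `printf`) would not make the two results match either, since a line-oriented tool treats that character as the input delimiter.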