google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Program terminated with an unrecoverable error. #859

Closed by RuslanSel 1 year ago

RuslanSel commented 1 year ago

I have a strange problem. When running spm.SentencePieceTrainer.Train(" ".join(sys.argv[1:])) with model_type: UNIGRAM and vocab_size: 600, I get an error:

    unigram_model_trainer.cc(247) LOG(INFO) Making suffix array...
    unigram_model_trainer.cc(251) LOG(INFO) Extracting frequent sub strings... node_num=26183214
    unigram_model_trainer.cc(301) LOG(INFO) Initialized 972301 seed sentencepieces
    unigram_model_trainer.cc(150) [!std::isnan(score)]
    Program terminated with an unrecoverable error.
    Command exited with non-zero status 255

I found the line and the word in the text where the error occurs. If I change one letter in this word to any other letter, there is no error:

    unigram_model_trainer.cc(247) LOG(INFO) Making suffix array...
    unigram_model_trainer.cc(251) LOG(INFO) Extracting frequent sub strings... node_num=26183216
    unigram_model_trainer.cc(301) LOG(INFO) Initialized 969740 seed sentencepieces
    trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 443829
    trainer_interface.cc(608) LOG(INFO) Done! 290894

What is the reason for this behavior? Thank you in advance.

taku910 commented 1 year ago

Could you try v0.1.99? This issue is already fixed. See https://github.com/google/sentencepiece/issues/851
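
For readers hitting the same error, here is a minimal sketch of checking whether the installed sentencepiece package is at least the release named above before retraining. It is not part of the original reply. The helper names (`parse_version`, `has_fix`, `installed_sentencepiece_version`) are hypothetical, and the sketch assumes the package was installed under the distribution name `sentencepiece`. Only the standard library is used, so it runs even when sentencepiece is absent.

```python
from importlib import metadata

# Release named in the reply above as containing the fix.
FIXED_VERSION = (0, 1, 99)


def parse_version(text):
    """Turn a string such as '0.1.99' into a tuple of ints, e.g. (0, 1, 99).

    Only the leading digits of each dot-separated part are used, so a
    suffix such as 'rc1' is ignored. Parsing stops at the first part
    that has no leading digits.
    """
    parts = []
    for chunk in text.split("."):
        digits = ""
        for ch in chunk:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_fix(installed):
    """Return True if the version string is at or above FIXED_VERSION."""
    return parse_version(installed) >= FIXED_VERSION


def installed_sentencepiece_version():
    """Return the installed version string, or None if it is not installed.

    Assumption: the distribution name is 'sentencepiece'.
    """
    try:
        return metadata.version("sentencepiece")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_sentencepiece_version()
    if version is None:
        print("sentencepiece is not installed")
    elif has_fix(version):
        print(f"sentencepiece {version} is at or above 0.1.99")
    else:
        print(f"sentencepiece {version} is older than 0.1.99; consider upgrading")
```

If the check reports an older version, upgrading the package and rerunning the same training command is the step suggested in this thread.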

RuslanSel commented 1 year ago

Thanks.