Closed: ivankrylatskoe closed this issue 8 months ago
This error happens because the requested vocab size is too high for the training data. The latest build raises the following exception:
return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: src/trainer_interface.cc(660) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (1000). Please set it to a value <= 13.
Fixed in v0.2.0
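For anyone who hits this exception, a minimal workaround sketch (the file name and parameter values below are illustrative, not taken from this issue): either lower vocab_size to something the corpus can actually support, or pass hard_vocab_limit=False so the requested size is treated as an upper bound rather than an exact target.

import sentencepiece as spm

# Sketch only: "corpus.txt" and the numbers below are placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=8000,          # choose a size the data can support
    hard_vocab_limit=False,   # treat vocab_size as a soft upper bound instead of failing
)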
Hello! I am training a tokenizer on a 100M-line text corpus.
After several minutes of training, I get the following output:
If I reduce the corpus size to 60M lines, I get:
(and further output without errors).
sentencepiece version: 0.1.99
UPDATE: After searching for the cause of the error, I found that the problem was not the corpus size. It was the presence of a long word in the text.
The following code gives a segmentation fault. Checked on our server and on Google Colab.
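The snippet itself is not reproduced above. A rough reconstruction of the kind of trigger described (a single very long word combined with a vocab_size the data cannot support) might look like the sketch below; the file name and word length are assumptions, and vocab_size=1000 is taken from the error message quoted earlier.

import sentencepiece as spm

# Illustrative only: a corpus whose content is one extremely long "word".
with open("long_word.txt", "w") as f:
    f.write("a" * 1_000_000 + "\n")

# On sentencepiece 0.1.99 this kind of input reportedly crashed with a
# segmentation fault; v0.2.0 raises the "Vocabulary size too high" error instead.
spm.SentencePieceTrainer.train(
    input="long_word.txt",
    model_prefix="repro",
    vocab_size=1000,
)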