google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

RuntimeError #965

Closed · fkurushin closed this 7 months ago

fkurushin commented 8 months ago
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fkurushin/entity-classification/venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 989, in Train
    SentencePieceTrainer._Train(arg=arg, **kwargs)
  File "/home/fkurushin/entity-classification/venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 982, in _Train
    return SentencePieceTrainer._TrainFromMap(new_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fkurushin/entity-classification/venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 927, in _TrainFromMap
    return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: src/trainer_interface.cc(661) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (500000). Please set it to a value <= 455361.

When I set vocab_size to 400,000, I get:

trainer_interface.cc(686) LOG(INFO) Saving model: m.model
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fkurushin/entity-classification/venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 989, in Train
    SentencePieceTrainer._Train(arg=arg, **kwargs)
  File "/home/fkurushin/entity-classification/venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 982, in _Train
    return SentencePieceTrainer._TrainFromMap(new_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fkurushin/entity-classification/venv/lib/python3.11/site-packages/sentencepiece/__init__.py", line 927, in _TrainFromMap
    return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: src/trainer_interface.cc(661) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (400000). Please set it to a value <= 334995.
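
For reference, here is a minimal sketch of the training call behind these tracebacks. Only vocab_size and the model prefix (inferred from the "Saving model: m.model" log line) come from the thread; the input file and everything else are assumptions.

```python
import sentencepiece as spm

# Hypothetical reproduction of the failing call; "corpus.txt" stands in for
# the actual training data, which is not shown in the issue.
spm.SentencePieceTrainer.Train(
    input="corpus.txt",
    model_prefix="m",    # matches the "Saving model: m.model" log line
    vocab_size=400_000,  # raises RuntimeError when the corpus yields
                         # fewer than vocab_size candidate pieces
)
```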

Can anyone explain why this limit exists?

taku910 commented 7 months ago

Dup https://github.com/google/sentencepiece/issues/954
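
As an aside not stated in this thread: the failing check at trainer_interface.cc(661) asserts that the final model contains exactly vocab_size pieces, so training aborts when the corpus can only produce fewer candidate pieces than requested. SentencePiece's trainer also exposes a hard_vocab_limit option that turns vocab_size into a soft upper bound; a sketch of that workaround, again with a hypothetical corpus file:

```python
import sentencepiece as spm

# With hard_vocab_limit=False, vocab_size is treated as a soft limit, so
# training keeps as many pieces as the corpus actually supports instead of
# aborting. "corpus.txt" is a placeholder for the real training data.
spm.SentencePieceTrainer.Train(
    input="corpus.txt",
    model_prefix="m",
    vocab_size=500_000,
    hard_vocab_limit=False,
)
```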