google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Segmentation fault (core dumped) #954

Closed ivankrylatskoe closed 7 months ago

ivankrylatskoe commented 9 months ago

Hello! I am training a tokenizer on a 100M-line text corpus.

After several minutes of training I get the following output:

trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.
trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.
trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.
trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.
trainer_interface.cc(537) LOG(INFO) all chars count=9521241575
trainer_interface.cc(548) LOG(INFO) Done: 99.995% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=2995
trainer_interface.cc(559) LOG(INFO) Final character coverage=0.99995
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 92271010 sentences.
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 92271010
trainer_interface.cc(608) LOG(INFO) Done! 40063224
Segmentation fault (core dumped)

If I reduce the corpus size to 60M lines, I get:

trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.
trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.
trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.
trainer_interface.cc(522) LOG(INFO) Found null character. The corpus must be encoded in utf-8.
trainer_interface.cc(537) LOG(INFO) all chars count=5735232645
trainer_interface.cc(548) LOG(INFO) Done: 99.995% characters are covered.
trainer_interface.cc(558) LOG(INFO) Alphabet size=2801
trainer_interface.cc(559) LOG(INFO) Final character coverage=0.99995
trainer_interface.cc(591) LOG(INFO) Done! preprocessed 55428638 sentences.
trainer_interface.cc(597) LOG(INFO) Tokenizing input sentences with whitespace: 55428638
trainer_interface.cc(608) LOG(INFO) Done! 26912306
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=97904060 min_freq=66
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=32139705 size=20 all=283455 active=17429 piece=or

(and further output without errors).

sentencepiece version: 0.1.99

UPDATE: After searching for the cause of the error, I found that the problem was not the corpus size. It was the presence of a long word in the text.

The following code gives a segmentation fault. Checked on our server and on Google Colab.

import sentencepiece as spm

# Write a single "word" of 32,769 consecutive characters (one more than 2^15).
with open('input_file.txt', 'w') as input_file:
    input_file.write('a' * 32769)

spm.SentencePieceTrainer.train(
    input='input_file.txt',
    input_format="text",
    model_type="bpe",
    model_prefix="model",
    vocab_size=1000,
    max_sentence_length=100000,
)
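A possible workaround on 0.1.99 might be to filter out such words before training. Below is a minimal sketch; the input_file.filtered.txt name and the 32,768-character cutoff are illustrative assumptions chosen to match the reproduction above, not documented limits.

# Illustrative pre-filter (not part of sentencepiece): drop whitespace-delimited
# words longer than an assumed cutoff before handing the corpus to the trainer.
MAX_WORD_LEN = 32768  # assumption based on the 32,769-character repro above

with open('input_file.txt', encoding='utf-8') as src, \
        open('input_file.filtered.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        kept = [w for w in line.split() if len(w) <= MAX_WORD_LEN]
        dst.write(' '.join(kept) + '\n')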
taku910 commented 8 months ago

This error happens because the vocab size is too high. The latest build raises the following exception.

    return _sentencepiece.SentencePieceTrainer__TrainFromMap(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Internal: src/trainer_interface.cc(660) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] Vocabulary size too high (1000). Please set it to a value <= 13.
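For reference, a sketch of the same reproduction with the vocabulary size lowered to the bound reported in the message above (13 pieces); this assumes, per that message, that 13 pieces are attainable from this one-word corpus, and the remaining parameters are unchanged from the original report.

import sentencepiece as spm

# Same single-long-word corpus as in the report above.
with open('input_file.txt', 'w') as input_file:
    input_file.write('a' * 32769)

# vocab_size reduced to the limit suggested by the new check.
spm.SentencePieceTrainer.train(
    input='input_file.txt',
    input_format="text",
    model_type="bpe",
    model_prefix="model",
    vocab_size=13,
    max_sentence_length=100000,
)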
taku910 commented 7 months ago

Fixed in v0.2.0