google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.

Unigram training always crashes when making suffix array #702

Closed: MatthewBieda closed this issue 2 years ago

MatthewBieda commented 3 years ago

I am attempting to train on 16 million parallel sentences, separately for the source (English) and target (Japanese) sides, on an EC2 instance with 500 GB of RAM.

The command is spm.SentencePieceTrainer.train('--input=English_lowercased_mosespretok.txt --vocab_size=32000 --num_threads=64 --train_extremely_large_corpus=true --model_prefix=English').

I also tested with the default 16 threads and train_extremely_large_corpus=false. Training always reaches the suffix-array-building stage, then crashes shortly after it starts using multiple cores.

However, the vocab file and model file are saved regardless. Is it OK to use them in production even though the process crashes?
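For context, a minimal sanity check like the sketch below (assuming --model_prefix=English actually produced English.model) only confirms that the saved model loads and encodes; it does not guarantee that training finished cleanly before the crash.

```python
import sentencepiece as spm

# Load the model file written before the crash
# (assumes --model_prefix=English produced English.model).
sp = spm.SentencePieceProcessor(model_file="English.model")

print(sp.vocab_size())                             # should report the requested 32000
print(sp.encode("this is a test", out_type=str))   # subword pieces
print(sp.decode(sp.encode("this is a test")))      # round-trip back to text
```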

taku910 commented 2 years ago

SentencePiece training requires a large amount of heap memory because it loads all input sentences into memory to build the suffix array. Please run the command on a machine with more memory, or restrict (sample) the input data with --input_sentence_size.
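A minimal sketch of the sampled-training variant, using the same input file as above; --input_sentence_size and --shuffle_input_sentence are existing trainer flags, but the cap of 10 million sentences here is only an illustrative value:

```python
import sentencepiece as spm

# Train on a random sample of the corpus instead of all 16M sentences,
# which bounds the memory needed for the suffix array.
spm.SentencePieceTrainer.train(
    input="English_lowercased_mosespretok.txt",
    model_prefix="English",
    vocab_size=32000,
    model_type="unigram",
    input_sentence_size=10_000_000,    # cap on sentences loaded for training (example value)
    shuffle_input_sentence=True,       # sample randomly rather than taking the head of the file
    train_extremely_large_corpus=True,
    num_threads=64,
)
```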

thusinh1969 commented 1 year ago

i9 with 8 cores, 256 GB RAM, and a huge 10 TB disk. The corpus is 174M sentences, and I intend to create a 50,000-piece vocab. Training keeps crashing silently. Why? What can I do better?

Steve

thusinh1969 commented 1 year ago

Found it! For every 10M sentences you need approximately 70 GB of RAM, that is all. I can train on 36M sentences for a 40,000-piece vocab with 256 GB RAM using the unigram model.
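A quick back-of-the-envelope check of that rule of thumb (the ~70 GB per 10M sentences figure is this user's empirical estimate, not an official number):

```python
def estimated_ram_gb(num_sentences, gb_per_10m=70):
    """Rough RAM estimate for unigram training, based on the
    empirical ~70 GB per 10M sentences figure reported above."""
    return num_sentences / 10_000_000 * gb_per_10m

print(estimated_ram_gb(36_000_000))    # ~252 GB -> just fits in 256 GB
print(estimated_ram_gb(174_000_000))   # ~1218 GB -> far beyond 256 GB, hence the crash
```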

MatthewBieda commented 1 year ago

Glad you solved this.