SentencePiece training requires a large amount of heap memory because it loads all input sentences to build a suffix array. Please run the command on a machine with more memory, or restrict (sample) the input data with --input_sentence_size.
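If it helps, a minimal sketch of that sampling approach with the Python API might look like the following (the input path, output prefix, vocab size and sample cap are placeholders, not values from this thread):

```python
import sentencepiece as spm

# Load only a random sample of the corpus rather than every sentence, so the
# suffix-array construction stays within available RAM.
spm.SentencePieceTrainer.train(
    input="corpus.txt",                # hypothetical input file
    model_prefix="spm_unigram",        # hypothetical output prefix
    vocab_size=50000,
    input_sentence_size=10_000_000,    # cap on sentences loaded for training
    shuffle_input_sentence=True,       # sample randomly instead of taking the head
)
```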
i9 with 8 cores, 256GB RAM, huge 10TB disk. Corpus of 174M sentences, intending to create a 50,000 vocab. It keeps crashing silently. Why? What can I do better?
Steve
Found it! For every 10M sentences you need approximately 70GB of RAM, that's it. I can train 36M sentences for a 40,000 vocab with 256GB RAM using the unigram model.
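Taking that rule of thumb at face value (it's an empirical observation from this thread, not an official figure), the original 174M-sentence corpus would need on the order of a terabyte of RAM, which would explain the silent crashes at 256GB:

```python
# Back-of-the-envelope estimate using the ~70GB-per-10M-sentences observation above.
sentences = 174_000_000
estimated_ram_gb = sentences / 10_000_000 * 70
print(f"~{estimated_ram_gb:.0f} GB needed")   # ~1218 GB, far beyond 256 GB
```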
Glad you solved this.
Attempting to train on 16 million parallel sentences separately by source (English) and target (Japanese) on an EC2 instance with 500GB RAM.
The command is spm.SentencePieceTrainer.train('--input=English_lowercased_mosespretok.txt --vocab_size=32000 --num_threads=64 --train_extremely_large_corpus=true --model_prefix=English').
Also tested with the default 16 threads and train_extremely_large_corpus=false; training always gets to building the suffix array, then crashes after it starts using multiple cores.
However, the vocab file and model file are saved regardless. Is it OK to use them in production even though the process never finishes cleanly?
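A minimal sanity check of the saved files might look like the sketch below; note this only confirms that the model loads and tokenizes, it does not prove that training ran to completion or that the vocab matches what a full run would have produced:

```python
import sentencepiece as spm

# Load the saved model and tokenize a sample line (lowercased, to match the
# preprocessing implied by the training file name).
sp = spm.SentencePieceProcessor(model_file="English.model")
print(sp.get_piece_size())   # expect the requested 32000
print(sp.encode("this is a quick test sentence", out_type=str))
```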