This cannot be stated precisely, as it depends on the data and the environment. However, swap memory is extremely slow, so relying on it is generally not recommended. The default unigram mode (--model_type=unigram) takes a long time to build the suffix array. Since training time is O(n), you can roughly estimate the total time by training on a small sample first (see the sketch below).
BPE (--model_type=bpe) may run faster, as it does not perform suffix array construction.
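For example, under the O(n) assumption you can time a run on a small slice of the corpus and extrapolate linearly. A minimal sketch (file names, the sample size, and the trimmed flag set are all illustrative):

# Take roughly 1% of the sentences and time a training run on them
head -n 28000 xxx.txt > sample.txt
time spm_train --input=sample.txt --model_prefix=sample_bpe --vocab_size=4096 --model_type=bpe
# If this takes T seconds, the full corpus should take on the order of 100 * T,
# assuming the full run also stays in physical memory (heavy swapping breaks the estimate).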
I have 31.2 GB of text data, where the length of each sentence is around 12,000. I use:
spm_train --input='xxx.txt' --model_prefix='xxxfolder' --vocab_size=4096 --character_coverage=0.995 --model_type='bpe' --add_dummy_prefix=False --unk_piece="[UNK]" --control_symbols="[PAD],[MASK]" --bos_piece="[CLS]" --eos_piece="[SEP]" --pad_piece=False --max_sentencepiece_length=16 --vocabulary_output_piece_score=False --max_sentence_length=12600 --num_sub_iterations=2 --num_threads=128 --train_extremely_large_corpus=true
trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: /data4/Data01/SNP_project/bert_store/dataset/1000_genome_SNP/BPE_tokenizer_data/base_pairs.txt
trainer_interface.cc(140) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(140) LOG(INFO) Loaded 2000000 lines
trainer_interface.cc(117) LOG(WARNING) Too many sentences are loaded! (2794221), which may slow down training.
trainer_interface.cc(119) LOG(WARNING) Consider using --input_sentence_size= and --shuffle_input_sentence=true.
trainer_interface.cc(122) LOG(WARNING) They allow to randomly sample sentences from the entire corpus.
trainer_interface.cc(385) LOG(INFO) Loaded all 2794221 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [UNK]
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [CLS]
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [SEP]
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [PAD]
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [MASK]
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=33474488718
trainer_interface.cc(477) LOG(INFO) Done: 99.9868% characters are covered.
trainer_interface.cc(487) LOG(INFO) Alphabet size=4
trainer_interface.cc(488) LOG(INFO) Final character coverage=0.999868
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 2794221 sentences.
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 2794221
trainer_interface.cc(537) LOG(INFO) Done! 2794156
It needs 1.8 TB of memory, so I added some swap space. Currently it is using 1.4 TB of physical memory and 0.4 TB of swap, and it has been running for almost 24 hours. What is the expected running time?
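As the warning in the training log suggests, peak memory can also be reduced by randomly sampling sentences instead of loading the entire corpus. A hedged variant of the command above (the sample size of 1000000 is illustrative, not a recommendation):

spm_train --input='xxx.txt' --model_prefix='xxxfolder' --vocab_size=4096 --character_coverage=0.995 --model_type='bpe' --input_sentence_size=1000000 --shuffle_input_sentence=true [remaining flags as above]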