google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

How long does it take to train 31.2GB text data? #1021

Closed: Mintchocolater closed this issue 3 months ago

Mintchocolater commented 3 months ago

I have 31.2GB of text data, and each sentence is around 12,000 characters long. I use

spm_train --input='xxx.txt' --model_prefix='xxxfolder' --vocab_size=4096 --character_coverage=0.995 --model_type='bpe' --add_dummy_prefix=False --unk_piece="[UNK]" --control_symbols="[PAD],[MASK]" --bos_piece="[CLS]" --eos_piece="[SEP]" --pad_piece=False --max_sentencepiece_length=16 --vocabulary_output_piece_score=False --max_sentence_length=12600 --num_sub_iterations=2 --num_threads=128 --train_extremely_large_corpus=true

trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator.
trainer_interface.cc(178) LOG(INFO) Loading corpus: /data4/Data01/SNP_project/bert_store/dataset/1000_genome_SNP/BPE_tokenizer_data/base_pairs.txt
trainer_interface.cc(140) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(140) LOG(INFO) Loaded 2000000 lines
trainer_interface.cc(117) LOG(WARNING) Too many sentences are loaded! (2794221), which may slow down training.
trainer_interface.cc(119) LOG(WARNING) Consider using --input_sentence_size= and --shuffle_input_sentence=true.
trainer_interface.cc(122) LOG(WARNING) They allow to randomly sample sentences from the entire corpus.
trainer_interface.cc(385) LOG(INFO) Loaded all 2794221 sentences
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [UNK]
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [CLS]
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [SEP]
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [PAD]
trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [MASK]
trainer_interface.cc(405) LOG(INFO) Normalizing sentences...
trainer_interface.cc(466) LOG(INFO) all chars count=33474488718
trainer_interface.cc(477) LOG(INFO) Done: 99.9868% characters are covered.
trainer_interface.cc(487) LOG(INFO) Alphabet size=4
trainer_interface.cc(488) LOG(INFO) Final character coverage=0.999868
trainer_interface.cc(520) LOG(INFO) Done! preprocessed 2794221 sentences.
trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 2794221
trainer_interface.cc(537) LOG(INFO) Done! 2794156

It needs 1.8TB of memory, so I added some swap space. Currently it is using 1.4TB of physical memory and 0.4TB of swap, and it has been running for almost 24 hours. What is the expected running time?
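For reference, the same training run can also be launched from the SentencePiece Python wrapper instead of the spm_train binary. The sketch below is only a rough equivalent of the command above: 'xxx.txt' and 'xxxfolder' are the same placeholders used there, and the pad_piece=False flag from the original command is omitted here.

import sentencepiece as spm

# Rough Python-wrapper equivalent of the spm_train command above.
# 'xxx.txt' and 'xxxfolder' are placeholders for the real input file and model prefix.
spm.SentencePieceTrainer.train(
    input='xxx.txt',
    model_prefix='xxxfolder',
    vocab_size=4096,
    character_coverage=0.995,
    model_type='bpe',
    add_dummy_prefix=False,
    unk_piece='[UNK]',
    control_symbols='[PAD],[MASK]',
    bos_piece='[CLS]',
    eos_piece='[SEP]',
    max_sentencepiece_length=16,
    vocabulary_output_piece_score=False,
    max_sentence_length=12600,
    num_sub_iterations=2,
    num_threads=128,
    train_extremely_large_corpus=True,
)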

taku910 commented 3 months ago

That cannot be stated precisely, since it depends on the data and the environment. However, swap memory is extremely slow, so relying on it is generally not recommended. The default unigram mode (--model_type=unigram) takes a long time to build the suffix array. You can roughly estimate the total time by training on a small subset of the data, since training time is O(n) in the corpus size.

BPE (--model_type=bpe) may run faster, as it doesn't perform suffix array construction.
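To get the rough estimate mentioned above, one option is to train on a randomly sampled subset (using --input_sentence_size and --shuffle_input_sentence, as the warning in the log suggests), time it, and scale linearly to the full corpus. Below is a minimal sketch using the Python wrapper; the input path, sample size, and output prefix are placeholder assumptions, not part of the original setup.

import time
import sentencepiece as spm

# Train on a randomly sampled subset of the corpus and measure how long it takes.
# 'xxx.txt' is a placeholder for the real corpus; 100000 is an arbitrary sample size.
sample_size = 100000
start = time.time()
spm.SentencePieceTrainer.train(
    input='xxx.txt',
    model_prefix='subset_bpe',
    vocab_size=4096,
    model_type='bpe',
    input_sentence_size=sample_size,
    shuffle_input_sentence=True,
)
elapsed = time.time() - start

# Training time is roughly linear in the number of sentences, so extrapolate.
total_sentences = 2794221  # from the log above
estimated_hours = elapsed * total_sentences / sample_size / 3600
print('estimated full-corpus training time: %.1f hours' % estimated_hours)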