How long does it take to train 31.2GB text data?

I have 31.2GB text data, with the length of each sentence is around 12000. I use

spm_train --input='xxx.txt' --model_prefix='xxxfolder' --vocab_size=4096 --character_coverage=0.995 --model_type='bpe' --add_dummy_prefix=False --unk_piece="[UNK]" --control_symbols="[PAD],[MASK]" --bos_piece="[CLS]" --eos_piece="[SEP]" --pad_piece=False --max_sentencepiece_length=16 --vocabulary_output_piece_score=False --max_sentence_length=12600 --num_sub_iterations=2 --num_threads=128 --train_extremely_large_corpus=true

trainer_interface.cc(329) LOG(INFO) SentenceIterator is not specified. Using MultiFileSentenceIterator. trainer_interface.cc(178) LOG(INFO) Loading corpus: /data4/Data01/SNP_project/bert_store/dataset/1000_genome_SNP/BPE_tokenizer_data/base_pairs.txt trainer_interface.cc(140) LOG(INFO) Loaded 1000000 lines trainer_interface.cc(140) LOG(INFO) Loaded 2000000 lines trainer_interface.cc(117) LOG(WARNING) Too many sentences are loaded! (2794221), which may slow down training. trainer_interface.cc(119) LOG(WARNING) Consider using --input_sentence_size= and --shuffle_input_sentence=true. trainer_interface.cc(122) LOG(WARNING) They allow to randomly sample sentences from the entire corpus. trainer_interface.cc(385) LOG(INFO) Loaded all 2794221 sentences trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [UNK] trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [CLS] trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [SEP] trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [PAD] trainer_interface.cc(400) LOG(INFO) Adding meta_piece: [MASK] trainer_interface.cc(405) LOG(INFO) Normalizing sentences... trainer_interface.cc(466) LOG(INFO) all chars count=33474488718 trainer_interface.cc(477) LOG(INFO) Done: 99.9868% characters are covered. trainer_interface.cc(487) LOG(INFO) Alphabet size=4 trainer_interface.cc(488) LOG(INFO) Final character coverage=0.999868 trainer_interface.cc(520) LOG(INFO) Done! preprocessed 2794221 sentences. trainer_interface.cc(526) LOG(INFO) Tokenizing input sentences with whitespace: 2794221 trainer_interface.cc(537) LOG(INFO) Done! 2794156

It takes 1.8T memory, so I added some swap space. Currently, it is using 1.4TB of physical memory and 0.4TB of swap. It has been running for almost 24 hours. What is the expected running time?

google / sentencepiece

How long does it take to train 31.2GB text data? #1021