google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Extremely large RAM consumption by create_pretraining.py #204

Open sultanovazamat opened 4 years ago

sultanovazamat commented 4 years ago

Hi, everyone! I am trying to train ALBERT from scratch on a multilingual dataset of ~40 GB. I trained the SentencePiece model without any problems, but when I launch the create_pretraining.py script it consumes an extremely large amount of RAM; even 1 TB is not enough. So the question is: how much memory does it require? And could the issue be related to the presence of non-Latin languages in the dataset? Thanks!

jeisinge commented 4 years ago

We ran into this issue as well. We solved it by splitting the corpus into ~100 MB files and then running them as separate processes through a caller script that ran one process per CPU on our system. Each process used ~4 GB of memory.

We found https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor useful for running one process right after another with a pool size equal to the number of CPUs.
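A minimal sketch of such a caller script, for anyone who wants a starting point; the chunk directory, output naming, and flag values below are assumptions to adapt to your own setup:

```python
import glob
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical caller script: run one create_pretraining_data process per CPU.
# Paths and flag values are illustrative assumptions, not taken from the repo.

def run_chunk(path):
    out = path + ".tfrecord"
    cmd = [
        "python", "-m", "create_pretraining_data",
        "--input_file=" + path,
        "--output_file=" + out,
        "--vocab_file=/albert/assets/30k-clean.vocab",
        "--spm_model_file=/albert/assets/30k-clean.model",
        "--max_seq_length=256",
        "--dupe_factor=1",
    ]
    subprocess.run(cmd, check=True)  # each worker just waits on its own subprocess
    return out


if __name__ == "__main__":
    chunks = sorted(glob.glob("/in/chunks/*.txt"))  # the ~100 MB text chunks
    # Threads are enough here: each thread only blocks on a child process,
    # so the pool size caps concurrency at the number of CPUs.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        for out in pool.map(run_chunk, chunks):
            print("finished", out)
```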

sultanovazamat commented 4 years ago

> We ran into this issue as well. We solved it by splitting the corpus into ~100 MB files and then running them as separate processes through a caller script that ran one process per CPU on our system. Each process used ~4 GB of memory.
>
> We found https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor useful for running one process right after another with a pool size equal to the number of CPUs.

Hi @jeisinge! Thanks for the great solution! Could you please share the exact time spent on processing one chunk (~100 MB)?

jeisinge commented 4 years ago

We take about 25 minutes per file; obviously, this is an embarrassingly parallelizable task.

Also, we adjusted the parameters a bit because we have a very large corpus: we felt that we didn't need to augment our data, though it is not yet clear whether this was a good idea. The command we are running looks like:

python -m create_pretraining_data \
    --input_file=/in/part-00033-tid-2788136423935398351-765d3136-2064-43bd-831b-ed3e65a30183-5151-1-c000.txt \
    --output_file=/out/part-00033-tid-2788136423935398351-765d3136-2064-43bd-831b-ed3e65a30183-5151-1-c000.txt.tfrecord \
    --vocab_file=/albert/assets/30k-clean.vocab \
    --spm_model_file=/albert/assets/30k-clean.model \
    --max_seq_length=256 \
    --dupe_factor=1 \
    --masked_lm_prob=0.15 \
    --max_predictions_per_seq=38

The parameters that matter most for the output data size, I believe, are dupe_factor and max_seq_length.
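For intuition, here is a very rough back-of-the-envelope sketch of why those two flags dominate the output size; every constant below (bytes of raw text per token, the int64 feature encoding, the 100 MB chunk size) is an illustrative assumption, not a measurement:

```python
# Hypothetical estimate of TFRecord output size for one ~100 MB text chunk.
# All constants are illustrative assumptions, not measured values.
max_seq_length = 256
dupe_factor = 1
max_predictions_per_seq = 38

chunk_bytes = 100 * 2**20            # one ~100 MB input chunk
tokens = chunk_bytes / 6             # assume ~6 bytes of raw text per SentencePiece token
instances = tokens / max_seq_length  # sequences are packed up to max_seq_length tokens

# Each instance stores token ids, masks, and segment ids (length max_seq_length)
# plus masked-LM positions/ids/weights (length max_predictions_per_seq), mostly as int64.
bytes_per_instance = 3 * max_seq_length * 8 + 3 * max_predictions_per_seq * 8

output_bytes = dupe_factor * instances * bytes_per_instance
print(f"~{output_bytes / 2**20:.0f} MB of TFRecords per chunk")  # grows linearly with dupe_factor
```

The takeaway is just that the output scales linearly with dupe_factor and roughly with the per-instance cost set by max_seq_length, so keeping dupe_factor=1 on a 40 GB corpus saves a large multiple of disk and processing time.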