Closed tomohideshibata closed 4 years ago
About 20 hours after starting, a log line was finally output, so the job wasn't actually stuck.
When I used the TensorFlow BERT code above, log lines were output frequently. I am not sure why the two behave differently.
@tomohideshibata I am facing the same issue, can you suggest what changes I need to solve this issue?
@008karan Hi. As noted in my comment above, a log line was output about 20 hours after starting (I changed nothing). I think there is something strange in the pre-training code.
I have tried to perform pre-training from scratch on GPUs using the following command:
```shell
python run_pretraining.py \
  --albert_config_file=albert_config.json \
  --do_train \
  --input_files=/somewhere/*/tf_examples.*.tfrecord \
  --meta_data_file_path=/somewhere/train_meta_data \
  --output_dir=/somewhere \
  --strategy_type=mirror \
  --train_batch_size=128 \
  --num_train_epochs=2
```
But it seems to be stuck as follows:
The GPUs are busy, but no output appears.
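One possible explanation (an assumption on my part, not something I have confirmed in this repository's code): if the training driver only emits a log line once per inner host loop, then a large steps-per-loop value combined with a slow per-step time makes the job look stuck for a long stretch even though it is training. A minimal sketch of that logging pattern, with made-up numbers:

```python
def train(total_steps, steps_per_loop, step_time_s):
    """Simulate a host loop that logs only once per inner loop.

    With a large steps_per_loop (or a slow per-step time), the first
    log line can take a very long time to appear, so the job looks
    stuck even though it is making progress.
    """
    logs = []
    step = 0
    while step < total_steps:
        inner = min(steps_per_loop, total_steps - step)
        # The inner loop runs entirely on-device; nothing is printed here.
        step += inner
        elapsed = step * step_time_s
        logs.append(f"step {step}: first output after {elapsed:.0f}s")
    return logs

# Hypothetical numbers: a 1000-step inner loop at 2 s/step means the
# first log line only appears after ~2000 s of silence.
print(train(3000, 1000, 2.0)[0])  # → step 1000: first output after 2000s
```

If the `run_pretraining.py` in your checkout exposes a `--steps_per_loop` flag (the TensorFlow Models BERT scripts do; verify against your version), lowering it should make log lines appear sooner at some cost in throughput.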
The core of the pre-training code is similar to the TensorFlow BERT code below, and I have succeeded in running that pre-training code: https://github.com/tensorflow/models/tree/master/official/nlp/bert
My environment is as follows:
Thanks in advance.