Open 008karan opened 4 years ago
Same problem here: 9 GPUs are available, but no training happens at all.
I have tested on very small data (100 KB), and it showed results at the end of each epoch. I want to see results at every step: on a bigger dataset training takes a long time, so printing at every step is necessary. I am still not able to figure out how to do it. @kamalkraj @josegchen
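In stock tf.keras, per-step logging is normally done with a custom callback overriding `on_train_batch_end`. Whether `run_pretraining.py` exposes a callback hook is repo-specific, so this is only a generic sketch; `StepLogger` is a hypothetical helper, shown framework-free (in real code it would subclass `tf.keras.callbacks.Callback`) so the logging logic is self-contained:

```python
# Sketch of a per-step logging callback (hypothetical helper, not part of
# this repo). In real code this would subclass tf.keras.callbacks.Callback
# and be passed to model.fit(callbacks=[StepLogger()]).
class StepLogger:
    def __init__(self, log_every=1):
        self.log_every = log_every  # print every N training batches

    def on_train_batch_end(self, batch, logs=None):
        # Keras calls this after each batch; `logs` holds running metrics
        # such as {"loss": 0.5}. Returns the message for easy inspection,
        # or None when this batch is skipped.
        logs = logs or {}
        if batch % self.log_every != 0:
            return None
        msg = f"step {batch}: " + ", ".join(
            f"{k}={v:.4f}" for k, v in logs.items()
        )
        print(msg)
        return msg
```

With a real tf.keras model, passing such a callback to `model.fit` (or invoking it inside a custom training loop) prints the loss after every batch instead of only once per epoch.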
Would you mind sharing your parameters and settings in detail?
python run_pretraining.py --albert_config_file=model_configs/base/config.json --do_train --input_files=albert/* --meta_data_file_path=meta_data --output_dir=model_checkpoint/ --strategy_type=mirror --train_batch_size=8 --num_train_epochs=3
I have tried with a 313 MB tf_record file; it works on CPU only.
Have you checked GPU usage? In my case, the GPUs are being utilized.
It does show minor GPU utilization: 1-2% across 7-8 GPUs, and 25% on a single GPU for a very brief moment. However, GPU memory is fully occupied. Training eventually halted with a resource exhausted error.
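A resource exhausted error under MirroredStrategy usually means the per-replica batch does not fit in GPU memory. Assuming `train_batch_size` here is the global batch size (the usual TF2 convention; I have not verified this repo's flag), it is split evenly across replicas, so the arithmetic is roughly:

```python
def per_replica_batch(global_batch_size: int, num_replicas: int) -> int:
    """Per-GPU batch under an even MirroredStrategy-style split.

    Many tf.data input pipelines require the global batch to divide
    evenly across replicas, hence the explicit check.
    """
    if global_batch_size % num_replicas != 0:
        raise ValueError(
            f"global batch {global_batch_size} does not divide evenly "
            f"across {num_replicas} replicas"
        )
    return global_batch_size // num_replicas
```

With `--train_batch_size=8` on 8 GPUs, each replica sees only 1 example; if that still runs out of memory, the model config or sequence length is the likelier culprit.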
Dear Karan, I would like to know how it went. Were you able to pre-train using a single GPU? Please share your experience!
I am doing pre-training from scratch. It seems training has started, since the GPUs are being used, but nothing shows in the terminal except this:
I tried on smaller text data as well, but got the same results. @kamalkraj