Open 008karan opened 4 years ago
Same problem here: 9 GPUs are available, but no training happens at all.
I have tested on very small data (100 KB), and it showed results at the end of each epoch. I want to see results at every step: on a bigger dataset training takes a long time, so printing at every step is necessary. I am still not able to figure out how to do it. @kamalkraj @josegchen
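In stock tf.keras, per-step logging is normally done with a custom callback overriding `on_train_batch_end`. Whether `run_pretraining.py` exposes a callback hook is repo-specific, so this is only a generic sketch; `StepLogger` is a hypothetical helper, shown framework-free (in real code it would subclass `tf.keras.callbacks.Callback`) so the logging logic is self-contained:

```python
# Sketch of a per-step logging callback (hypothetical helper, not part of
# this repo). In real code this would subclass tf.keras.callbacks.Callback
# and be passed to model.fit(callbacks=[StepLogger()]).
class StepLogger:
    def __init__(self, log_every=1):
        self.log_every = log_every  # print every N training batches

    def on_train_batch_end(self, batch, logs=None):
        # Keras calls this after each batch; `logs` holds running metrics
        # such as {"loss": 0.5}. Returns the message for easy inspection,
        # or None when this batch is skipped.
        logs = logs or {}
        if batch % self.log_every != 0:
            return None
        msg = f"step {batch}: " + ", ".join(
            f"{k}={v:.4f}" for k, v in logs.items()
        )
        print(msg)
        return msg
```

With a real tf.keras model, passing such a callback to `model.fit` (or invoking it inside a custom training loop) prints the loss after every batch instead of only once per epoch.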
Would you mind sharing your parameters and settings in detail?
python run_pretraining.py --albert_config_file=model_configs/base/config.json --do_train --input_files=albert/* --meta_data_file_path=meta_data --output_dir=model_checkpoint/ --strategy_type=mirror --train_batch_size=8 --num_train_epochs=3
I have tried with a 313 MB tf_record file; it works on CPU only.
Have you checked GPU usage? In my case, the GPUs are being utilized.
It does show minor GPU utilization: 1-2% across 7-8 GPUs, and 25% on a single GPU for a very brief moment. However, GPU memory is fully occupied. Training eventually halted with a resource exhausted error.
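A resource exhausted error under MirroredStrategy usually means the per-replica batch does not fit in GPU memory. Assuming `train_batch_size` here is the global batch size (the usual TF2 convention; I have not verified this repo's flag), it is split evenly across replicas, so the arithmetic is roughly:

```python
def per_replica_batch(global_batch_size: int, num_replicas: int) -> int:
    """Per-GPU batch under an even MirroredStrategy-style split.

    Many tf.data input pipelines require the global batch to divide
    evenly across replicas, hence the explicit check.
    """
    if global_batch_size % num_replicas != 0:
        raise ValueError(
            f"global batch {global_batch_size} does not divide evenly "
            f"across {num_replicas} replicas"
        )
    return global_batch_size // num_replicas
```

With `--train_batch_size=8` on 8 GPUs, each replica sees only 1 example; if that still runs out of memory, the model config or sequence length is the likelier culprit.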
Dear Karan, I would like to know how it went. Were you able to pre-train using a single GPU? Please share your experience!
I am doing pre-training from scratch. It seems training has started, since the GPUs are being used, but nothing shows in the terminal except this:
I tried on smaller text data as well, but got the same results. @kamalkraj