kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License
1.54k stars · 339 forks

Question when using apex #416

Closed South-Twilight closed 10 months ago

South-Twilight commented 10 months ago

Hi, when I use apex to train a model on 4 GPUs with batch_size=16 in config.yaml, train.log shows the following:

[train]:   6%|▋         | 157283/2500000 [00:07<5077:11:43,  7.80s/it]
[train]:   6%|▋         | 157283/2500000 [00:07<5062:26:00,  7.78s/it]
[train]:   6%|▋         | 157283/2500000 [00:07<4995:54:26,  7.68s/it]
[train]:   6%|▋         | 157283/2500000 [00:07<5001:31:14,  7.69s/it]
[train]:   6%|▋         | 157284/2500000 [00:09<2837:29:46,  4.36s/it]
[train]:   6%|▋         | 157284/2500000 [00:09<2843:41:43,  4.37s/it]
[train]:   6%|▋         | 157284/2500000 [00:09<2810:08:46,  4.32s/it]
[train]:   6%|▋         | 157284/2500000 [00:09<2812:15:34,  4.32s/it]
[train]:   6%|▋         | 157285/2500000 [00:11<2025:45:21,  3.11s/it]
[train]:   6%|▋         | 157285/2500000 [00:11<2029:06:23,  3.12s/it]
[train]:   6%|▋         | 157285/2500000 [00:11<2010:52:27,  3.09s/it]
[train]:   6%|▋         | 157285/2500000 [00:11<2012:00:48,  3.09s/it]

I'm not sure whether checkpoint-150000steps.pkl means the model was trained for 150000*4=600000 steps with batch_size=16, or for 150000 steps with an effective batch_size of 16*4=64.

I'm looking forward to your reply.

kan-bayashi commented 10 months ago

> 150000steps && batch_size=16*4=64

This one.
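For reference, a minimal sketch (my own, not code from this repository) of the arithmetic behind the answer: in data-parallel training (e.g. PyTorch DistributedDataParallel, which apex-based training typically uses), the `batch_size` in config.yaml is per process, each of the N processes consumes its own batch every step, and the global step counter advances once per optimizer step regardless of GPU count. The helper names below are hypothetical, purely for illustration.

```python
# Sketch of batch-size accounting under data-parallel training
# (assumption: batch_size in config.yaml is per GPU/process, as confirmed
# by the maintainer above; function names here are hypothetical).

def effective_batch_size(per_gpu_batch_size: int, num_gpus: int) -> int:
    """Total samples consumed per optimizer step across all GPUs."""
    return per_gpu_batch_size * num_gpus

def samples_seen(steps: int, per_gpu_batch_size: int, num_gpus: int) -> int:
    """Total samples processed after `steps` global steps."""
    return steps * effective_batch_size(per_gpu_batch_size, num_gpus)

# The setting from this issue: batch_size=16 in config.yaml, 4 GPUs.
print(effective_batch_size(16, 4))    # 64
print(samples_seen(150_000, 16, 4))   # 9600000
```

So checkpoint-150000steps.pkl corresponds to 150000 optimizer steps at an effective batch size of 64, not to 600000 steps.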

South-Twilight commented 10 months ago

Thanks a lot.