ruizhecao96 / CMGAN

Conformer-based Metric GAN for speech enhancement
MIT License
309 stars 60 forks

Training can get stuck #34

Closed wen0320 closed 11 months ago

wen0320 commented 1 year ago

Hello. With batch_size=4, mp.spawn(main, args=(2, args), nprocs=2), and train_ds, test_ds = dataloader.load_data(args.data_dir, args.batch_size, 8, args.cut_len), training gets stuck during the first epoch while GPU utilization stays at 100%. Have you encountered this, and how can it be solved?

SherifAbdulatif commented 1 year ago

Your question is not clear: did you increase the test batch size from 4 to 8 in train_ds, test_ds = dataloader.load_data(args.data_dir, args.batch_size, 8, args.cut_len)? 4 is the maximum batch size we tested; 3 can also reproduce very similar results if you have a limited GPU.

wen0320 commented 1 year ago

First of all, thank you very much for your reply. Secondly, I'm sorry I didn't make my problem clear. My question is:

When reproducing your code, I often hit a training stall in the initial epochs (I'm training on two GPUs of a server). Specifically, the training process becomes unresponsive while both GPUs' utilization stays at 100%. I suspect this is caused by using DistributedDataParallel for multi-process training, because the problem does not occur when I use DataParallel instead. Unfortunately, I haven't been able to find a solution myself, which is why I'm reaching out to you. Are you familiar with this issue or have you encountered it before?

coreeey commented 1 year ago

Have you reproduced the result now?

taqta commented 11 months ago

I'm running into the same problem: training gets stuck at epoch 0, step 500/726.

taqta commented 11 months ago

This is the code that blocks training (see attached image). When loss_metric is None, the process takes the else branch and never calls backward, which I suspect desynchronizes the GPUs. The if/else can be replaced equivalently by setting loss_metric (and therefore the gradients) to 0 whenever it would otherwise be None. So the function that produces loss_metric can be changed as follows (see attached image), which avoids the training hang.
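
For reference, a minimal sketch of the change described above (the original screenshots are not available here). The helper name batch_pesq, the sampling rate, and the PESQ normalization are assumptions modeled on the repo's utilities; the key point is that every DDP rank always builds the loss and calls backward, with utterances whose PESQ computation fails contributing zero loss and zero gradient instead of triggering an early return of None.

```python
import torch
from pesq import pesq


def batch_pesq(clean_list, enhanced_list, sr=16000):
    # Instead of returning None when any PESQ computation fails,
    # return a score tensor plus a 0/1 mask so the caller can always
    # build a loss and call backward on every DDP rank.
    scores, mask = [], []
    for clean, enhanced in zip(clean_list, enhanced_list):
        try:
            score = pesq(sr, clean, enhanced, "wb")
            scores.append((score - 1.0) / 3.5)  # normalization assumed from the repo
            mask.append(1.0)
        except Exception:
            scores.append(0.0)  # failed utterance contributes zero loss
            mask.append(0.0)
    return torch.tensor(scores), torch.tensor(mask)


# In the discriminator step, replace the "if pesq_score is not None" branch with
# an always-executed, masked MSE so all ranks stay in sync:
#   pesq_score, pesq_mask = batch_pesq(clean_audio_list, enhanced_audio_list)
#   pesq_score, pesq_mask = pesq_score.to(device), pesq_mask.to(device)
#   loss_metric = torch.mean(pesq_mask * (predict_metric.flatten() - pesq_score) ** 2)
#   loss_metric.backward()
```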

wen0320 commented 11 months ago

Thanks a lot, this indeed works.