talebolano / yolov3-network-slimming

An implementation of network-slimming pruning for yolov3

"RuntimeError: reduce failed to synchronize: device-side assert triggered" while trainging #22

Closed QW-Is-Here closed 5 years ago

QW-Is-Here commented 5 years ago

Traceback (most recent call last):
  File "sparsity_train.py", line 154, in <module>
    train()
  File "sparsity_train.py", line 100, in train
    loss = model(imgs, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chensy/QW/yolov3-network-slimming/yolomodel.py", line 352, in forward
    x, losses = self.module_list[i][0](x, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chensy/QW/yolov3-network-slimming/yolomodel.py", line 133, in forward
    loss_conf = self.bce_loss(pred_conf[conf_mask_false], tconf[conf_mask_false]) + \
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/loss.py", line 512, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 2113, in binary_cross_entropy
    input, target, weight, reduction_enum)
RuntimeError: reduce failed to synchronize: device-side assert triggered
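The usual cause of this device-side assert is that `F.binary_cross_entropy` receives predictions outside `[0, 1]`, for example when the predicted confidences or box sizes blow up or turn into NaN as training diverges. A minimal sketch of that failure mode, assuming a CUDA device is available (the tensor values are made up for illustration; depending on the PyTorch version the error surfaces as the device-side assert above or as an explicit range check):

```python
import torch
import torch.nn.functional as F

# binary_cross_entropy requires its input to lie in [0, 1].
# On the GPU a value outside that range trips an assert inside the kernel,
# which then surfaces as "reduce failed to synchronize".
pred = torch.tensor([1.5, -0.2], device="cuda")   # deliberately out of range
target = torch.tensor([1.0, 0.0], device="cuda")

loss = F.binary_cross_entropy(pred, target)  # RuntimeError: device-side assert triggered
```

Running with `CUDA_LAUNCH_BLOCKING=1` makes the assert show up at the offending call instead of at a later synchronization point, which helps confirm which loss term is responsible.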

violet17 commented 5 years ago

I met the same problem. The training log shows that the predicted width becomes extremely large.

violet17 commented 5 years ago

@MrWangg1992 Hi, how did you solve this problem? Can you share it with me?

QW-Is-Here commented 5 years ago

@violet17 I changed the BCELoss call to BCEWithLogitsLoss.
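A minimal sketch of that change, assuming the YOLO layer in yolomodel.py looks roughly like the traceback suggests; `conf_logits`, `pred_conf`, `tconf`, and `conf_mask_false` are hypothetical stand-ins for the layer's tensors:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the tensors used in the YOLO layer.
conf_logits = torch.randn(2, 3, 13, 13)   # raw (pre-sigmoid) confidence output
pred_conf = torch.sigmoid(conf_logits)    # what BCELoss was fed before
tconf = torch.zeros_like(pred_conf)       # target confidence
conf_mask_false = tconf == 0              # "no object" mask

# Before: nn.BCELoss() asserts on the GPU as soon as pred_conf leaves [0, 1]
# (e.g. when it turns into NaN as training diverges).
# bce_loss = nn.BCELoss()
# loss_conf = bce_loss(pred_conf[conf_mask_false], tconf[conf_mask_false])

# After: feed the raw logits and let the loss apply the sigmoid internally,
# which is numerically stable and has no range assert.
bce_loss = nn.BCEWithLogitsLoss()
loss_conf = bce_loss(conf_logits[conf_mask_false], tconf[conf_mask_false])
print(loss_conf)
```

Note that `BCEWithLogitsLoss` expects raw logits; if the already-sigmoided `pred_conf` is passed instead, the assert goes away but the loss is no longer standard BCE, which may itself contribute to unstable training.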

violet17 commented 5 years ago

@MrWangg1992 Thanks. I changed it too, but it didn't help; the loss became NaN.

QW-Is-Here commented 5 years ago

@violet17 Did you change the batch and subdivisions settings in the .cfg file as well?

violet17 commented 5 years ago

@MrWangg1992 I changed them both to 1 because CUDA kept running out of memory.

QW-Is-Here commented 5 years ago

@violet17 Check whether your cfg file is set to the test or the train configuration; with batch and subdivisions both set to 1, training is ineffective. For out-of-memory errors, the only option is to reduce batch size and related settings.
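For reference, a typical darknet-style `[net]` section keeps the test and train settings side by side and comments out whichever is not in use; the numbers below follow the common upstream yolov3.cfg defaults and are only an illustration to be adapted to your GPU memory:

```
[net]
# Testing (detection only)
# batch=1
# subdivisions=1
# Training
batch=64
subdivisions=16        # raise this (e.g. 32 or 64) if CUDA runs out of memory
width=416
height=416
```

Raising subdivisions splits each batch into smaller chunks that are processed one at a time, so it lowers memory use without shrinking the effective batch size the way setting batch=1 does.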

violet17 commented 5 years ago

@MrWangg1992 I set both batch and subdivisions to 1. Is that not allowed? Why?