talebolano / yolov3-network-slimming

An implementation of network-slimming pruning for yolov3
345 stars 93 forks

Loss increases after switching to multi-GPU training #33

Closed RyanLBWoods closed 5 years ago

RyanLBWoods commented 5 years ago

When I train on multiple GPUs with DataParallel, only the total loss, recall, and precision seem to fluctuate normally; the other losses (x, y, w, h, conf, cls) keep climbing as if they were being accumulated across batches.

[Epoch 0/2000, Batch 1/5864] [Losses: x 0.591932, y 0.528098, w 0.746021, h 0.659080, conf 1.570114, cls 4.276174, total 2.265707, recall: 0.69901, precision: 0.04759]
[Epoch 0/2000, Batch 2/5864] [Losses: x 0.857584, y 0.785161, w 1.098811, h 0.966139, conf 2.190402, cls 6.412073, total 1.969376, recall: 0.73333, precision: 0.04094]
[Epoch 0/2000, Batch 3/5864] [Losses: x 1.102311, y 1.012907, w 1.538904, h 1.343006, conf 2.772152, cls 8.552397, total 2.005754, recall: 0.72408, precision: 0.04118]
[Epoch 0/2000, Batch 4/5864] [Losses: x 1.368159, y 1.248976, w 1.888377, h 1.623722, conf 3.388307, cls 10.692499, total 1.944181, recall: 0.78692, precision: 0.05367]
[Epoch 0/2000, Batch 5/5864] [Losses: x 1.633099, y 1.499638, w 2.118895, h 1.983082, conf 4.027135, cls 12.831954, total 1.941882, recall: 0.69855, precision: 0.03878]
[Epoch 0/2000, Batch 6/5864] [Losses: x 1.869458, y 1.737572, w 2.342767, h 2.216016, conf 4.740722, cls 14.973027, total 1.892880, recall: 0.72807, precision: 0.04427]
[Epoch 0/2000, Batch 7/5864] [Losses: x 2.126234, y 1.999815, w 2.588304, h 2.463250, conf 5.330129, cls 17.117223, total 1.872695, recall: 0.69712, precision: 0.04138]
[Epoch 0/2000, Batch 8/5864] [Losses: x 2.410720, y 2.238643, w 2.903877, h 2.808697, conf 6.035615, cls 19.263997, total 2.018299, recall: 0.63497, precision: 0.03782]
[Epoch 0/2000, Batch 9/5864] [Losses: x 2.672090, y 2.490939, w 3.304709, h 3.146332, conf 6.816026, cls 21.392921, total 2.080733, recall: 0.77885, precision: 0.05805]
[Epoch 0/2000, Batch 10/5864] [Losses: x 2.907513, y 2.714998, w 3.519198, h 3.365142, conf 7.449289, cls 23.531282, total 1.832202, recall: 0.76154, precision: 0.04329]
[Epoch 0/2000, Batch 11/5864] [Losses: x 3.195225, y 2.953113, w 3.833658, h 3.722046, conf 8.108175, cls 25.690390, total 2.007593, recall: 0.65499, precision: 0.04191]
[Epoch 0/2000, Batch 12/5864] [Losses: x 3.477309, y 3.211058, w 4.125373, h 3.966005, conf 8.913965, cls 27.824598, total 2.007851, recall: 0.76720, precision: 0.04063]
[Epoch 0/2000, Batch 13/5864] [Losses: x 3.751552, y 3.462639, w 4.544926, h 4.246683, conf 9.675089, cls 29.995962, total 2.079271, recall: 0.66534, precision: 0.05127]
[Epoch 0/2000, Batch 14/5864] [Losses: x 3.988789, y 3.692157, w 4.929888, h 4.622669, conf 10.249074, cls 32.140991, total 1.973358, recall: 0.73635, precision: 0.05072]
[Epoch 0/2000, Batch 15/5864] [Losses: x 4.210305, y 3.945491, w 5.236189, h 4.965648, conf 10.837334, cls 34.282791, total 1.927095, recall: 0.71244, precision: 0.04896]
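Looking at the numbers, each component loss grows by a nearly constant step per batch (e.g. cls rises by roughly 2.14 every batch) while total stays around 2.0, which is exactly what you would see if a running sum were being printed for the components but a per-batch value for total. A minimal illustration of the two logging styles (plain Python, hypothetical numbers):

```python
# Hypothetical per-batch cls losses (not taken from the actual run).
per_batch_cls = [2.14, 2.13, 2.15, 2.12, 2.14]

running_sum = 0.0
cumulative = []
for v in per_batch_cls:
    running_sum += v          # logging a running sum instead of the batch value
    cumulative.append(running_sum)

print(cumulative)     # climbs steadily, like the cls column in the log above
print(per_batch_cls)  # roughly flat, like the total column
```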

I added the following multi-GPU check:

if cuda:
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
        optimizer = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
    model = model.cuda()
    optimizer = optimizer.cuda()

Then I changed the loss and optimizer update code to:

optimizer.module.zero_grad()
loss = model(imgs, targets)
#loss.sum().backward()
loss.mean().backward()
optimizer.module.step()
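If only the model is wrapped, my understanding is that the update step needs no `.module` indirection: the model returns one loss value per replica (or per sample), `.mean()` reduces them to a scalar, and the plain optimizer calls apply. A self-contained toy sketch of that pattern (the model and shapes below are made up for illustration, not the repo's YOLO model):

```python
import torch
import torch.nn as nn

# Toy stand-in for a model that computes its loss internally, so that
# .mean() mirrors reducing the per-GPU losses DataParallel gathers.
class ToyLossModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 1)

    def forward(self, x, targets):
        return (self.fc(x).squeeze(-1) - targets) ** 2  # one loss per sample

model = ToyLossModel()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model).cuda()  # wrap the model only

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 4)
targets = torch.randn(8)

optimizer.zero_grad()          # plain optimizer call, no .module
loss = model(x, targets)       # per-sample / per-replica losses
loss.mean().backward()         # reduce to a scalar before backprop
optimizer.step()
print(float(loss.mean()))
```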

I also changed every other reference to model into model.module accordingly. Could someone please take a look and help?