jay-mahadeokar / pynetbuilder

pyNetBuilder is a modular, Pythonic interface with built-in modules for generating popular caffe prototxt network file definitions.
BSD 2-Clause "Simplified" License

Training does not converge for resnet50-ssd on pascal VOC dataset #2

Closed: kristellmarisse closed this issue 8 years ago

kristellmarisse commented 8 years ago

I am training SSD-ResNet50 on the Pascal VOC dataset. Since I have a smaller GPU (GTX 960, 4 GB), I reduced the batch size to train. The training loss started at 14 and went down to 7 after 7k iterations, but after that the loss doesn't seem to decrease. Is it because I changed the batch size?

jay-mahadeokar commented 8 years ago

What is your batch size? I get the best results with a batch size of 32 (8 per GPU * 4 GPUs in parallel). I also found that a batch size as low as 14 converges, though the results are not the best. I would also look at the running average of the training loss to see what's happening; see the training plot here.
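For reference, a minimal sketch of what smoothing the training loss with a running average might look like, assuming you have already parsed the per-iteration loss values out of the Caffe training log (the window size and loss values below are purely illustrative):

```python
from collections import deque

def running_average(losses, window=100):
    """Smooth noisy per-iteration training losses with a sliding window."""
    buf = deque(maxlen=window)
    averaged = []
    for loss in losses:
        buf.append(loss)
        averaged.append(sum(buf) / len(buf))
    return averaged

# Example: losses parsed from the Caffe log (values are made up).
raw_losses = [14.2, 13.8, 12.9, 11.5, 9.7, 8.4, 7.6, 7.3, 7.1, 7.0]
print(running_average(raw_losses, window=4))
```

Plotting the smoothed curve instead of the raw loss makes it much easier to tell a plateau from ordinary iteration-to-iteration noise.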

kristellmarisse commented 8 years ago

Thank you for the leads. My batch size was only 2 (that was the best I could squeeze into my GPU memory). Is it okay to increase the effective batch size by modifying the iter_size parameter in solver.prototxt? I usually use this trick in py-faster-rcnn.
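For context, Caffe's iter_size accumulates gradients over several forward/backward passes before applying a weight update, so the effective batch size is batch_size (from the data layer) times iter_size. A hedged sketch of the relevant solver.prototxt fields, with placeholder paths and illustrative hyperparameter values:

```
# solver.prototxt (illustrative values, not the repo's actual settings)
net: "models/resnet50_ssd/train.prototxt"   # placeholder path
iter_size: 16        # accumulate gradients over 16 forward/backward passes
base_lr: 0.001
momentum: 0.9
weight_decay: 0.0005
# With batch_size: 2 in the train data layer, this gives an
# effective batch size of 2 * 16 = 32 per weight update.
```

The trade-off is wall-clock time: each update now runs 16 forward/backward passes, but memory usage stays at the single-pass batch size of 2.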

jay-mahadeokar commented 8 years ago

@kristellmarisse I have not tried that setting. Maybe you could try a smaller network so you can fit a bigger batch size? See the other ResNet models shared here, pretrained on ImageNet, which give decent top-1 accuracy. The # params field in the comparison tables indicates the model size.

kristellmarisse commented 8 years ago

Thank you for sharing more models.

kristellmarisse commented 8 years ago

By the way, could you share the specs of the GPU on which you trained ResNet+SSD?

jay-mahadeokar commented 8 years ago

I think it's a K80; it has 11 GB of memory.