isht7 / pytorch-deeplab-resnet

DeepLab resnet v2 model in pytorch
MIT License

Inconsistency in memory consumption of Resnet-101 libraries #29

Closed · omkar13 closed this issue 6 years ago

omkar13 commented 6 years ago

Hi, thank you @isht7 for writing this code. I am having a problem with the memory used by the code. With a batch size of 1, it consumes around 7-8 GB. I have only 1 GPU, so I cannot increase the batch size further. However, when I used this library - https://github.com/speedinghzl/Pytorch-Deeplab - which also implements DeepLab-v2 ResNet-101, I could increase the batch size to 10. Isn't this unusual? Could you tell me what changes I need to make to your code so that I can increase the batch size? My GPU has 11.1 GB of memory.

Thanking you in anticipation.
Regards, Omkar.

isht7 commented 6 years ago

I guess you cannot run my code with 11.1 GB of GPU memory. If you modify this line, you can make it fit in less memory: change random.uniform(0.5, 1.3) to random.uniform(0.5, 1.1) or an even lower upper bound, and lower it in steps until it fits on your GPU. Also, in the original Caffe code, the solver parameter iter_size is 10, which means that the gradient is accumulated over 10 iterations and then applied. This is what I have done in my code; it is roughly equivalent to setting the batch_size to 10 (see iter_size at this link). The library that you mentioned might be doing this.
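In PyTorch terms, this is just gradient accumulation. A minimal sketch of the idea (not my exact training script; the toy model, data, and hyperparameters below are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, kernel_size=3, padding=1)      # placeholder for the DeepLab ResNet-101 network
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=2.5e-4, momentum=0.9)

iter_size = 10                                           # accumulate over 10 batch-size-1 iterations

optimizer.zero_grad()
for step in range(100):
    image = torch.randn(1, 3, 321, 321)                  # batch size 1 keeps peak memory low
    label = torch.randint(0, 21, (1, 321, 321))
    loss = criterion(model(image), label) / iter_size    # scale so the summed gradient matches one batch of 10
    loss.backward()                                      # gradients accumulate in .grad across iterations
    if (step + 1) % iter_size == 0:
        optimizer.step()                                 # apply the accumulated gradient
        optimizer.zero_grad()
```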

omkar13 commented 6 years ago

But the time taken is not affected by using iter_size, and I wanted to increase the batch size in order to decrease the training time. I am running on a different dataset with 480x853 images; maybe that is why it already takes 7-8 GB at batch size 1. What image size did you test with? Also, the library I mentioned is not using the iter_size trick. It passes batch_size=10 to the PyTorch DataLoader (I tried the mentioned library with 321x321 images and could run with a batch size of 10).
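For reference, this is roughly what that DataLoader usage looks like (a minimal sketch with a placeholder in-memory dataset, not the exact code of that repository):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder in-memory dataset; the real repositories use their own Dataset classes.
images = torch.randn(40, 3, 321, 321)
labels = torch.randint(0, 21, (40, 321, 321))
dataset = TensorDataset(images, labels)

# batch_size=10 makes the DataLoader stack 10 samples into one tensor per iteration.
loader = DataLoader(dataset, batch_size=10, shuffle=True)

for image_batch, label_batch in loader:
    print(image_batch.shape)   # torch.Size([10, 3, 321, 321])
    break
```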

isht7 commented 6 years ago

I just opened the implementation that you referred to above, and noticed that it is an edited version of my repository. It is unprofessional of @speedinghzl to use my code without citing my repo, even though it is under the MIT license. If you are talking about this file, then it seems that this version does not implement the loss over multiple scales, which allows it to fit a higher batch size. @speedinghzl might have made other modifications to my code that I do not know of.
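For context, training with the loss over multiple scales looks roughly like the sketch below (a simplified illustration, not the exact code of either repository). Each additional scale is an extra forward pass whose activations must be kept for the backward pass, which is why dropping it frees enough memory for a larger batch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(3, 21, kernel_size=3, padding=1)       # placeholder for the segmentation network
criterion = nn.CrossEntropyLoss()

image = torch.randn(1, 3, 321, 321)
label = torch.randint(0, 21, (1, 321, 321))

total_loss = 0.0
for scale in (1.0, 0.75, 0.5):
    # Every scale runs a full forward pass and keeps its own activations,
    # so peak memory grows with the number of scales.
    scaled = F.interpolate(image, scale_factor=scale, mode="bilinear", align_corners=False)
    logits = model(scaled)
    logits = F.interpolate(logits, size=label.shape[-2:], mode="bilinear", align_corners=False)
    total_loss = total_loss + criterion(logits, label)

total_loss.backward()
```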

4F2E4A2E commented 6 years ago

A lot of copy-paste in AI these days... @isht7 thanks for this repo!

omkar13 commented 6 years ago

Thank you for the responses.