Open SeanCho1996 opened 3 years ago
I tried updating the PyTorch version to 1.4.0 (with torchvision 0.5.0), and this problem was solved. Maybe the script's dependency pins,

```python
dependencies={
    ModelDependency.TORCH: '1.0.1',
    ModelDependency.TORCHVISION: '0.2.2',
},
```

should be updated to the newest version of torch?
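For reference, the change described above would amount to bumping the pinned versions (assuming the same `ModelDependency` keys; exact versions are what worked for me, not an official recommendation):

```python
dependencies={
    ModelDependency.TORCH: '1.4.0',        # was '1.0.1'
    ModelDependency.TORCHVISION: '0.5.0',  # was '0.2.2'
},
```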
Thanks for the information. I see your script is using a batch size of 256, which cannot fit into a single GPU. A VGG (VGG11BN) is very large, so we normally use a small batch size (32 in this case).
For CPU-only training, a batch size of 256 works fine for VGG. In other words, the batch size of 256 is intended for CPU.
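The rule of thumb above can be sketched as a small helper (hypothetical function name; the 32/256 values are the ones quoted in this thread):

```python
def pick_batch_size(gpu_available: bool) -> int:
    """Pick a VGG batch size following the advice in this thread:
    VGG11BN is large, so 32 fits on a single GPU; 256 is only
    practical for CPU-only training."""
    return 32 if gpu_available else 256

print(pick_batch_size(True), pick_batch_size(False))  # 32 256
```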
I was trying to train a VGG model with singa-auto in a local environment; the training script is PyPandaVgg.py.
My initial trial number was set to 7 and Time_hour to 0.5 h, but as the process went on, around the 4th or 5th trial the GPU memory usage rose to 10169 MB and kept rising, until the process crashed with a GPU memory overflow.
When I attempted to debug the whole procedure, I found that the likely problem is in the `dev.py` file, more precisely in the trial loop of the `tune_model` function: at the end of each loop, the `destroy` function is called to delete the model temporarily loaded into GPU memory. But when I dug into this function in `model.py`, I found that it is empty, so apparently the model is never deleted. My question is whether this function is simply not implemented yet, or whether I have misunderstood something. Thank you.
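For what it's worth, a minimal sketch of what a working `destroy` could do is below. This is my assumption of the intended behavior, not singa-auto's actual implementation; `PyModel` and `_model` are hypothetical names standing in for the model wrapper:

```python
import gc


class PyModel:
    """Hypothetical stand-in for a singa-auto model wrapper."""

    def __init__(self):
        # Placeholder for the loaded VGG network that occupies GPU memory.
        self._model = object()

    def destroy(self):
        # Drop the reference so Python's GC can reclaim the object.
        self._model = None
        gc.collect()
        # If PyTorch is present and a GPU is in use, also release the
        # cached CUDA blocks back to the driver.
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass


m = PyModel()
m.destroy()
print(m._model is None)  # True
```

Calling something like this at the end of each trial should keep the per-trial GPU footprint from accumulating; as written, the empty `destroy` in `model.py` frees nothing.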