Open mbcel opened 7 years ago
I think an explicit GC pass may help you: run collectgarbage() after saving the model, then resume training.
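A minimal sketch of that sequence (the `model` variable and the checkpoint filename are assumptions):

```lua
-- Save a checkpoint, then force a full garbage-collection cycle so Lua
-- releases the temporary serialization buffers before training resumes.
torch.save('checkpoint.t7', model)
collectgarbage()
collectgarbage()  -- a second pass is a common idiom to also collect
                  -- objects freed by finalizers in the first pass
-- ...continue the training loop here...
```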
Doing weights = savedWeights won't load your model. It simply rebinds your variable weights, and the model itself stays unchanged. You probably want weights:copy(savedWeights) instead. Also, be careful with model:getParameters(): as the docs suggest, it should only be called once during training (it looks like you're calling it every time you save).
Note that the running mean and std statistics in BatchNormalization modules are not saved by your method.
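Putting the two suggestions together, a sketch of the intended pattern (the variable names and filename are assumptions, not the original poster's code):

```lua
-- Call getParameters() ONCE, before training starts: it flattens all
-- parameters into a single tensor and invalidates earlier references.
local weights, gradWeights = model:getParameters()

-- Saving later just serializes that flat tensor:
torch.save('weights.t7', weights)

-- Loading: copy the saved values INTO the existing flat tensor instead
-- of rebinding the local variable, so the model actually sees them.
local savedWeights = torch.load('weights.t7')
weights:copy(savedWeights)
```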
The only simple and fast way I found was just copying the weights to a memory-mapped file:
torch.Storage("weights.bin",true,w:size(1)):copy(w:storage())
Then I make a copy of the optimizer state without the tensors in it, save it with torch.save()
, then I save the state tensors (e.g. m
, v
, and denom
for adam) just like I did it with the weights. (I haven't used batch normalization with it, so that may need some more stuff to be saved.)
It's very fast, it doesn't waste memory, and it's a pity something like this isn't already available built-in.
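The whole scheme above can be sketched as follows; this is my reading of the description, assuming an adam state table named `optimState` (the fields m, v, and denom follow optim.adam's conventions):

```lua
local w, gw = model:getParameters()

-- 1. Weights: copy into a shared, file-backed (memory-mapped) storage.
torch.Storage('weights.bin', true, w:size(1)):copy(w:storage())

-- 2. Optimizer state: torch.save() a copy without the tensors in it.
local scalarState = {}
for k, v in pairs(optimState) do
  if not torch.isTensor(v) then scalarState[k] = v end
end
torch.save('optim_scalars.t7', scalarState)

-- 3. State tensors (m, v, denom for adam): same mmap trick as the weights.
for _, name in ipairs({'m', 'v', 'denom'}) do
  local t = optimState[name]
  if t then
    torch.Storage(name .. '.bin', true, t:nElement()):copy(t:storage())
  end
end
```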
I want to save a model at multiple stages during training. For that, the saved file needs to be small so I don't quickly run out of disk space, but currently a saved model takes up more than 8 GB.
I know the method clearState() exists, which makes the file much smaller (about 1.6 GB). However, whenever I call it I can't continue training afterwards due to a CUDA out-of-memory error. I should also add that I use the optnet package. I do it as follows:
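(The code block from the original post did not survive extraction. The save step presumably looked roughly like this, a reconstruction rather than the author's exact code, using optnet's documented optimizeMemory/removeOptimization pair:)

```lua
local optnet = require 'optnet'

-- undo optnet's buffer sharing before serializing, so the shared
-- storages don't end up in the saved file
optnet.removeOptimization(model)
model:clearState()            -- drop output/gradInput buffers (8 GB -> ~1.6 GB)
torch.save('model.t7', model)

-- re-apply the memory optimization before continuing training;
-- `input` is a sample batch of the right shape (assumed to exist)
optnet.optimizeMemory(model, input, {inplace = true, mode = 'training'})
```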
This saves the model as expected, but then the next model:forward() call throws a CUDA out-of-memory error:
Because I couldn't solve this error, my second approach was to store only the model's weights. I retrieved the parameters and saved them via
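(The snippet is missing from the post; saving flattened parameters in Torch usually looks like this, a reconstruction:)

```lua
-- flatten all learnable parameters into one tensor and save it
local weights, gradWeights = model:getParameters()
torch.save('weights.t7', weights)
```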
Then on restart of the model I would load the weights and create the network from scratch:
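(This snippet is also missing; the restore step presumably looked roughly like this, where `createModel` is a hypothetical function standing in for the original network-construction code:)

```lua
-- rebuild the architecture, then copy the saved values into its
-- flattened parameter tensor
local model = createModel()   -- hypothetical: builds the same network
local weights, gradWeights = model:getParameters()
local savedWeights = torch.load('weights.t7')
weights:copy(savedWeights)
```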
But then I get a strange error, "No algorithms found that would fit in free GPU memory", from the find.lua file in the cudnn package.