torch / nn


How to properly save a model? #1090

Open mbcel opened 7 years ago

mbcel commented 7 years ago

I want to save a model at multiple stages during training. To avoid quickly running out of disk space, the saved files need to be small; currently a single saved model takes up more than 8 GB on disk.

I know that the method clearState() exists, which makes the file much smaller (about 1.6 GB). However, whenever I call it, I can't continue training the model afterwards due to a CUDA out-of-memory error. I should also add that I use the optnet package. I do it as follows:

model:clearState()
optnet.removeOptimization(model)

torch.save(path, model)

local sampleInput = torch.zeros(1, options.channels, options.inputImageSize, options.inputImageSize):cuda()
optnet.optimizeMemory(model, sampleInput, {inplace = false, mode = "training"})

This saves the model as expected, but the next model:forward() call then throws a CUDA out-of-memory error:

...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:143: cuda runtime error (2) : out of memory at /home/.../torch/extra/cutorch/lib/THC/generic/THCStorage.cu:65
stack traceback:
    [C]: in function 'resize'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:143: in function 'createIODescriptors'
    ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:187: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:185>
    [C]: in function 'xpcall'
    /home/.../torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/.../torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./trainManager.lua:105: in function 'opfunc'
    /home/.../torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'

Because I couldn't solve this error, my second approach was to store only the model's weights. I retrieved the parameters and saved them via:

local weights, gradParams = model:getParameters()
torch.save(path, weights)

Then, to restore the model, I would create the network from scratch and load the saved weights:

... -- create model and load weights to savedWeights
local weights, gradParams = model:getParameters()
weights = savedWeights

But then I get a strange error from find.lua in the cudnn package: "No algorithms found that would fit in free GPU memory".

toshi-k commented 7 years ago

I think explicit GC may help you. Run collectgarbage() after saving the model, then resume training.
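Applied to the original snippet, this suggestion would slot in right after the save. A sketch (untested), reusing the question's `path` and `options`; note that two collectgarbage() passes are a common Lua idiom, since userdata with finalizers may only be reclaimed on the second pass:

```lua
model:clearState()               -- drop intermediate buffers before saving
optnet.removeOptimization(model)
torch.save(path, model)

collectgarbage()                 -- free Lua-side references to the old buffers
collectgarbage()                 -- second pass reclaims userdata with finalizers

local sampleInput = torch.zeros(1, options.channels,
    options.inputImageSize, options.inputImageSize):cuda()
optnet.optimizeMemory(model, sampleInput, {inplace = false, mode = "training"})
```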

FabbyD commented 7 years ago

Doing weights = savedWeights won't load your model. It simply rebinds your variable called weights to the saved tensor, and the model itself stays unchanged. You want weights:copy(savedWeights) instead. Also, be careful with model:getParameters(): as the docs suggest, it should only be called once during training (it looks like you're calling it every time you save).
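A sketch of the corrected restore sequence, assuming a hypothetical createModel() that rebuilds the same architecture and a `path` pointing at the saved weights:

```lua
-- Rebuild the same architecture first (createModel is an assumed helper)
local model = createModel():cuda()

-- getParameters() flattens all weights into one tensor; call it ONCE
-- and keep these references for the rest of training
local weights, gradParams = model:getParameters()

-- Copy values into the existing flat storage instead of rebinding the
-- variable, so the model's own tensors actually receive the saved values
local savedWeights = torch.load(path)
weights:copy(savedWeights)
```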

joeyhng commented 7 years ago

Note that the mean and std statistics in BatchNormalization modules are not saved in your method.
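If you go the weights-only route, the batch-norm running statistics would have to be collected separately. A sketch (untested); the field names running_mean and running_var are what recent nn versions use (older versions stored running_std instead), and `statsPath` is an assumed file name:

```lua
-- Collect BatchNormalization running statistics alongside the weights,
-- since they are buffers, not parameters, and getParameters() skips them
local bnStats = {}
for _, m in ipairs(model:findModules('nn.SpatialBatchNormalization')) do
    table.insert(bnStats, {
        mean = m.running_mean:float(),
        var  = m.running_var and m.running_var:float() or nil,
    })
end
torch.save(statsPath, bnStats)
```

On restore, you would walk the rebuilt model with findModules in the same order and copy each entry back into the corresponding module.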

tradie commented 7 years ago

The only simple and fast way I found was just copying the weights to a memory-mapped file:

torch.Storage("weights.bin", true, w:size(1)):copy(w:storage())

Then I make a copy of the optimizer state without the tensors in it, save that with torch.save(), and finally save the state tensors (e.g. m, v, and denom for Adam) the same way as the weights. (I haven't used batch normalization with it, so that may need some more state to be saved.)

It's very fast, it doesn't waste memory, and it's a shame something like this isn't already built in.
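Spelled out, the memory-mapped round trip might look like this (a sketch, untested; it assumes `w` is the flat parameter tensor from model:getParameters(), copied to host memory first if the model lives on the GPU):

```lua
local weights, gradParams = model:getParameters()
local w = weights:float()   -- host-side copy of the flat parameters

-- Save: torch.FloatStorage(fileName, shared, size) maps the file when
-- shared == true, so the copy writes straight through to disk
torch.FloatStorage("weights.bin", true, w:size(1)):copy(w:storage())

-- Load: map the file again and copy back into the live parameter tensor
local mapped = torch.FloatStorage("weights.bin", true, w:size(1))
weights:copy(torch.FloatTensor(mapped))
```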