pluskid / Mocha.jl

Deep Learning framework for Julia

Interfacing to Mocha's backpropagation algorithm #98

Open vollmersj opened 9 years ago

vollmersj commented 9 years ago

Mocha is a really nice project and has backpropagation implemented for many different layer types and neurons. However, what is the best way to interface with it so as to obtain one large parameter vector? Is there a way around copying pieces to every blob? Currently, I am using copy!(net.states[i].parameters[j].blob, slice), where slice is a slice of my big parameter vector.

This can be put together into a function that performs the backpropagation, given an array NNInds containing the indices of the corresponding slices.

Copying the memory will produce some overhead; this should not matter for large networks, but there must be a better way.

# Backpropagation through the (global) Mocha net, taking the parameters
# as one flat vector para and returning (loss, flat gradient).
# NNInds[i-1][j] holds the [start, end] indices of layer i, parameter j.
function evaluateNN(para, nPara, NNInds)
    # copy the relevant slice of para into each parameter blob
    for i = 2:(length(net.states)-1)
        for j = 1:length(net.states[i].parameters)
            copy!(net.states[i].parameters[j].blob,
                  para[NNInds[i-1][j][1] : NNInds[i-1][j][2]])
        end
    end
    # forward pass gives the (regularized) loss, backward pass fills the gradients
    val = forward(net, solver.params.regu_coef)
    backward(net, solver.params.regu_coef)
    # assemble the flat gradient from the per-blob gradient blobs
    gradient = zeros(nPara)
    for i = 2:(length(net.states)-1)
        for j = 1:length(net.states[i].parameters)
            gradient[NNInds[i-1][j][1] : NNInds[i-1][j][2]] =
                net.states[i].parameters[j].gradient.data[:]
        end
    end
    return (val, gradient)
end
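
For completeness, here is a rough sketch of how the NNInds index ranges and nPara could be built from the network itself. This is only an illustration and assumes the CPU backend, where each parameter blob stores its values in a plain data array (the same field used above):

# Build [start, end] index pairs for every parameter blob of every
# learnable layer, so one flat vector can be sliced back into per-blob
# chunks.  Sketch only; assumes CPU blobs with a plain .data array.
function buildNNInds(net)
    NNInds = Vector{Vector{Int}}[]
    offset = 0
    for i = 2:(length(net.states)-1)      # same layer range as above
        layer_inds = Vector{Int}[]
        for j = 1:length(net.states[i].parameters)
            n = length(net.states[i].parameters[j].blob.data)
            push!(layer_inds, [offset + 1, offset + n])
            offset += n
        end
        push!(NNInds, layer_inds)
    end
    return NNInds, offset                 # offset equals nPara
end

With this, evaluateNN(para, nPara, NNInds) can be called on any flat vector para of length nPara.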
pluskid commented 9 years ago

Unfortunately there is no better way, because for obvious reasons the gradients are stored separately. I'm quite curious, though: why do you want to get one huge flat vector? If you really want that, an easier way is to flatten each Param and then concatenate them all. The memory overhead cannot be avoided, though.
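
As a minimal sketch of that flatten-and-concatenate approach (CPU backend assumed, reusing the blob .data field from the snippet above; the Float64 element type is just for illustration):

# Flatten every parameter blob and concatenate into one vector.
# This copies the memory, as noted; sketch only, CPU backend assumed.
function flattenParams(net)
    chunks = Vector{Float64}[]            # element type assumed for illustration
    for i = 2:(length(net.states)-1)
        for j = 1:length(net.states[i].parameters)
            push!(chunks, vec(net.states[i].parameters[j].blob.data))
        end
    end
    return vcat(chunks...)
end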

vollmersj commented 9 years ago

Thank you for your response. There might be a way around the memory overhead by using pointers:

x = zeros(8)
# p is a 3x2 array that shares memory with x, starting at element 3 (no copy)
p = pointer_to_array(pointer(x, 3), (3,2))
p[:,1] = 100.0
p[:,2] = 200.0
@show x    # => [0.0, 0.0, 100.0, 100.0, 100.0, 200.0, 200.0, 200.0]

Initialising the layer blobs would then require picking an appropriate chunk out of the memory. Would this be possible?
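
Purely as a hypothetical sketch of that idea (CPU backend only; it assumes the blob's data field can simply be reassigned to a view of the flat vector, which may or may not hold for Mocha's actual blob types):

# Hypothetical: back every parameter blob by a chunk of one flat
# vector, so the copies in evaluateNN become unnecessary.
# Assumes CPU blobs whose .data field can be reassigned -- not verified.
para = zeros(nPara)
for i = 2:(length(net.states)-1)
    for j = 1:length(net.states[i].parameters)
        lo = NNInds[i-1][j][1]
        shape = size(net.states[i].parameters[j].blob.data)
        chunk = pointer_to_array(pointer(para, lo), shape)
        net.states[i].parameters[j].blob.data = chunk   # hypothetical reassignment
    end
end

Whether the rest of the library would keep such a shared array intact is part of the question.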

Having one parameter vector makes it easier to try different tuning algorithms.

pluskid commented 9 years ago

Yes, this is technically possible for the CPU backend only. Though I doubt it will be a serious issue, because nowadays CPU memory is very large. If you have huge models, the bottleneck will then become the computation, especially when using a CPU backend. For the GPU backend, the memory lives on the GPU device and cannot be directly shared with the CPU.
