Jason,
Yeah, interesting remark. Since I only use optim, I've never really paid attention to the built-in updateParameters(). This difference should probably be made clearer somewhere.
I find the optim approach to be much more general, and not so confusing in the end: requiring both vectors to have exactly the same size makes things quite simple. What do you think?
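For reference, here is a minimal sketch of the optim-style loop being discussed; the toy model, data, and hyperparameters are made up purely for illustration:

```lua
require 'nn'
require 'optim'

local model = nn.Sequential():add(nn.Linear(10, 1))
local criterion = nn.MSECriterion()

-- getParameters() flattens all parameters and all gradients into two vectors
-- of exactly the same size; optim only ever sees these two flat vectors.
local params, gradParams = model:getParameters()

local input, target = torch.randn(10), torch.randn(1)

local feval = function(x)
   if x ~= params then params:copy(x) end
   gradParams:zero()
   local output = model:forward(input)
   local loss = criterion:forward(output, target)
   model:backward(input, criterion:backward(output, target))
   return loss, gradParams
end

optim.sgd(feval, params, {learningRate = 0.01})
```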
The main problem with the optim approach is that it breaks the module-based abstraction barrier that is the basis of the nn package. The code must know how parameters are stored across modules. It should be easy enough to modify updateParameters so that it flattens the parameters (only if they haven't been flattened already) and operates on the flattened parameters like in optim.

However, you'll get into trouble if you do something unexpected but legal, such as including a single module in more than one network. Using the optim approach, each time you want to update the parameters of a module, you need to flatten the associated network, which allocates new storage and points the modules to it. A single module can't be updated based upon two networks, since the most recent network will steal the representation of the shared module, whereas the older networks will continue to update the old storages.
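A rough sketch of the multiple-networks pitfall described above (a hypothetical toy setup, not code from the package):

```lua
require 'nn'

local shared = nn.Linear(5, 5)
local netA = nn.Sequential():add(shared):add(nn.Linear(5, 2))
local netB = nn.Sequential():add(shared):add(nn.Linear(5, 3))

-- Flattening netA allocates one contiguous storage and repoints shared's
-- weight and bias tensors into it.
local paramsA, gradParamsA = netA:getParameters()

-- Flattening netB allocates *another* storage and repoints shared's tensors
-- again; paramsA keeps the old storage, so updates written through paramsA
-- no longer reach the parameters that shared (and hence netA) actually uses.
local paramsB, gradParamsB = netB:getParameters()
```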
I don't see any real advantage to allowing optim to directly modify the parameters, unless a single set of parameters is shared across a huge set of modules, in which case traversing the module graph would be time consuming (although the graph presumably must be traversed during updateOutput, so the savings is only a constant factor).

So long as sharing is across a constant number of modules (e.g. tens or hundreds, rather than millions), I would think the cleaner abstraction would be to provide access methods in the modules for updating the parameters (e.g. addScaledGradientToParameters(scaleFactor)). This way, the modules own and control their own parameters, rather than having to carefully coordinate ownership between the modules and optim.

On the other hand, I think it might make more sense to make the sharing procedure explicitly asymmetric: designate one primary module which actually owns the parameters, share the gradParameters as well, and change updateParameters so that only the primary module updates its parameters based upon the shared gradient. This facilitates second-order and other nonlinear methods in optim, and was presumably one of the prime motivations for the current optim implementation. The parameter tensor and gradParameter tensor are restricted to have the same size, which I agree is more intuitive, but they are stored within the primary module, rather than in optim.
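To make the proposal concrete, here is a hypothetical sketch of such an access method; addScaledGradientToParameters is not part of nn, it is simply the name suggested above:

```lua
-- Hypothetical per-module accessor: each module applies a scaled version of
-- its own gradient to the parameters it owns, without exposing their storage.
function nn.Module:addScaledGradientToParameters(scaleFactor)
   local params, gradParams = self:parameters()
   if params then
      for i = 1, #params do
         params[i]:add(scaleFactor, gradParams[i])
      end
   end
end

-- e.g. a plain gradient-descent step would be
-- module:addScaledGradientToParameters(-learningRate)
```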
The procedure required to share parameters between modules differs depending on whether you use the standard Module:updateParameters() or the updates of the optim package.
Without optim, only the parameters should be shared; the gradients should remain independent. If the gradients are shared as well, the gradient tensor seen by each copy of the parameters contains the sum of the gradients over all of the copies. The standard updateParameters() then applies that full summed gradient once per copy, and since every copy of the shared parameters is updated separately, the effective update of the shared parameters is scaled by the number of copies. The gradients must be accumulated across the shared copies exactly once, in the parameters themselves (each copy adding only its own gradient), rather than first in the gradients and then again in the parameters.
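A sketch of that non-optim setup (toy sizes, hypothetical network): share only 'weight' and 'bias', and leave gradWeight/gradBias independent per copy:

```lua
require 'nn'

local first  = nn.Linear(4, 4)
local second = nn.Linear(4, 4):share(first, 'weight', 'bias')  -- gradients not shared

local net = nn.Sequential():add(first):add(nn.Tanh()):add(second)

local input = torch.randn(4)
local output = net:forward(input)
net:zeroGradParameters()
net:backward(input, torch.randn(4))  -- some gradient w.r.t. the output

-- Each copy now adds -lr times its *own* gradient into the single shared
-- weight storage, so the total gradient is applied to it exactly once.
net:updateParameters(0.01)
```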
If you use the optim package, the parameters and gradParameters are first flattened, and in the process shared parameters are coalesced into a single copy. The gradients must be shared as well, both so that the flattened parameters and gradParameters have matching sizes, and because each set of coalesced shared parameters is updated only once; the accumulation across the shared copies must therefore be done in the gradients rather than in the parameters.
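The corresponding sketch for the optim route (same toy setup): share the gradients as well, so that flattening coalesces both the parameters and their gradients:

```lua
require 'nn'
require 'optim'

local first  = nn.Linear(4, 4)
local second = nn.Linear(4, 4):share(first, 'weight', 'bias', 'gradWeight', 'gradBias')

local net = nn.Sequential():add(first):add(nn.Tanh()):add(second)

-- The shared copies are coalesced into a single slice of these flat vectors,
-- so an optim update (e.g. optim.sgd) touches that slice exactly once.
local params, gradParams = net:getParameters()
```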
This difference is not obvious, and does not appear during Jacobian unit-testing, which does not use the optim framework.