rcmccartney / parallel_regression

Parallel regression for machine learning on 2-class classification problems

Question re gamma choice #1

Open KirkHadley opened 9 years ago

KirkHadley commented 9 years ago

I just implemented this in Python (for a neural net, so it's slightly different, and it looks like it requires more inner-loop passes for EMSO-G/CD than usual) and was wondering how you came to 100 as the default value for gamma? In the paper they said they searched over a massive range, and I was wondering whether you'd replicated their results or if the choice was arbitrary.

Thanks, Kirk

rcmccartney commented 9 years ago

So in the paper I believe they searched from 10^0 to 10^5 to find the optimal value of gamma. For the default, I just split the difference and went with 10^2, pretty much arbitrarily. You can see that the program takes quite a few command-line arguments for each run, so the default values are just there to let me focus on one parameter at a time, for instance leaving gamma alone while I optimize alpha. We were using the URL data that I just updated the README about, and the default value seemed to work nicely there, but it could use some more experimentation. In theory, gamma should change over time, which is something I'd like to add in the future; the downside is yet more parameters to tune...
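To make the "gamma changing over time" idea concrete, this is roughly the kind of schedule I have in mind (purely a sketch, none of this is in the repo; the names and the growth rule are made up):

```python
# Hypothetical sketch only -- not code from this repo.
# Gamma starts at the current default (100) and grows geometrically toward an
# upper bound, staying inside the 10^0..10^5 range searched in the paper.
def gamma_schedule(iteration, gamma0=100.0, growth=1.05, gamma_max=1e5):
    """Return the conservativeness penalty for a given outer iteration."""
    return min(gamma0 * growth ** iteration, gamma_max)

# e.g. gamma_schedule(0) == 100.0, saturating at 1e5 later in training
```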

KirkHadley commented 9 years ago

Ah, ok, thanks for explaining. I too assumed one would want some method for increasing gamma over iterations, but I ended up fixing it in my implementation, out of the same desire to limit the number of hyper-parameters to tune. Thinking about it more, though, I wonder whether such a strategy would actually bring much benefit if you use standard momentum/learning-rate decay in your SGD. Decreasing your learning rate over training already constrains the size of an update as you (ideally) approach a global minimum, and gamma's purpose is just to prevent large updates from a mini-batch, so wouldn't the two be redundant?

Also, what are your thoughts on EMSO-GD for neural nets?

rcmccartney commented 9 years ago

That's a really good point. Since they multiply the gamma term in (10) by the decaying learning rate, I'd agree that would probably be sufficient to ensure good convergence. I suppose you could come up with counter-examples where you don't want the alpha term and the batch-update term to decay at the same rate for non-linear relationships, but in practice I think you'd be introducing more problems and complexity than the payoff would be worth.
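Concretely, the step I'm picturing is roughly this (a sketch with made-up names, not a transcription of (10)):

```python
import numpy as np

# Minimal sketch of a conservative mini-batch step (hypothetical names).
# Both the gradient and the gamma penalty get scaled by the decaying rate eta_t.
def conservative_step(w, w_start, grad, t, gamma=100.0, eta0=0.1, decay=0.01):
    eta_t = eta0 / (1.0 + decay * t)                  # simple 1/t-style decay
    return w - eta_t * (grad + gamma * (w - w_start)) # pull back toward batch-start weights
```

Since eta_t scales the whole correction, penalty included, shrinking eta_t already does a lot of the work gamma is there for, which is basically your redundancy point.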

You said you were using Python, right? Are you using the multiprocessing module or something else? I think it depends on what type of parallel program you are running. For me, I had a really large weight vector (~3 million) and nodes connected over a local network, so I found EMSO really helpful for reducing the amount of traffic over the sockets; that inter-node communication was really the limiting factor in convergence time. If you are using multi-threading on a single node, EMSO might not bring a huge improvement in speed or accuracy, since you can synchronize between processes without much overhead.

So, to make sure I understand: you are running batches of data through the network and accumulating the output gradient along with the EMSO batch penalty term at the output layer? Then are you backpropagating a portion of the EMSO penalty to each hidden-layer node? And if you are doing this in parallel on different nodes with different batches, are you then averaging the updated weights from all the nodes into new global weights before starting on a new batch? That might not be what you did; I'm just curious about your method.

It sounds like a really interesting project, but I would also try implementing a stochastic training algorithm to compare against the batch version. You may find it converges just as fast or faster without the extra complexity.
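For what it's worth, here's a toy single-machine sketch of the sync pattern I'm asking about (placeholder names and a plain least-squares gradient just for illustration; it's not this repo's code): each worker refines a copy of the global weights on its own batch, and the master averages the results into new global weights for the next round.

```python
import numpy as np
from multiprocessing import Pool

def local_refine(args):
    """One worker: refine a copy of the global weights on its local batch."""
    w_global, X, y, eta, gamma, inner_passes = args
    w = w_global.copy()
    for _ in range(inner_passes):
        grad = X.T @ (X @ w - y) / len(y)           # least-squares gradient, for illustration only
        w -= eta * (grad + gamma * (w - w_global))  # conservative pull back toward the global weights
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_workers, rounds = 10, 4, 20
    w_global = np.zeros(dim)
    batches = [(rng.normal(size=(64, dim)), rng.normal(size=64)) for _ in range(n_workers)]
    with Pool(n_workers) as pool:
        for _ in range(rounds):
            jobs = [(w_global, X, y, 0.001, 100.0, 5) for X, y in batches]
            w_global = np.mean(pool.map(local_refine, jobs), axis=0)  # average into new global weights
```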

KirkHadley commented 9 years ago

Actually, right after I sent my previous comment, I started wondering if there might be value in changing gamma independently across nodes, i.e. altering the gamma value based on the output specific to that instance, so that if you had 12 nodes and node 2 was consistently producing erratic weight updates, you'd give it a larger gamma value. I'm not sure what would cause such a situation to arise, though, so I might be solving a non-existent problem.

As for my implementation, yes, I'm using Python with multiprocessing. Right now I'm running EMSO on a single node for dev/testing (I'll be deploying to a distributed environment in the nearish future), but it's still converging faster and more accurately than vanilla mini-batch, traditional parallelized mini-batch (i.e. parallel mini-batches with a shared update step, but without the conservative-subproblem idea), regular SGD, and a couple of other neural-net-specific parallelizations.

There have been a few ideas about how to parallelize neural networks, which tend to focus on splitting training by layers (not effective), by nodes (too much communication), or by vertically splitting the network in half (an odd idea that also fails with recurrent nets). In my specific case, I'm using a deep LSTM bookended by deep projection and output layers. It's similar to what Bengio and others talk about here [0], but with an LSTM in the middle instead of a vanilla RNN. Given the depth and complexity of the network, I got all kinds of erratic updates when I tried to parallelize training by splitting the network itself. I also dislike data-parallelization methods, because why train a bunch of models if you don't need to?

I'll use a simpler 2-hidden-layer feed-forward model as an example to answer your questions. I average weights over the entire neural net, not per layer: I forward-activate and backprop on each node separately, then average the weights across all layers after each iteration. For the w_{t-1} in the second component of (4), I use the previous global weights, which I think constrains things better. Across multiple nodes I essentially use this same method, with different batches on different nodes. The biggest differences between my method and Smola's are that I use many more than 5 passes on each batch, and that I use the global w_{t-1} instead of the local one.
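In rough outline, one outer round in my setup looks like this (simplified to a single flat weight vector; backprop_grad is a stand-in for the forward/backward pass through the net, not my actual code):

```python
import numpy as np

def emso_round(w_global, batches, backprop_grad, eta, gamma, inner_passes=20):
    """One outer round, sketched: each worker refines a copy of w_global on its
    own batch, penalized toward the *global* w_{t-1}, then everything is averaged."""
    refined = []
    for batch in batches:                  # in reality these run on separate processes/nodes
        w = w_global.copy()
        for _ in range(inner_passes):      # many more inner passes than the paper's ~5
            grad = backprop_grad(w, batch)              # forward pass + backprop over the whole net
            w -= eta * (grad + gamma * (w - w_global))  # penalty uses the global weights, not local ones
        refined.append(w)
    return np.mean(refined, axis=0)        # average over the entire net, not per layer
```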