johnmyleswhite / DG2012Tutorial.jl

Simple examples of SGD-style computations in Julia

Convergence Diagnostics #5

Open johnmyleswhite opened 12 years ago

johnmyleswhite commented 12 years ago

Looking at the toy example, it's clear that the model converges as well as it can very quickly. We should set up a way to have the SGD algorithm run until convergence, which may take less than a full pass or may take many passes.

There's also a deeper issue, which is that the model doesn't fully converge without batch processing of at least one large minibatch: using single rows at a time, it just reaches a neighborhood of the Maximum Likelihood Estimate and wanders. You can see this very clearly in the toy data logging results: the parameters shift around the MLE over and over again, but they stay centered exactly on the MLE.
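For the first point, a minimal convergence loop might look like the sketch below: keep making passes and stop once the coefficients stop moving. The `sgd_pass!` function and the `coefficients` field are placeholders, not the package's actual API.

```julia
# Hypothetical convergence loop: keep running SGD passes until the largest
# coefficient update falls below a tolerance. `sgd_pass!` and `coefficients`
# are assumed names standing in for whatever the tutorial code actually uses.
function fit_until_converged!(model, rows; tol = 1e-6, max_passes = 100)
    previous = copy(model.coefficients)
    for pass in 1:max_passes
        sgd_pass!(model, rows)                        # one pass over the data
        movement = maximum(abs.(model.coefficients .- previous))
        movement < tol && return pass                 # converged: report passes used
        previous .= model.coefficients
    end
    return max_passes                                 # hit the pass limit without converging
end
```

Note that a tolerance on coefficient movement alone won't flag the wandering behavior described above; with a constant step size the updates never shrink, so the loop would simply run to the pass limit.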

StefanKarpinski commented 12 years ago

Two ideas:

  1. We could use a meta-algorithm (e.g. run in batches and take the median of results for the batches).
  2. Would randomization of the stream help? Since we talked about that issue the other day, I was thinking about doing random seeking in the file to get better randomization. One approach would be to pick N random locations in the file, fast-forward each of them to the start of a record, and then round-robin or randomly pull elements from each (see the sketch below). That way you're interleaving data from multiple points in the file. It's not perfectly randomized, but it's pretty decent.
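A rough sketch of the seeking idea in point 2, assuming one record per line. None of this is in the repo, and because every cursor runs to end-of-file, regions late in the file get read more than once, so it's an approximation rather than a true shuffle.

```julia
# Sketch of the random-seek interleave: open N cursors at random byte offsets,
# skip ahead to the next record boundary, then round-robin lines across them.
# Assumes one record per line; not part of the repo's current code.
function interleaved_lines(path::AbstractString, n::Int)
    len = filesize(path)
    cursors = [open(path) for _ in 1:n]
    for io in cursors
        seek(io, rand(0:len - 1))       # jump to a random byte in the file
        readline(io)                    # discard the partial record we landed in
    end
    lines = String[]
    while any(!eof, cursors)
        for io in cursors               # round-robin across the cursors
            eof(io) || push!(lines, readline(io))
        end
    end
    foreach(close, cursors)
    return lines
end
```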
johnmyleswhite commented 12 years ago

  1. Aggregating batches is totally reasonable and is something that might be advisable for pragmatic users, although it will probably freak out academic users who will ask for a proof of convergence. Still worth considering, since I'm sure it will give much more robust results.
  2. Randomization of the stream should help a bit, but the toy example is fully randomized already, so randomization isn't sufficient to make up for the absence of richer estimates of the gradient or access to information about the curvature of the function.

I'm guessing the problem is just that SGD is a little hacky. I need to do some more reading, since I know there are hacks to get better results. I need to think more about whether the standard results actually promise reaching the right coefficients or whether they just promise reaching the right predictions. I also still need to let the MSD example run longer to see how well it does when it's left running for ten or twenty passes through the data.

johnmyleswhite commented 12 years ago

After letting the real example run for multiple passes, it's clear that it is converging -- just very, very slowly. Initializing the intercept term to the mean of the data set speeds things up considerably, but we're still looking at many passes through the data to hit the global optimum. I'll see how easily I can get second-order SGD to work. If it's hard, I say we just use this as an illustrative example.

StefanKarpinski commented 12 years ago

> Aggregating batches is totally reasonable and is something that might be advisable for pragmatic users, although it will probably freak out academic users who will ask for a proof of convergence. Still worth considering, since I'm sure it will give much more robust results.

If each batch is supposed to converge, then surely the median of many batches converges more robustly, no?
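If it's worth prototyping, one quick way to test the median-of-batches idea is to split the rows, fit each batch independently, and take a coordinatewise median of the fitted coefficients. In the sketch below, `fit_sgd` is a stand-in for whatever single-batch fitting routine the tutorial ends up exposing.

```julia
using Random, Statistics

# Hedged sketch of the meta-algorithm: fit SGD separately on several batches,
# then combine the fits with a coordinatewise median of the coefficients.
# `fit_sgd` is a placeholder that should return one coefficient vector per batch.
function median_of_batches(fit_sgd, rows; n_batches = 10)
    shuffled = shuffle(rows)
    batches = [shuffled[i:n_batches:end] for i in 1:n_batches]   # interleaved split
    coefs = [fit_sgd(batch) for batch in batches]
    return [median(getindex.(coefs, j)) for j in eachindex(first(coefs))]
end
```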

johnmyleswhite commented 12 years ago

I think that's probably right, but there are some subtleties in the argument I don't have time to work out right now since this is just a simple example for a tutorial.

Having run some tests, I think the problem is something else. If you initialize the model at the global optimum as reported by R and then let it run, it robustly moves away toward an alternative position, which is also the position it converges to if you initialize it with a reasonable starting set of parameters. I need to figure out whether I corrupted the gradient or whether there's some deeper reason the SGD linear model prefers that alternative set of parameters. Unless the columns of the input matrix are excessively correlated, the optimization problem should be strictly convex and should not have a local optimum that isn't the global optimum.
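One way to rule out a corrupted gradient is to compare the analytic gradient against a central finite difference at a random point. The sketch below assumes a squared-error loss on a single row, which may not match the repo's exact objective.

```julia
using LinearAlgebra

# Hedged gradient check for one row of a squared-error linear model.
# The loss here is an assumption; swap in the actual objective if it differs.
function check_gradient(x::Vector{Float64}, y::Float64; h = 1e-6)
    loss(beta) = 0.5 * (dot(x, beta) - y)^2
    grad(beta) = (dot(x, beta) - y) * x              # analytic gradient of the loss above
    beta = randn(length(x))
    g = grad(beta)
    for j in eachindex(beta)
        e = zeros(length(x)); e[j] = h               # perturb one coordinate at a time
        numeric = (loss(beta + e) - loss(beta - e)) / (2h)
        abs(numeric - g[j]) > 1e-4 && @warn "gradient mismatch at coordinate $j"
    end
end
```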

I will track the problem down today.

StefanKarpinski commented 12 years ago

Ok, cool. Let me know if there's anything else I can/should do here.

johnmyleswhite commented 12 years ago

Having read a bunch of papers this morning, I've made the following decisions:

  1. People need to set the intercept by making a full pass through the data and computing the mean. I'll just add this as a function that has to be called before calling `fit`.
  2. We get much better convergence if we reset the model weights to their average over a recent window at regular intervals. In the literature this is called, simply enough, averaged SGD. A rough sketch of both ideas follows below.
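Here is that sketch, with `sgd_update!` and the `coefficients` field as placeholders for the repo's actual API, and the intercept assumed to be the first coefficient.

```julia
using Statistics

# Decision 1 (assumption: the intercept is coefficient 1): one full pass to set it.
function initialize_intercept!(model, ys)
    model.coefficients[1] = mean(ys)
    return model
end

# Decision 2, averaged SGD: periodically reset the weights to their average
# over a recent window of single-row updates. `sgd_update!` is a placeholder.
function averaged_sgd!(model, rows; window = 100)
    recent = Vector{Vector{Float64}}()
    for (i, row) in enumerate(rows)
        sgd_update!(model, row)                      # ordinary single-row SGD step
        push!(recent, copy(model.coefficients))
        if i % window == 0
            model.coefficients .= mean(recent)       # reset to the windowed average
            empty!(recent)
        end
    end
    return model
end
```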

There's still something very quirky about how the system moves when it gets near the optimum. I'll spend the rest of this week debugging it, but I think we should switch gears to discussing what else needs to go into the tutorial.

StefanKarpinski commented 12 years ago

Sounds good.