johnmyleswhite opened this issue 12 years ago
Two ideas:
I'm guessing the problem is just that plain SGD is a bit finicky. I need to do some more reading, since I know there are tricks for getting better results. I also need to think more about whether the standard convergence results actually guarantee recovering the right coefficients or only guarantee making the right predictions. And I still need to let the MSD example run longer to see how well it does when it's left running for ten or twenty passes through the data.
After letting the real example run for multiple passes, it's clear that it is converging -- just very, very slowly. Initializing the intercept term to the mean of the data set speeds things up considerably, but we're still looking at many passes through the data to hit the global optimum. I'll see how easily I can get second-order SGD to work. If it's hard, I say we just use this as an illustrative example.
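For concreteness, here is a minimal sketch of the kind of setup I'm describing -- plain single-row SGD for a squared-error linear model with the intercept initialized to the mean of the targets. This is my own illustration, not the code in this repo, and all the names are mine:

```python
# Sketch: single-row SGD for linear regression, intercept started at mean(y).
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, n_passes=20, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    b = y.mean()                         # initialize the intercept at the mean of y
    for _ in range(n_passes):
        for i in rng.permutation(n):
            err = (X[i] @ w + b) - y[i]  # residual for one row
            w -= lr * err * X[i]         # gradient step on the weights
            b -= lr * err                # gradient step on the intercept
    return w, b

# Toy check: recover known coefficients from simulated data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=1000)
print(sgd_linear_regression(X, y))
```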
Aggregating batches is totally reasonable and is something that might be advisable for pragmatic users, although it will probably freak out academic users who will ask for a proof of convergence. Still worth considering, since I'm sure it will give much more robust results.
If each batch is supposed to converge, then surely the median of many batches converges more robustly, no?
I think that's probably right, but there are some subtleties in the argument I don't have time to work out right now since this is just a simple example for a tutorial.
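Just so we're talking about the same thing, here is a sketch of the aggregation idea -- fit each batch separately, then take the coordinate-wise median of the per-batch estimates. This is only an illustration (it uses a least-squares fit per batch, where in practice each batch would be fit by SGD), not code from the repo:

```python
# Sketch: coordinate-wise median of per-batch coefficient estimates.
import numpy as np

def median_of_batches(X, y, n_batches=10):
    estimates = []
    for Xb, yb in zip(np.array_split(X, n_batches), np.array_split(y, n_batches)):
        # Stand-in per-batch fit; in practice this would be the SGD fit.
        coef, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
        estimates.append(coef)
    # Median across batches is more robust to a few badly-behaved batches.
    return np.median(np.vstack(estimates), axis=0)
```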
Having done tests, I think the problem is something else. If you initialize the model to the global optimum as reported by R and then let it run, it consistently moves away to an alternative position, which is also the position it converges to if you initialize it with a reasonable starting set of parameters. I need to figure out whether I corrupted the gradient or whether there's some deeper reason the SGD linear model prefers that alternative set of parameters. Unless the columns of the input matrix are excessively correlated, the optimization problem should be strictly convex, so it should not have a local optimum other than the global optimum.
I will track the problem down today.
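One quick way to rule out a corrupted gradient is a finite-difference check: compare the analytic squared-error gradient against a numerical approximation. This is just a debugging aid I'd write for myself, not anything in the project:

```python
# Sketch: finite-difference check of the squared-error gradient.
import numpy as np

def analytic_gradient(w, X, y):
    # Gradient of 0.5 * mean((X w - y)^2).
    return X.T @ (X @ w - y) / len(y)

def numerical_gradient(w, X, y, eps=1e-6):
    loss = lambda v: 0.5 * np.mean((X @ v - y) ** 2)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        grad[j] = (loss(w_plus) - loss(w_minus)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ rng.normal(size=4) + rng.normal(scale=0.1, size=200)
w = rng.normal(size=4)
# Should be tiny if the analytic gradient is correct.
print(np.max(np.abs(analytic_gradient(w, X, y) - numerical_gradient(w, X, y))))
```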
Ok, cool. Let me know if there's anything else I can/should do here.
Having read a bunch of papers this morning, I've made the following decisions:
There's still something very quirky about how the system moves when it gets near the optimum. I'll spend the rest of this week debugging it, but I think we should switch gears to discussing what else needs to go into the tutorial.
Sounds good.
Looking at the toy example, it's clear that the model converges as well as it can very quickly. We should set up a way to have the SGD algorithm run until convergence, which may take less than a full pass or may take many passes.
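Something like the following is what I have in mind for the stopping rule -- check how much the parameters have moved every so many updates, so convergence can be declared partway through a pass or only after many passes. Again, this is a sketch with made-up names, not a proposal for the exact API:

```python
# Sketch: run single-row SGD until the parameters stop moving.
import numpy as np

def sgd_until_converged(X, y, lr=0.01, tol=1e-8, check_every=1000,
                        max_passes=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    w_checkpoint = w.copy()
    updates = 0
    for _ in range(max_passes):
        for i in rng.permutation(n):
            err = X[i] @ w - y[i]
            w = w - lr * err * X[i]
            updates += 1
            # Check movement every `check_every` updates, so convergence can
            # happen mid-pass or only after many passes over the data.
            if updates % check_every == 0:
                if np.max(np.abs(w - w_checkpoint)) < tol:
                    return w, updates
                w_checkpoint = w.copy()
    return w, updates
```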
There's also a deeper issue, which is that the model doesn't really converge properly without processing at least one large minibatch: using single rows at a time, it just reaches a neighborhood of the maximum likelihood estimate and wanders. You can see this very clearly in the toy-data logging results: the parameters keep shifting around the MLE, but their center is exactly the MLE.
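Since the iterates wander around the MLE but are centered on it, averaging the iterates (Polyak–Ruppert style) should settle down near the MLE even when the raw iterate keeps moving. A sketch of what I mean, not code from the repo:

```python
# Sketch: single-row SGD with a running average of the iterates.
import numpy as np

def averaged_sgd(X, y, lr=0.01, n_passes=5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    w_avg = np.zeros(p)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            err = X[i] @ w - y[i]
            w = w - lr * err * X[i]
            t += 1
            w_avg += (w - w_avg) / t   # running mean of the iterates
    # The raw iterate keeps wandering; the averaged iterate should be much
    # closer to the MLE.
    return w, w_avg
```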