msmbuilder / msmbuilder-legacy

Legacy release of MSMBuilder
http://msmbuilder.org
GNU General Public License v2.0

MLE getting stuck for hours #16

Closed rmcgibbo closed 11 years ago

rmcgibbo commented 12 years ago

Anyone seeing this type of thing when running the MLE?

Log-Likelihood after 12145 function evaluations: -2623598.26493
Log-Likelihood after 12178 function evaluations: -2623598.26493
Log-Likelihood after 12211 function evaluations: -2623598.26493
Log-Likelihood after 12244 function evaluations: -2623598.26493
Log-Likelihood after 12277 function evaluations: -2623598.26493
Log-Likelihood after 12310 function evaluations: -2623598.26493
Log-Likelihood after 12343 function evaluations: -2623598.26493
Log-Likelihood after 12376 function evaluations: -2623598.26493
Log-Likelihood after 12409 function evaluations: -2623598.26493
Log-Likelihood after 12442 function evaluations: -2623598.26493
Log-Likelihood after 12475 function evaluations: -2623598.26493

I'm getting this type of printout a lot. Maybe the convergence criterion is set too high?

kyleabeauchamp commented 12 years ago

To maintain the same error rate, I had to increase the convergence criterion when I changed from my iterative MLE to Lutz's minimization MLE. Perhaps we were better off keeping my code...


rmcgibbo commented 12 years ago

Maybe we can bring Lutz in on this?

I haven't looked at the code closely, but I can instrument it a bit more to see where this is coming from. The first step is just printing more digits of that log-likelihood number, to see what kind of fluctuations we're looking at.

For instance, if the optimizer is trapped on some kind of cyclic thing -- which I think can happen in these gradient optimizers -- we can add some damping.

Another possibility, though more drastic, is to try L-BFGS-B instead of the truncated Newton conjugate-gradient method.
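One way to act on the "print more digits" idea is to keep the full-precision history of the objective, so a stall or a two-cycle is visible even when the printed 5-decimal value looks frozen. A minimal sketch (hypothetical helper, not MSMBuilder's actual callback):

```python
# Hypothetical monitoring sketch, not MSMBuilder code: classify the optimizer's
# recent behavior from the full-precision history of the objective value.

def check_progress(history, tol=1e-10):
    """Return 'stalled', 'cycling', or 'progressing' from recent objective values."""
    if len(history) < 3:
        return "progressing"
    a, b, c = history[-3], history[-2], history[-1]
    if abs(c - b) < tol and abs(b - a) < tol:
        return "stalled"       # genuinely converged, or stuck at a flat spot
    if abs(c - a) < tol and abs(c - b) > tol:
        return "cycling"       # optimizer bouncing between two values
    return "progressing"

# At 5 printed decimals these all look identical, but the full-precision
# history shows a 2-cycle:
history = [-2623598.264931204, -2623598.264930913, -2623598.264931204]
print(check_progress(history))   # -> cycling
```

A real callback would append the objective at every iteration and could trigger damping or an early stop when it reports "cycling".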


kyleabeauchamp commented 12 years ago

We might want to consider switching back to my code. It was half as many lines of code and didn't have these kinds of issues...


rmcgibbo commented 12 years ago

Start by doing it in a branch? Then I can test them side by side.

schwancr commented 12 years ago

I've seen this in some lagtimes, but not all of them.

rmcgibbo commented 12 years ago

The GitHub "comment by email" seems to have cut off the other paragraphs of my response. Here's what I meant to say:


Maybe we can bring Lutz in on this?

I haven't looked at the code closely, but I can instrument it a bit more to see where this is coming from. The first step is just printing more digits of that log-likelihood number, to see what kind of fluctuations we're looking at.

For instance, if the optimizer is trapped on some kind of cyclic thing -- which I think can happen in these gradient optimizers -- we can add some damping.

Another possibility, though more drastic, is to try L-BFGS-B instead of the truncated Newton conjugate-gradient method.

kyleabeauchamp commented 12 years ago

So I recently wrote some entropy maximization code. I played around with several different implementations of the objective function / constraints (e.g. normalization) and found that the following gave fast and robust convergence:

  1. Instead of working with populations and a constrained problem, I worked with non-normalized log populations. That is, p = exp(u) / Z, where u is a free energy. The addition of Z leads to a few extra chain rule terms in the derivatives, but they weren't too bad to work with.
  2. I used fmin_l_bfgs_b without any bounds or constraints.

I think this might be the way to go for the MLE estimator, but I haven't actually implemented it.
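The core of the reparametrization in step 1 can be sketched in a few lines. This is an illustrative demo (not MSMBuilder or the maxent code; all names are hypothetical): writing p = exp(u)/Z turns a constrained problem over populations into an unconstrained one over u, and the normalization Z is exactly what produces the extra chain-rule term in the gradient.

```python
import numpy as np

# Illustrative sketch: with p_i = exp(u_i) / Z, the multinomial log-likelihood
# L(u) = sum_i c_i log p_i is unconstrained in u, and the chain rule through Z
# gives the analytic gradient dL/du_i = c_i - N * p_i, where N = sum(c).

def log_likelihood_and_grad(u, counts):
    u = u - u.max()                    # stabilize the exponentials
    p = np.exp(u) / np.exp(u).sum()    # p = exp(u) / Z
    ll = np.dot(counts, np.log(p))
    grad = counts - counts.sum() * p   # extra term comes from differentiating Z
    return ll, grad

counts = np.array([30.0, 50.0, 20.0])
u0 = np.zeros(3)

# Finite-difference check of the analytic gradient.
ll0, g = log_likelihood_and_grad(u0, counts)
eps = 1e-6
g_fd = np.empty(3)
for i in range(3):
    up = u0.copy()
    up[i] += eps
    g_fd[i] = (log_likelihood_and_grad(up, counts)[0] - ll0) / eps
print(np.allclose(g, g_fd, atol=1e-4))   # -> True
```

In practice one would hand the negated value and gradient of `log_likelihood_and_grad` to an unconstrained minimizer such as scipy's `fmin_l_bfgs_b`, as described in step 2 above.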

kyleabeauchamp commented 12 years ago

I do not think that using L-BFGS-B will work with the constrained minimization problem over raw populations / T_ij. The issue is that the scaling of the likelihood function always seems to cause issues with the line search. This was my experience with the maxent code, and I recall similar behaviour with the MLE stuff as well.

Note that the maxent code is not for MSM stuff, but for another project. It was written for the case of a population vector, not a normalized transition matrix. However, I think the lessons learned still apply.

kyleabeauchamp commented 11 years ago

If people want this one fixed, could they please upload an example of the slow convergence?

I might take another stab at fixing the MLE code.

rmcgibbo commented 11 years ago

What are you looking for? I think I can find you a transition matrix with fewer than 100 states -- will that work?

kyleabeauchamp commented 11 years ago

I just need any example of a count matrix that gets "stuck".


kyleabeauchamp commented 11 years ago

So I converted the MLE problem to minimize in "log space" -- that is, X_ij = (1/Z) exp(u_ij). This makes it an unbounded problem, which could have advantages.

I've got code that is object-oriented, correct, and almost as fast as the current MLE. The question is whether it is "more robust".

One argument for using my new code is that I don't do any of this "restarting" nonsense. I think that working in logspace makes that procedure unnecessary, but I can't say for sure.
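The log-space parametrization described above can be illustrated with a small sketch (hypothetical names, not the actual MSMBuilder rewrite). Here each row of the transition matrix is a softmax of free variables u, so T is automatically positive and row-normalized, and the estimation becomes unconstrained; note this sketch omits the reversibility (detailed-balance) constraint that makes the real MLE problem hard.

```python
import numpy as np

# Illustrative sketch of the log-space idea: T_ij = exp(u_ij) / Z_i with
# Z_i = sum_k exp(u_ik), so positivity and row normalization hold by
# construction and no bounds are needed on u.

def transition_matrix(u):
    """Row-wise softmax: T_ij = exp(u_ij) / sum_k exp(u_ik)."""
    u = u - u.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(u)
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(u, counts):
    """Multinomial log-likelihood sum_ij C_ij log T_ij."""
    T = transition_matrix(u)
    return np.sum(counts * np.log(T))

counts = np.array([[90.0, 10.0],
                   [20.0, 80.0]])

# Without the reversibility constraint, u_ij = log C_ij maximizes the
# likelihood, recovering the familiar row-normalized count estimate.
u_star = np.log(counts)
print(transition_matrix(u_star))   # rows ~ [0.9, 0.1] and [0.2, 0.8]
```

An unconstrained optimizer (e.g. L-BFGS-B without bounds) can then be run directly on u, which is the property that may make the "restarting" procedure unnecessary.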

kyleabeauchamp commented 11 years ago

I also have a LaTeX document that documents the calculation of the log likelihood and its gradient. If we switch to this, I think we should include a "Notes" directory in the Docs tree and reference that PDF in our docstrings. In general, I think any calculation that is non-obvious should probably have some sort of LaTeX writeup, either in the published literature or in a PDF in MSMBuilder.

kyleabeauchamp commented 11 years ago

See https://gist.github.com/3991712 for example test code.

kyleabeauchamp commented 11 years ago

Any update on test cases that are failing with the current code?

kyleabeauchamp commented 11 years ago

I found a test case that seems to converge extremely slowly. Alanine dipeptide, 1,000 ns, frames every 500 fs, 80 microstates with hybrid clustering.

It seems that the large number of counts may have something to do with the convergence issues--I think this is consistent with what others have seen.

kyleabeauchamp commented 11 years ago

So my rewrite of the MLE code does not stall on the previous failing case. My new code is also much cleaner, IMHO.

I'm going to take that as evidence that I should prepare my new code for an eventual merge.

kyleabeauchamp commented 11 years ago

I think this issue was fixed by issue #119.

Reopen if someone sees slow convergence again.