ktoddbrown / decomPower

A soil carbon model power study
MIT License

Disagreeing Results from Matrix_Exp Model #7

Open charlesm93 opened 7 years ago

charlesm93 commented 7 years ago

A while ago, Milad mentioned the parameter estimations were slightly different for the partial pooling model using the matrix exponential. I looked into it and indeed found a few discrepancies, notably for:

gamma[3]: 0.6 +/- 0.29 vs 0.51 +/- 0.23
turnover[3]: 1055 +/- 579.8 vs 7552 +/- 200000
a21[5]: 0.4 +/- 0.24 vs 0.23 +/- 0.19
a32[1]: 0.31 +/- 0.22 vs 0.53 +/- 0.26

The first value is the estimate from the hierarchical model using a stiff solver, the second from the matrix exponential solution. I loosely define a discrepancy as a case where the stiff and non-stiff solvers agree but the matrix_exp estimate doesn't. I'm not sure how bad these results are.

Ok - what could have gone wrong?

I tested the deterministic features of all three models and used the generated quantities block to simulate data for CO2_flux_hat with fixed parameters. I used the "real" values for the parameters. The results all agreed with one another (the largest absolute difference between two data points was 1e-7).
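For reference, here is roughly how that comparison can be done in RStan (a minimal sketch; the fit object names are hypothetical, and CO2_flux_hat is the generated quantity mentioned above):

```r
# Minimal sketch, assuming the stiff-solver and matrix-exponential models were
# run with fixed ("real") parameter values; fit object names are hypothetical.
library(rstan)

sim_stiff <- extract(fit_stiff, pars = "CO2_flux_hat")$CO2_flux_hat
sim_me    <- extract(fit_me,    pars = "CO2_flux_hat")$CO2_flux_hat

# Largest absolute difference between the two simulated data sets;
# in the runs described above it was about 1e-7.
max(abs(sim_stiff - sim_me))
```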

Would people like a PDF report of this? Could be good for QC.

Maybe nothing went wrong and we just got unlucky. Here's what I propose to do next:

• Run all three models on the same machine. This would allow me to look at another weird result: the non-stiff solver ran faster than the matrix exponential solver. This seems suspect. In theory the matrix_exp should be faster. This is also what I have observed in practice when doing pharmacometrics modeling. I'll also look at effective independent samples per unit time, rather than just run time (note: this will have to be done for each parameter).

• It might be worth doing a more thorough comparison of the parameter estimations: say run each model 100 times and see how the parameter estimations distribute. I have an R script for that, but I've only tested it on Metworx (Metrum's computer cluster).

bob-carpenter commented 7 years ago

We should be discussing this on the users list or on the dev list. Could you open a discourse topic?

On Feb 4, 2017, at 12:02 PM, Charles Margossian notifications@github.com wrote:

A while ago, Milad mentioned the parameter estimations were slightly different for the partial pooling model using the matrix exponential. I looked into it and indeed found a few discrepancies, notably for:

gamma[3]: 0.6 +/- 0.29 vs 0.51 +/- 0.23
turnover[3]: 1055 +/- 579.8 vs 7552 +/- 200000
a21[5]: 0.4 +/- 0.24 vs 0.23 +/- 0.19
a32[1]: 0.31 +/- 0.22 vs 0.53 +/- 0.26

What are the values after +/-? We want to look at MCMC standard error around the mean estimate. They should be within a couple errors of each other.

It'd help to rescale the turnover parameter to something more unit-like.

The first value is the estimate from the hierarchical model using a stiff solver, the second from the matrix exponential solution. I loosely define a discrepancy as a case where the stiff and non-stiff solvers agree but the matrix_exp estimate doesn't.

It should be when they report good R-hat convergence and have high enough n_eff that you can tell they got different posterior means (or interval boundaries) up to MCMC error.

I'm not sure how bad these results are.

Unless there's non-convergence somewhere, this is really bad. Something's going wrong in the function and no exceptions are being raised. This is actually the worst thing that can possibly happen---we get the wrong results with no warning.

Ok - what could have gone wrong?

Arithmetic overflow. That turnover parameter is ripe for overflow if there are exp() involved.

I tested the deterministic features of all three models and used the generated quantities block to simulate data for CO2_flux_hat with fixed parameters. I used the "real" values for the parameters. The results all agreed with one another (the largest absolute difference between two data points was 1e-7).

Would people like a PDF report of this? Could be good for QC.

What we really need are unit tests, not PDFs.

Maybe nothing went wrong and we just got unlucky.

The only way that happened is if the models didn't converge.

Here's what I propose to do next:

• Run all three models on the same machine. This would allow me to look at another weird result: the non-stiff solver ran faster than the matrix exponential solver. This seems suspect.

This is what I was bringing up in the meeting---the matrix exp solver looks really expensive if you're just autodiffing that big approximation, whereas the non-stiff solver uses an interpolation algorithm that can make multiple solutions a lot faster than a single one.

In theory the matrix_exp should be faster.

What's your reasoning here? I don't see how you could say that a priori.

This is also what I have observed in practice when doing pharmacometrics modeling.

Were you in a setup where you needed derivatives?

I'll also look at effective independent samples per unit time, rather than just run time (note: this will have to be done for each parameter).

Of course. And over multiple distinct chains.

• It might be worth doing a more thorough comparison of the parameter estimations: say run each model 100 times and see how the parameter estimations distribute. I have an R script for that, but I've only tested it on Metworx (Metrum's computer cluster).

If there's not a bug, the means should be within a few MCMC std errors of each other if you run to convergence and use the same data in each run.
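For the effective-samples-per-unit-time comparison above, a rough sketch in RStan (assuming a fitted stanfit object, here called fit; the name is hypothetical):

```r
# Rough sketch: effective sample size per second of sampling, per parameter,
# assuming `fit` is a stanfit object from one of the three models.
library(rstan)

n_eff   <- summary(fit)$summary[, "n_eff"]         # per-parameter effective sample size
elapsed <- sum(get_elapsed_time(fit)[, "sample"])  # total post-warmup time (s), all chains

n_eff_per_sec <- n_eff / elapsed
head(sort(n_eff_per_sec))  # the worst-mixing parameters per unit time
```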

charlesm93 commented 7 years ago

@bob-carpenter Thank you for the comments! I'm happy to move the conversation to discourse, with the authorization of @ktoddbrown and @milkha.

What are the values after +/-? We want to look at MCMC standard error around the mean estimate. They should be within a couple errors of each other.

Yes, I'm looking at the MCMC standard error (sd column obtained when using print() in RStan).

Arithmetic overflow. That turnover parameter is ripe for overflow if there are exp() involved.

That's a possibility. The inverses of the turnover rates in the ODEs are of order 1e-3, 1e-1, and 1e+0. I don't think that's enough of a spread in orders of magnitude to cause an arithmetic overflow, but we'll have to test it.

In theory the matrix_exp should be faster. What's your reasoning here? I don't see how you could say that a priori.

We evaluate the matrix exponential only at the times where we want a solution. The integrator, on the other hand, builds the solution one step at a time until it reaches the times at which we want a solution. So many more "solutions" get evaluated with the integrator, which is why numerical integration is more expensive. Bill, Sebastian, and Michael should be able to confirm or nuance my explanation.
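To make the intuition concrete, here is a small illustration, not taken from the repository: a generic linear system dC/dt = A*C with a made-up matrix A, initial state, and observation times. The matrix exponential is evaluated once per requested time, while a general-purpose integrator steps through intermediate times internally.

```r
# Illustration only: A, C0, and obs_times are made-up stand-ins for a linear
# compartment system dC/dt = A %*% C.
library(expm)     # matrix exponential
library(deSolve)  # general-purpose ODE integrator

A  <- matrix(c(-0.5,  0.0,
                0.3, -0.1), nrow = 2, byrow = TRUE)
C0 <- c(10, 0)
obs_times <- c(1, 5, 10, 25)

# Matrix exponential route: one evaluation per requested time.
sol_me <- sapply(obs_times, function(t) expm(A * t) %*% C0)

# Integrator route: internally takes many small steps between requested times.
deriv   <- function(t, y, parms) list(as.vector(parms %*% y))
sol_ode <- ode(y = C0, times = c(0, obs_times), func = deriv, parms = A)

# The two routes should agree up to the integrator tolerance.
max(abs(t(sol_me) - sol_ode[-1, -1]))
```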

The models I used in pharmacometrics did require derivatives. It was a very similar situation, where the parameters entered the coefficients of the ODEs. But my comment is based more on experience than on a formal test.

If there's not a bug, the means should be within a few MCMC std errors of each other if you run to convergence and use the same data in each run.

The results do agree within the MCMC sd, which is why I'm not sure anything is wrong, and why I want to run a few more tests. But it's odd to get 0.4 +/- 0.24 and 0.41 +/- 0.25 for the stiff and non-stiff solvers respectively, but 0.23 +/- 0.19 for the matrix exp solution. I think this is what @milkha picked up on. I need to run more convergence tests (R_hat = 1, but that's not enough). I will write a heavier version of century_experiments.pdf with more diagnostics (it will be a new markdown file).

charlesm93 commented 7 years ago

I'll start by reiterating @bob-carpenter 's request to post this on the user mail list, if that's ok with @ktoddbrown and @milkha .

Yes, I'm looking at the MCMC standard error (sd column obtained when using print() in RStan).

I realize I made a mistake here. The MCMC standard error is se_mean, not sd. I looked at the wrong metric.
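For what it's worth, a minimal sketch of the check Bob describes, using se_mean rather than sd (the fit object names are hypothetical; a21 is one of the parameters flagged above):

```r
# Minimal sketch, assuming `fit_stiff` and `fit_me` are stanfit objects fit to
# the same data; `a21` is one of the parameters flagged in the thread.
library(rstan)

s1 <- summary(fit_stiff, pars = "a21")$summary
s2 <- summary(fit_me,    pars = "a21")$summary

# Difference in posterior means in units of the combined MCMC standard error
# (se_mean, not sd); values much beyond ~2-3 would indicate a real discrepancy.
(s1[, "mean"] - s2[, "mean"]) / sqrt(s1[, "se_mean"]^2 + s2[, "se_mean"]^2)
```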

I wrote convergence_tests.R, a script that runs and saves a few diagnostic tests. The company's computer cluster is currently upset with RStan -- a bug I need to fix; I think it has to do with different versions of R packages... -- so I used cmdStan (2.14). I think the script is still readable, but that's easy for me to say... I ran each model with 2000 iterations (1000 for warm-up and 1000 for sampling).

First off, convergence does not happen.

I asked @billgillespie to check the results, and he explains it very well:

With both matrix exp and rk45 there are sampling problems. The clue is that there are divergent transitions. You can kinda guess where they happen when you look at the traceplots. Values get stuck at extremes for a while.

It looks like you are using defaults for all of the tuning parameters. A first attempt at a remedy is to increase adapt_delta to >= 0.9 (the default is 0.8) and see if that makes the divergent transitions go away.

Do you know if the time per iteration is particularly slow during warmup, particularly during its early stages? If so, I recommend specifying initial estimates for all of the parameters. At the moment it looks like you are letting rstan/stan pick them. Actually I almost never rely on the defaults. Another thing that can slow down early iterations is Stan being overly adventurous and jumping into parts of the parameter space that make things get weird. If that's true then setting stepsize to a small value like 0.01 may help.

So step 1: adapt_delta = 0.9 and specify initial estimates. If the divergent transitions go away, then compare runs. If they don't, we may need to explore reparameterization.

The trace-plots Bill refers to are the ones for turnover[2] and turnover[3]. You can also see a little bit of autocorrelation if you look at lp__, even though R_hat < 1.1.
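For reference, these diagnostics can be pulled back into RStan from the cmdStan output (a sketch; the CSV file names are hypothetical):

```r
# Sketch: read the cmdStan output back into a stanfit object and look at the
# diagnostics discussed above. CSV file names are hypothetical.
library(rstan)

fit <- read_stan_csv(c("output_1.csv", "output_2.csv"))

# Count divergent transitions after warmup.
sp <- get_sampler_params(fit, inc_warmup = FALSE)
sum(sapply(sp, function(x) sum(x[, "divergent__"])))

# Trace plots for the parameters that get stuck, and autocorrelation in lp__.
traceplot(fit, pars = c("turnover[2]", "turnover[3]"))
stan_ac(fit, pars = "lp__")
```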

With this run, I could not reproduce the first observed discrepancy in the parameter estimates. I did, however, observe once again that century_hier_nonstiff.stan runs twice as fast as century_hier_me.stan, but it produces a lower number of effective samples.

Before making any further comparisons, I will try to reach convergence, following Bill's recommendations. For the initial estimates, I'll simply sample from the prior distributions.
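A rough sketch of what that rerun could look like through RStan (the data object, parameter dimensions, and prior draws below are placeholders, not the actual priors in the Stan files):

```r
# Rough sketch of Bill's recommendations: higher adapt_delta, smaller initial
# stepsize, and explicit initial values drawn from (placeholder) priors.
library(rstan)

init_fun <- function() {
  list(
    # Parameter names come from the thread; dimensions and distributions here
    # are guesses for illustration, not the model's actual priors.
    gamma    = runif(3, 0, 1),
    turnover = rlnorm(3, 0, 1),
    a21      = runif(5, 0, 1),
    a32      = runif(5, 0, 1)
  )
}

fit_me <- stan(
  file    = "century_hier_me.stan",
  data    = stan_data,   # assumed to be prepared elsewhere
  chains  = 4,
  iter    = 2000,
  warmup  = 1000,
  init    = init_fun,
  control = list(adapt_delta = 0.9, stepsize = 0.01)
)
```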

milkha commented 7 years ago

... if that's ok with @ktoddbrown and @milkha .

That's fine by me.

milkha commented 7 years ago

@charlesm93, Also, please note that our data generation process has changed a few times so far (measurement times, cap times, noise, etc.). I modified these according to comments I got from Kathe and Sean. The latest versions can be found in the server runs folder, which contains the code I am currently running on the server for the power analysis. It would be good if you could use the same parameters.

charlesm93 commented 7 years ago

@milkha I regularly pull from GitHub to make sure I catch your updates. I'm assuming you no longer use synthetic_data/simulate_data_century.R to simulate the data, but synthetic_data/Server_runs/simulate_data_century.R?

milkha commented 7 years ago

@charlesm93 , Yes, that's correct. I try to keep all the files up-to-date.

charlesm93 commented 7 years ago

@milkha Noted. I did my first analysis with the old simulation file. Will rerun with the new files.