lizzieinvancouver / pmm

phylogenetic mixed models -- and we all learn together

working on adding a Cholesky to speed up code #15

Open lizzieinvancouver opened 2 years ago

lizzieinvancouver commented 2 years ago

Running simulate_fit_oneslopeintercept.R to check how it ran (takes maybe an hour or less; chains look very good):

[screenshot: model estimates]

Values are pretty close; note that the assigned priors in this code are very narrow, so this is maybe not the ideal test before tackling real data.

lizzieinvancouver commented 2 years ago

I then tried adding the Cholesky (commit 6a317a67c96339e0182a9085616f6c34ffe8a0f2) ... see the new 'temp' folder (we should delete this eventually) and never got good output. It runs fast, but the chains don't converge and it never seemed to recover lam or sigma for the phylo part ... even with the treedepth set higher (which makes the model take forever and thus seems beside the point ...).

[screenshot: model output]

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Do these trends match what you're seeing? If so, I suggest you do a quick comparison of my new code and yours (we can discuss any big differences if you find them) and post back to Discourse.

lizzieinvancouver commented 2 years ago

Here's the Discourse post and content of Ben Goodrich's email:

The multi_normal() function that takes a covariance matrix calculates a Cholesky factor. And I think there was a pull request to literally make multi_normal() calculate the Cholesky factor and then call multi_normal_cholesky(). So, doing it yourself is not going to save any time.

But you can save a lot of time by only doing one Cholesky factorization. If you do

```stan
transformed data {
  matrix[Nspp, Nspp] L = cholesky_decompose(Vphy);
}
```

then in the model block, you can do

```stan
a ~ multi_normal_cholesky(rep_vector(a_z, Nspp), sqrt(lam_interceptsa) * sigma_interceptsa * L);
b ~ multi_normal_cholesky(rep_vector(b_z, Nspp), sqrt(lam_interceptsb) * sigma_interceptsb * L);
```

because your lambda_vcv() function is just rescaling Vphy so there is an analogous rescaling of L.

But calling multi_normal() or multi_normal_cholesky() on parameters is usually a bad idea. It is usually much better to utilize the stochastic representation of the multivariate normal distribution and define a and b as transformed parameters:

```stan
transformed parameters {
  vector[Nspp] a = a_z + sqrt(lam_interceptsa) * sigma_interceptsa * (L * a_);  // implies: a ~ multi_normal
  vector[Nspp] b = b_z + sqrt(lam_interceptsb) * sigma_interceptsb * (L * b_);  // implies: b ~ multi_normal
}
```

where a_ and b_ are defined as vectors of size Nspp in the parameters block and have iid standard normal priors in the model block.
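
Putting the three snippets together, a minimal sketch of the full non-centered model might look like this. Only the transformed data and transformed parameters lines come from the snippets above; the data block, priors, and likelihood are illustrative assumptions for a one-slope, one-intercept model, not the repo's actual file:

```stan
data {
  int<lower=1> N;                       // number of observations
  int<lower=1> Nspp;                    // number of species
  int<lower=1, upper=Nspp> sp[N];       // species index for each observation
  vector[N] x;                          // predictor
  vector[N] y;                          // response
  matrix[Nspp, Nspp] Vphy;              // phylogenetic covariance matrix
}
transformed data {
  matrix[Nspp, Nspp] L = cholesky_decompose(Vphy);  // factor once, up front
}
parameters {
  real a_z;                             // root intercept
  real b_z;                             // root slope
  real<lower=0, upper=1> lam_interceptsa;
  real<lower=0, upper=1> lam_interceptsb;
  real<lower=0> sigma_interceptsa;
  real<lower=0> sigma_interceptsb;
  real<lower=0> sigma_y;
  vector[Nspp] a_;                      // raw iid standard-normal draws
  vector[Nspp] b_;
}
transformed parameters {
  vector[Nspp] a = a_z + sqrt(lam_interceptsa) * sigma_interceptsa * (L * a_);
  vector[Nspp] b = b_z + sqrt(lam_interceptsb) * sigma_interceptsb * (L * b_);
}
model {
  a_ ~ normal(0, 1);                    // implies a ~ multi_normal, as above
  b_ ~ normal(0, 1);
  lam_interceptsa ~ beta(1, 1);         // assumed priors, for illustration only
  lam_interceptsb ~ beta(1, 1);
  sigma_interceptsa ~ normal(0, 5);
  sigma_interceptsb ~ normal(0, 5);
  sigma_y ~ normal(0, 5);
  y ~ normal(a[sp] + b[sp] .* x, sigma_y);
}
```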

DeirdreLoughnan commented 2 years ago

Thanks @lizzieinvancouver!

Our code for the corrected version is essentially the same (the only minor difference being the specification for the priors of a and b), but I agree with your overall summary.

I am currently running the model with a greater treedepth to test whether I get similar values, but without it, and without specifying initial parameters, my model output still has 4000 divergent transitions and Rhat values of 3.16. The estimates are:

[screenshot: model estimates]

DeirdreLoughnan commented 2 years ago

Aki Vehtari replied to our Stan discourse post with the following questions and comments:

  1. The fact that our model generates 4000 transitions exceeding max tree depth suggests very high posterior correlation between some parameters. He asked whether we had looked at the pairs plots and how big the ESSs for the quantities of interest are.
  2. He did not think the estimates looked that bad, but added that a single run is not enough to say whether estimates are bad.
  3. In asking about the Woodbury matrix code, he said to code the equation from Wikipedia, but agreed that if the model seems to be running faster without it, it shouldn't be a priority.
  4. Before trying to figure out the reason for the high posterior correlations, he suggests trying a dense mass matrix, which should help with correlated posteriors, unless there are a lot of parameters and the multiplication with the big dense mass matrix dominates the computation time (which I would think would be an issue for the real data with 1200+ species); see the sketch below for one way to request this in rstan.

To see his full response, see the discourse post.
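
For reference, a minimal sketch of requesting a dense mass matrix in rstan, per point 4 above (mod and datalist are placeholders, not objects from this repo):

```r
# Dense Euclidean metric, as Vehtari suggests for correlated posteriors;
# all other sampler settings are left at their defaults.
fit <- rstan::sampling(mod, data = datalist,
                       control = list(metric = "dense_e"))
```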

DeirdreLoughnan commented 2 years ago

Using Mike Betancourt's tutorial and Faith's pairs plot code, I got the following output from the Cholesky model (without initial values): [screenshot: diagnostic output]

[screenshot: pairs plot]

The ESS per leapfrog step values (which reflect computational efficiency, with smaller values being more computationally costly) are: [screenshot: ESS per leapfrog step]

@lizzieinvancouver @legault Based on these outputs, are there any other things we should try first? The multimodality in the pairs plots is concerning.
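
For reference, a sketch of how this metric can be computed from a stanfit object (fit is a stand-in for the fitted model; this is not code from the repo):

```r
# ESS per leapfrog step: each quantity's effective sample size divided by
# the total number of leapfrog steps across all post-warmup draws.
library(rstan)
n_eff <- summary(fit)$summary[, "n_eff"]
leapfrogs <- sum(sapply(get_sampler_params(fit, inc_warmup = FALSE),
                        function(ch) sum(ch[, "n_leapfrog__"])))
n_eff / leapfrogs  # smaller values = more computation per effective draw
```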

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Geoff suggested a code change at Bayes class a couple of weeks back that I did not understand, so I asked and understand it better now. It basically just moves some of the code around so it will hopefully run faster. He had not pushed it then, but it's pushed now. If you have time, can you see how fast it runs and how it does on the output?

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Merry Christmas Eve! I am trying to do a few quick things before getting offline (so feel free to IGNORE all of this). As part of that, I wrote some notes on our issue here and am emailing Mike now.

DeirdreLoughnan commented 2 years ago

Thanks @lizzieinvancouver and @legault for your help with this.

I tried running the new code and initially got the error that the variable sigma_y_mu_sigma was not defined, so I added sigma_y_mu_sigma = 1 to the param object in the R file (line 59). If this sounds wrong to you, let me know.

The new code runs fairly quickly for 200 species and the output looks good enough to me:

[screenshot: model output for 200 species]

However, I am currently running the model with 1200 species and it is not very fast: 1000 transitions with 10 leapfrog steps per transition take ~10000 seconds. So while it is twice as fast as the original model, it still might be too slow.

DeirdreLoughnan commented 2 years ago

Thanks @lizzieinvancouver for taking the time to chat about this model today. The model appears to be getting stuck; the last two chains are not getting past 70% and 80%, respectively.

Before I move on to doing this analysis at the family level, below are the things I am going to look into:

I ran the OSPREE phylogeny model with the new model code and Cholesky. It did improve the transition time a little: previously it was taking 939 seconds on average on my computer, and now it takes about 880 seconds.

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver @legault

A quick update on the progress I have made with the above tasks:

  1. I am running the model for 2000 and 3000 iterations, but based on the estimated time, I think they will take another 7 days to finish. This surprised me, since I would have thought fewer iterations would run faster.
  2. The ncp model, which unfortunately was giving us bad estimates on our test data, ran really quickly for my real data! It only took 3 days to run, but it was not efficient, with 8000 transitions exceeding the max tree depth and the largest Rhat being 1.08. The pairs plot still shows a correlation between lam_interceptb and sigma_interceptb, but the other parameters don't look as bad as they did for the test data. [screenshot: pairs plot]

I also ran Geoff's revised model on my real data and the estimated run time was faster, at ~1800s, but after running it overnight it was still only at 1% completion.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Can you check the stepsize on the model with the correlations?

Also, I posted some example nested model code which you can find here.
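
One way to check the adapted step size from a stanfit object (a sketch; fit stands in for the fitted model):

```r
# Mean adapted step size per chain, post-warmup, from rstan's sampler diagnostics.
sapply(rstan::get_sampler_params(fit, inc_warmup = FALSE),
       function(ch) mean(ch[, "stepsize__"]))
```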

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver the stepsize for the model is very small, ~0.0015.

Thanks for sharing this code! I will create a new issue in my Synchrony repo to discuss progress on the family level model, unless you think it would be useful to have it in the pmm repo as well.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan It would be great to have it in the pmm repo as well! Slightly easier for me if we start the code there, but if starting in synchrony repo is better for you we can always move it over to pmm later.

DeirdreLoughnan commented 2 years ago

I am still holding out a glimmer of hope that we can get this model to work. After chatting with @lizzieinvancouver I am running the following models:

I ran my new version of Geoff's model quickly on test data for 200 species and most of the estimates were good, but lambda_interceptbf was slightly off. If this works I will go back and look into it further. On the real data the estimated leapfrog time is 7200s, so I estimate it will take about a week to run.

The ncp model without phylogeny on the intercept runs much faster (~200s for the leapfrog steps), so I will share an update on the output and the pairs plot later today or early tomorrow.

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver a quick update on the above: The ncp model ran in about 19 hours but the output is still quite poor. The model had 8000 transitions that exceeded the maximum tree depth and produced 17 values with n_eff less than 600.

There is still the strong correlation between the lambda and sigma interceptb values: [screenshot: pairs plot]

But the estimates are at least not wild numbers:

[screenshot: model estimates]

Geoff's model with just the phylogeny on the intercept is running much faster and was at 30% after 24h. I will post an update on the output later this week.

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver @legault Geoff's model without phylogeny on the intercept, run on the real data, finished in about 2.5 days.

Unfortunately, the model produced 30 divergent transitions, and lam_interceptbf and sigma_intercept_bf are still correlated: [screenshot: pairs plot]

The estimates don't look too weird though: [screenshot: model estimates]

[screenshot]

Given that there are so few divergent transitions, do we think it is worth trying to do more to fix the model, like modifying the priors?

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Yes! I think this is hopeful. We shouldn't modify priors unless they look like maybe they are wrong. So take a look and see what you think. How many warmup iterations currently? That's an 'easy' one to try playing with ... and what about adapt_delta? Try it just a touch higher than default? Or is it already high?
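
For reference, a minimal sketch of turning those two knobs in rstan (mod and datalist are placeholders, and the values are just illustrative):

```r
# Longer warmup plus a slightly higher adapt_delta (default is 0.8);
# these particular values are examples, not settings from the repo's scripts.
fit <- rstan::sampling(mod, data = datalist,
                       iter = 4000, warmup = 3000,
                       control = list(adapt_delta = 0.9))
```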

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan What's the prior on sigma_interceptsbf?

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver the model was just run with 2000 iterations and no changes to adapt_delta. I will try running it with a longer warmup and another version with a higher adapt_delta and see if we get any improvements.

The prior on sigma_interceptbf is (5,5), which in retrospect seems odd.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Oh, I would definitely up it to 4K iterations with 3K warmup to start. That prior seems okay to me actually.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Another to-do item ... run your model without phylogeny, with partial pooling only on the slopes, and compare your slope estimates across the two models.

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver I can run that model, but I already ran and have output for a similar model that included the effects of study. The slope estimates are pretty close, with the phylogeny model predicting b_zf = -0.327 and the study model predicting b = -0.328. I also quickly compared the species-level slopes: [screenshot: slope comparison]

Most are lining up pretty well (the wild outlier on the right is for the Eurasian bullfinch, for which we have data from Japan and Russia, so the study effect is considerable). I thought this was cool and worth sharing.

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver I ran Geoff's version of the phylogeny model, with phylogeny on the slopes only, for the test data, and I think we are on the right track, but I made a mistake and may need to re-run it. The prior I gave the model for mu_a was 4, but I generated the intercepts with a mean of 10 (oops). I will re-run the model with the same value for both. But here is the output just in case it is still worth talking about: [screenshot: model output]

The mean step size is 0.022 and the n_eff values range from 1272 to 14408.

[screenshot: model estimates]

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver I reran the model on test data with wider priors and I think it is still doing a pretty good job of recovering the test parameters: [screenshot: model output]

The mean stepsize is 0.024, n_eff ranges from 1265.2 to 8592.6, and the Rhats are good.

I also ran a model with phylogeny on the intercept only for the test data and it too seems to be working well: [screenshot: model output]

The mean stepsize is 0.0303, with good n_eff ranging from 1192.1 to 8270.5, and Rhat values of 1.

The pairs plot shows less of a correlation between the lambda and sigma: [screenshot: pairs plot]

DeirdreLoughnan commented 2 years ago

Finally, I ran the model with phylogeny on the intercept only on the real data:

The model runs well, with a stepsize of 0.1059, n_eff ranging from 2757 to 23318, and Rhat values of 1. [screenshot: model summary]

But the pairs plot shows a weird relationship between lam_intercepta and sigma_intercepta: [screenshot: pairs plot]

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver Aki Vehtari does not think we need to be worried about the high correlation in the slope model. He said that the MCMC appears to be working well and that it is good enough.

The correlation for my intercept model is worse, though. I will take another look at the code first and see if I am just doing something wrong.

lizzieinvancouver commented 2 years ago

> The correlation for my intercept model is worse, though. I will take another look at the code first and see if I am just doing something wrong.

That makes sense, because we think the intercept model is likely not great. It might be worth comparing the no-pooling intercepts (no Stan, just a bunch of lm models) to the partial-pooling ones and seeing how bad the effect is.

DeirdreLoughnan commented 2 years ago

Thanks @lizzieinvancouver, would you be available to chat about this some time this week?

I am just not sure I understand what you are suggesting. Looking back at my notes from January of last year, we did compare the no-pooling to partial-pooling models, but with a subset of the data (it looks like I excluded the Thackeray data). Is this what you have in mind?

[screenshot: no-pooling vs partial-pooling comparison]

Or do you mean to compare a no pooling and partial pooling model with the phylogenetic component?

I also recall having issues running linear models with this dataset, but I have run versions of it with rstanarm.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Yes! That's what I mean. You should also just plot histograms of the intercepts for each species from the models. You can use lm for the simple linear models. I'll be around this week (M-Th) and happy to chat, but I am not sure when yet.
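
A minimal sketch of that comparison (the column names and the partial-pooling vector a_pooled are hypothetical stand-ins for the actual data and stanfit output):

```r
# No-pooling intercepts: fit a separate lm per species, then compare the
# distribution of intercepts against the partial-pooling (Stan) estimates.
nopool <- sapply(split(dat, dat$species),
                 function(d) coef(lm(doy ~ year, data = d))[1])
par(mfrow = c(1, 2))
hist(nopool, main = "No pooling (lm)", xlab = "Species intercept")
hist(a_pooled, main = "Partial pooling (Stan)", xlab = "Species intercept")
```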

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver I tried running the model with phylogeny on both the intercept and the slope again...and it finished!!!

The model does produce 12 divergent transitions, but the warmup was only 2000 (with 4000 iterations total). Do you think it is fine to just run it again with 3000 warmup and 4000 iterations and see if that helps, or should I up it a lot, to say 4000 and 6000?

The mean stepsize is pretty good at 0.103, as are the n_eff (2463-21587) and Rhat.

The model estimates are pretty similar to the estimates from the intercept-only and slope-only models: [screenshot: model estimates]

[screenshot]

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Oh my -- that's brilliant! Bravo. Yes, I would just up the warmup as you suggested (4K with 3K warmup). I think you can run with these results though for plotting and interpreting (I doubt the 12 divergent transitions are hiding a deep, dark pathology -- but we should get a run without any eventually). How long did this model take?

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver I am very excited by these results!

Great! I was actually surprised by how quickly it ran; it finished in about 6 days.

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver @legault I chatted with Geoff last week about ways to potentially get rid of the pesky last divergent transition.

Geoff's suggestion was to try a more skewed beta distribution on lambda_interceptb, since the log posterior seemed to be up against zero, but the estimates ranged from essentially 0 to 0.3. I reran the model with a beta prior of (2,5).

[screenshot]

The model ran with no warning messages (yay!), the n_eff and Rhat values are good (yay!), but the estimate of lambda_b essentially doubled:

[screenshot: model estimates]

[screenshot]

My understanding is that this is bad and may suggest that the model estimate is completely dependent on the prior. But does that mean this model is not useful?

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan That's interesting. Could you plot the lambda_b prior versus the posterior (same plot -- in a way we can easily compare them) for this model? And also the older model? Thanks!
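
Something like the following would work for that overlay (a sketch; the parameter name and fit object are assumptions):

```r
# Overlay the beta(2,5) prior density on the posterior histogram for lambda_b.
post <- rstan::extract(fit)$lam_interceptb  # assumed parameter name
hist(post, freq = FALSE, col = "lightblue", xlim = c(0, 1),
     main = "lambda_b: prior vs posterior", xlab = "lambda_b")
curve(dbeta(x, 2, 5), add = TRUE, col = "pink", lwd = 2)  # prior density
```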

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver Absolutely!

This is the old model, with the prior (1,1) in pink and the posterior in blue:

[screenshot: old model, beta(1,1) prior vs posterior]

This is the new model, with a prior of (2,5); again, the posterior is in blue:

[screenshot: new model, beta(2,5) prior vs posterior]

DeirdreLoughnan commented 2 years ago

Based on these, I would say the model is better, since the posterior and the prior overlap more, but most of the posterior estimates are on the edge of the prior distribution.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Thanks! The posteriors look almost identical across the two plots, so I cannot figure out how your estimate is so different ... can you plot the full histogram (the top is perhaps cut off on the left side?) and the mean estimate?

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver oops, the posteriors were the same!

I fixed it, here is the old model:

[screenshot: old model, prior vs posterior]

And the new model:

[screenshot: new model, prior vs posterior]

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Ah, that all makes sense now. Huh! Well, I am not sure what to do. Unless @legault has a good sense of which is better, I suggest you post to Stan Discourse and ask about this (you can link to your old post). They will likely have a better sense than any of us.

The one thing I will say is that this is a WIDE posterior ... you might try comparing it to what Nacho got for his phylogeny work. If his posteriors are tighter, that suggests to me you have a lot of spread in your lambda_b estimate.

DeirdreLoughnan commented 2 years ago

Perhaps it would be more appropriate to use a prior distribution of (1,3)? I could try running this and see if it produces issues.

[screenshot: beta(1,3) prior density]

Geoff's suggestions were to: (1) justify the existing priors via prior predictive checks, the literature, and/or expert knowledge (so looking at Nacho's model might be useful), or (2) widen the priors a little bit at a time (or a lot at a time) until they're as wide as they can be while still producing good diagnostics.

It is less clear to me how point 2 would work, given the priors are already pretty wide.
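
For side-by-side reference, the three candidate lambda priors discussed so far can be plotted together (a base-R sketch):

```r
# Compare the beta(1,1), beta(2,5), and beta(1,3) densities on [0, 1].
curve(dbeta(x, 1, 1), from = 0, to = 1, ylim = c(0, 3),
      xlab = "lambda", ylab = "Density")
curve(dbeta(x, 2, 5), add = TRUE, lty = 2)
curve(dbeta(x, 1, 3), add = TRUE, lty = 3)
legend("topright", legend = c("beta(1,1)", "beta(2,5)", "beta(1,3)"), lty = 1:3)
```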

DeirdreLoughnan commented 2 years ago

Thanks @lizzieinvancouver I will draft a discourse post and see what people think!

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Great! See what they say.

I do think the difference (0.07 to 0.11, with a large posterior variance) is not a giant one in terms of biology, so the new prior is okay by me.

legault commented 2 years ago

@lizzieinvancouver @DeirdreLoughnan , I agree there's probably not much difference between 0.07 and 0.11 biologically. Given the influence of the prior distributions, there also likely isn't enough information in the data alone to distinguish between those values.