lizzieinvancouver / ospree

Budbreak review paper database

Issues in running joint model including bad chains in two intercept model (formerly: Test the model's sensitivity to the priors) #418

Closed DeirdreLoughnan closed 2 years ago

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver I have been testing the model's sensitivity to the betatraitxcue priors. I recall having issues with the model and negative values at one point, but it appears that I already fixed that issue.

The test data does have a positive prior for the betatraitxcues, but I still tested them using wider priors to see how this impacted the model estimates. It appears that when the prior is widened, the estimate for mu_phenosp gets worse, even with a relatively high number of species and studies.

Here is the summary from the previous test data with a prior of (2,1) for the betatraitxcues: Screen Shot 2021-08-05 at 2 36 08 PM

And here is the summary from a wider prior of (2,3) for the betatraitxcues: Screen Shot 2021-08-05 at 3 21 06 PM

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Thanks. Are you talking about lines 14-16? They look pretty similar to me...

Also, is your prior normal(2,1)? If so I really think you should be centering it at 0, as I don't see us having prior knowledge that would center it at 2. Or am I missing something?
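For what it's worth, a quick numeric check (Python here for illustration; the models in this thread are in R/Stan) shows how strongly a normal(2, 1) prior commits to positive values compared with an agnostic normal(0, 1):

```python
from statistics import NormalDist

# A normal(2, 1) prior puts almost all of its mass above zero ...
mass_above_zero = 1 - NormalDist(mu=2, sigma=1).cdf(0)
print(round(mass_above_zero, 3))  # 0.977

# ... while a normal(0, 1) prior is agnostic about the sign of the effect
print(NormalDist(mu=0, sigma=1).cdf(0))  # 0.5
```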

DeirdreLoughnan commented 2 years ago

Thanks @lizzieinvancouver!

To clarify, yes I meant that I changed the priors for lines 14-16 of the summary output (the betatraitxcues), and in doing so the estimate on line 8 (mu_phenosp) got worse. Doubling the Nspp only improved the estimate of mu_phenosp marginally.

This morning I worked on running the model with priors at (0,1); the estimates for the betatraitxcue variables were good (lines 14-16), but the estimates for mu_phenosp were worse (line 8). Screen Shot 2021-08-06 at 12 11 42 PM

DeirdreLoughnan commented 2 years ago

I have also noticed that the test data is producing very low n_eff for muSp and muStudy, but not producing a warning message at the end of the model run. Some of the priors are very narrow, so I figure I will start by widening these priors and seeing if that helps.

lizzieinvancouver commented 2 years ago

> I have also noticed that the test data is producing very low n_eff for muSp and muStudy, but not producing a warning message at the end of the model run. Some of the priors are very narrow, so I figure I will start by widening these priors and seeing if that helps.

@DeirdreLoughnan That sounds like a good place to start. If you find issues there you should backtrack to checking the trait-only model.

I also suggest you plot the true hyperparameters (those parameters underlying mu_phenosp, so the values for each species) versus the model estimated ones. Those must be off for mu_phenosp to be off.
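A minimal numeric version of that recovery check (Python for illustration; the thread's plotting code is in R, and the values below are hypothetical):

```python
def recovery_summary(true_vals, est_vals):
    """Crude recovery check: mean bias and largest absolute miss --
    a numeric stand-in for plotting true vs. estimated with a 1:1 line."""
    diffs = [e - t for t, e in zip(true_vals, est_vals)]
    bias = sum(diffs) / len(diffs)
    worst = max(abs(d) for d in diffs)
    return bias, worst

# Hypothetical species-level values underlying mu_phenosp
true_sp = [148.0, 152.5, 149.0, 151.0]
est_sp = [147.0, 151.5, 150.0, 150.5]
print(recovery_summary(true_sp, est_sp))  # (-0.375, 1.0)
```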

> This morning I worked on running the model with priors at (0,1); the estimates for the betatraitxcue variables were good (lines 14-16), but the estimates for mu_phenosp were worse (line 8).

Which priors were (0,1)? Could you also give the nspp and how much worse the mu_phenosp estimates were?

How long does the model take to run? Maybe I can run it some time next week and take a look....

DeirdreLoughnan commented 2 years ago

I have since made several changes to the priors and will try to outline them more clearly. The line numbers of the changed priors are in reference to the code found here. This is the associated R code.

  1. Lines 142, 143, 144: I changed the priors from (2,1) to be less informative at (0,1). This resulted in good estimates for most variables, but made the estimate of mu_phenosp worse: an estimate of 136, while the test data has a value of 150.

  2. I also widened several of the priors to try and improve the n_eff of the model:
  2a. Line 108: the prior was changed from (10, 0.5) to (10, 1)
  2b. Line 112: the prior was changed from (5, 0.5) to (5, 1)
  2c. Line 109: the prior was changed from (10, 0.1) to (10, 1)
  These changes improved the model estimates and fixed the above issue with mu_phenosp! Screen Shot 2021-08-09 at 6 10 31 PM

I have been working with a minimum of Nspp 20 species and doubling that to 40 to see if it improves the model estimates. If you have time to look into the model more this week Lizzie, that would be amazing. The model with Nspp = 40 takes about 20-25 minutes to run.

I also spoke with @FaithJones and @legault about the n_eff. My takeaway from the conversation is that the model might just be inefficient, but that does not mean the estimates are bad per se.

Faith suggested making more plots, specifically:

  1. plots of the model estimates compared against the test data values
  2. histograms of the posterior with the predicted value
  3. If some parameters are particularly bad, plot the parameter values against each other
  4. Plot the sigmas and the predicted distribution values for the variable, may need to log the sigma

I am still working on making these plots and can post some examples when I am done.

lizzieinvancouver commented 2 years ago

> 2. I also widened several of the priors to try and improve the n_eff of the model:
> 2a. Line 108: the prior was changed from (10, 0.5) to (10, 1)
> 2b. Line 112: the prior was changed from (5, 0.5) to (5, 1)
> 2c. Line 109: the prior was changed from (10, 0.1) to (10, 1)
> These changes improved the model estimates and fixed the above issue with mu_phenosp!

Nice work @DeirdreLoughnan! Some thoughts ...

2a -- normal(10,1) gives values between roughly 7 and 14, that is pretty narrow for real data. How do you have such a good idea of sigma_y (for real data eventually)? Is it based on the previous ospree models? If you don't have good reasoning you need something bigger .... is the model in natural or z-scored units?

2b -- this is for sigma study? normal(5,1) goes from 1 to about 9 ... are you fairly confident your real data will be in that range? If not, make it bigger.

I also personally would set sigma_y and sigma_study similar as I am not sure I have a reason to expect one to be bigger (or smaller) than the other.

Similar to the above comments, these were SUPER narrow priors you started with (for example, run hist(rnorm(1000, 10, 0.1)) ), so I think you basically need to consider going even bigger on many of these. You want a test data model to run on your real data, so you want priors that you think are reasonable (read -- reasonably wide) for your real data.
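To make the narrowness concrete, here is the same check as hist(rnorm(1000, 10, 0.1)), sketched in Python with exact quantiles:

```python
from statistics import NormalDist

def central_99(mu, sigma):
    """Central 99% interval of a normal prior (qnorm(c(0.005, 0.995), ...) in R)."""
    d = NormalDist(mu, sigma)
    return d.inv_cdf(0.005), d.inv_cdf(0.995)

narrow = central_99(10, 0.1)  # roughly (9.74, 10.26): a half-unit window
wider = central_99(10, 1)     # roughly (7.4, 12.6)
print(narrow, wider)
```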

To update @FaithJones suggestions ....

  1. Yes, always good. I also suggested you plot the species level values that make up muphenosp against real values. I am not sure how critical more plots beyond that are just now (I mean, you can just plot your columns you have against one another).
  2. Always good!
  3. Yes, but also against the log posterior.
  4. Why?

I will try to take a quick look at your model now.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan I ran your model and here are some thoughts ...

I agree a low n_eff may not be a real issue; often you can fix this later. That said, it's not super ideal to have it on test data.... I also found your lower level parameters do not match super well. Have you checked something along the lines of the below in your trait-only Stan code?

sumer <- summary(mdl.jointfcp)$summary
plot(mu.study, sumer[grep("muStudy\\[", rownames(sumer)), "mean"])
abline(a = 0, b = 1, lty = "dotted") # not amazing ... but not horrible

Also, these trace plots are not the end of the world, but also not what we dream of in test data: caterpillar_notquite

lizzieinvancouver commented 2 years ago

This is better ...

cater_morefuzz

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan I suggest you ...

  1. Check your priors again.
  2. Check your mu_study[1] ... to mu_study[n] values on your trait-only model and make sure they look good.
  3. Your trt.er is SUPER high (it's bigger than most other values in the model? Is that true?) ... for test data you don't necessarily need such high error, ditto for phenology error. I would dial these way down (to 1 or 0.5 to start).
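On dialing the simulated error down, a toy illustration (Python, with made-up numbers) of why high error in test data makes parameter recovery look worse at a fixed sample size:

```python
import random
import statistics

random.seed(42)

def recovered_mean_spread(error_sd, n_obs=50, n_sims=200):
    """How much the estimated mean wanders across simulated test datasets;
    high simulated error widens this spread at a fixed sample size."""
    means = [statistics.fmean(random.gauss(0.0, error_sd) for _ in range(n_obs))
             for _ in range(n_sims)]
    return statistics.stdev(means)

# Error sd of 5 vs dialing it down to 0.5: recovery is far noisier with 5
print(recovered_mean_spread(5.0) > 5 * recovered_mean_spread(0.5))  # True
```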

I am doing a run with more reps (Nrep = 20 and nphen =20) and will report back then push my code.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan

Okay, it looks a little better. It still has n_eff issues, but on a few fewer parameters. And the mu_study values look better (see plot). So I wonder if the issues are small sample size (20 species is not much, and your reps are low), priors that could be improved, and error that is too high.... I can take a look at your trait only model by the way, if that helps.

rplot

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan I played around with the 3cue model this afternoon. I didn't have any breakthroughs, but tweaked a few things, and still wonder if the model is running okay and we're just asking too much. To help me with that, when time allows, can you check the below?

plot(alpha.pheno.sp, (sumer[grep("mu_grand", rownames(sumer)), "mean"] + sumer[grep("muSp\\[", rownames(sumer)), "mean"]))
abline(a = 0, b = 1, lty = "dotted") # hmm, I must not be plotting the right things ...

Thanks also for posting, I'll reply or ask for help once I know more.

lizzieinvancouver commented 2 years ago

For reference, here's my output at nspp = 50:

nspp50

And here it is at nspp = 80:

nspp80

lizzieinvancouver commented 2 years ago

Also note that with only 50 spp, the actual mean of forcing was pretty low. I wondered if this was the issue and so looked at histograms of the test values (see PDF) ... but am not sure.

nspp50hist.pdf

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver I must be missing something, I have been trying to get the trait only model working with wider priors and it is producing large Rhat values and 200+ transitions that exceed the maximum tree depth. I tried running the model with 50 species, 50 reps, and 50 studies, but it took hours to run and produced terrible estimates.

I started plotting the test data and playing around with rstanarm instead:

Screen Shot 2021-09-02 at 8 24 02 PM

I have also updated the code of both the trait and three cue models to specify the priors in the R script. The tables of model estimates now include the 25% and 75% intervals, and the plotting code is on lines 138-144 of the trait code.

lizzieinvancouver commented 2 years ago

I posted to Discourse

FaithJones commented 2 years ago

Maybe this link will help https://biologyforfun.wordpress.com/2016/12/08/crossed-and-nested-hierarchical-models-with-stan-and-r/

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Just occurred to me ... doesn't your synchrony model have species and study!? And that runs fine.

FaithJones commented 2 years ago

It looks from yesterday like our main issue at the moment is a badly fitting traits model (intercept and 2 crossed hierarchical grouping factors). Here are some potential tasks to try and figure out the traits model:

DeirdreLoughnan commented 2 years ago

Geoff and I ran models, using the same code to generate test data and the same priors. The code can be found here.

Both Geoff's modified GDD model and my previous traits only model produced similar estimates, a handful of transitions exceeding maximum tree depth, and low ESS. The chain mixing did not look great.

But we are now confident that both models are functionally the same.

FaithJones commented 2 years ago

I ran the model using real SLA data with 1000 warmup and 2000 iterations. The model gave warnings that the Bulk Effective Sample Size (ESS) was too low. Also the mu_grand chain did not mix well. But no divergent transitions... image

legault commented 2 years ago

Unbalanced test added in bab6c632

First impression is that unbalanced data leads to better n_eff

DeirdreLoughnan commented 2 years ago

Thanks @FaithJones that sounds promising!

I worked up the 1932 Olympic example. The code can be found here and Stan code here. The model runs very quickly and gives no warning messages of any kind. Interestingly, the chains still don't mix very well and a handful of the n_eff are still around 500.

Screen Shot 2021-09-09 at 2 03 45 PM

@lizzieinvancouver did you want to follow up with Andrew Gelman about why the chains are not mixing very well? Or if you run out of time, should I?

FaithJones commented 2 years ago

Maybe our next step is to try running our model on simulated data where the effect of study and species are quite different? @DeirdreLoughnan did you already do this?

DeirdreLoughnan commented 2 years ago

@FaithJones I did not, but I can if you are still working on the SLA model.

FaithJones commented 2 years ago

@DeirdreLoughnan I'm currently trying to go through a paper suggested by the Discourse post person on non-identifiability in ecology.

FaithJones commented 2 years ago

Also I ran the SLA model with (I think?) fairly uninformative priors. I pushed the code to realData_traitsModel.R

trait_data <- list(yTraiti = SLAData$traitvalue,
                   N = Ntrt,
                   n_spec = Nspp,
                   species = SLAData$species,
                   study = SLAData$study,
                   n_study = Nstudy,
                   prior_mu_grand = 15,
                   prior_sigma_grand = 5,
                   prior_mu_sp = 0,
                   prior_sigma_sp_mu = 10,
                   prior_sigma_sp_sigma = 10,
                   prior_mu_study = 0,
                   prior_sigma_study_mu = 10,
                   prior_sigma_study_sigma = 10,
                   prior_sigma_traity_mu = 5,
                   prior_sigma_traity_sigma = 1)

lizzieinvancouver commented 2 years ago

> Thanks @FaithJones that sounds promising!
>
> I worked up the 1932 Olympic example. The code can be found here and Stan code here. The model runs very quickly and gives no warning messages of any kind. Interestingly, the chains still don't mix very well and a handful of the n_eff are still around 500.
>
> Screen Shot 2021-09-09 at 2 03 45 PM
>
> @lizzieinvancouver did you want to follow up with Andrew Gelman about why the chains are not mixing very well? Or if you run out of time, should I?

@DeirdreLoughnan Thanks! Could you post Stan, R and data to Discourse with your chains picture? I will email Andrew then and cc you.

DeirdreLoughnan commented 2 years ago

@lizzieinvancouver my post on discourse is here.

I also looked into whether there is similarly poor chain mixing for my synchrony model....and there is. I look forward to hearing what Andrew has to say.

lizzieinvancouver commented 2 years ago

@DeirdreLoughnan Hmm, I just went to email Andrew and since Mike Lawrence posted I think we should try what he suggested ... add a prior on mu, a pairs plot (if easy) and just run it in rstanarm and report back. Sorry for the extra requests, but I think we are getting traction! I will ask Andrew after we answer Mike's queries (otherwise I think Andrew will suggest we do what Mike wrote).

FaithJones commented 2 years ago

I read this paper suggested by Lizzie's Discourse buddy on her post about the traits model. It was interesting to see how even a simple model could have identifiability issues. The only potentially applicable solution for us was Solution 4: "post-sweeping" (p.12). I didn't really understand this solution though I'm afraid... "This solution retains the original parameterization involving the non-identifiable intercept and random effects. However, these non-identifiable quantities are only used to compute relevant identifiable quantities that we store, monitor, evaluate, summarize, and report. There is no need to monitor or store the non-identifiable quantities, and they should not be involved in our assessment of mixing and convergence."
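My best guess at what post-sweeping means, as a Python sketch (names here are illustrative, not from the paper): the intercept and the random effects are individually non-identifiable, since shifting a constant from one to the other leaves the likelihood unchanged, but the "swept" combinations are identifiable, so you store and monitor only those:

```python
from statistics import fmean

def post_sweep(mu, alphas):
    """Sweep the mean of the random effects into the intercept, returning
    the identifiable quantities: swept intercept and centered effects."""
    abar = fmean(alphas)
    return mu + abar, [a - abar for a in alphas]

# Two hypothetical posterior draws that differ only by moving c = 3 between
# the intercept and every random effect (identical likelihood):
m1, a1 = post_sweep(10.0, [1.0, -2.0, 0.5])
m2, a2 = post_sweep(13.0, [-2.0, -5.0, -2.5])

# The swept quantities agree, so mixing can look fine on them even when the
# raw mu and alpha chains wander.
print(abs(m1 - m2) < 1e-9 and all(abs(x - y) < 1e-9 for x, y in zip(a1, a2)))  # True
```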

DeirdreLoughnan commented 2 years ago

Thanks @FaithJones for looking into that paper for us. I am not sure I understand what that means either.

I have summarized my correspondence with Andrew and Jonah regarding our intercept model here.

The main suggestion we got was to add ncp (a non-centered parameterization) to our model and use tighter priors on our sigma values. I have started playing around with this (here is the Stan code for the ncp model), but it does not seem to fix the issue with n_eff as it did for the Olympics example. The smallest n_eff for the previous model is 242, but for the ncp model it is still only 297. The Rhat values are good for both models though. The chains are still mixing very poorly, especially for the grand mean:

Screen Shot 2021-09-16 at 2 27 37 PM
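For anyone following along, the ncp trick can be sketched like this (Python for illustration; in Stan the non-centered form is theta = mu + sigma * theta_raw with theta_raw ~ normal(0, 1)). Both forms imply the same distribution for theta; only the geometry the sampler sees changes:

```python
import random
import statistics

random.seed(1)
mu, sigma = 150.0, 10.0
n = 20000

# Centered: draw theta directly from normal(mu, sigma)
centered = [random.gauss(mu, sigma) for _ in range(n)]

# Non-centered: draw z ~ normal(0, 1), then shift and scale
noncentered = [mu + sigma * random.gauss(0.0, 1.0) for _ in range(n)]

# Same implied distribution: means and sds agree up to Monte Carlo error
print(abs(statistics.fmean(centered) - statistics.fmean(noncentered)) < 0.5)
print(abs(statistics.stdev(centered) - statistics.stdev(noncentered)) < 0.5)
```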

The model estimates for the ncp model are not the greatest and the hyper-parameters are also poor. Screen Shot 2021-09-16 at 3 51 24 PM

Screen Shot 2021-09-16 at 3 42 05 PM

But the pairs plot looks alright.

Screen Shot 2021-09-16 at 3 51 02 PM

DeirdreLoughnan commented 2 years ago

Our goal for the next mini retreat is to decide how best to move forward with this model. We see our current options as being:

  1. Accept that having poorly mixing chains is not that bad as long as we can figure out how to get better n_eff and Rhat values
  2. Try to get the model to run with very tight priors
  3. Drop study from the model

FaithJones commented 2 years ago

@legault and I have been working on this today. We started by trying to run the real data for SLA and a subset of height. While we were doing this we found a bug in the traits-only Stan code where the hierarchical effects were not quite non-centred, but also not centred. Geoff fixed this, and I checked that this was not an issue with the big joint model. Our next steps are:

FaithJones commented 2 years ago

@lizzieinvancouver and I compared the plots visually; they look similar. The model without study ran worse than with study (in terms of n_eff). We decided perhaps the best bet is to include study for now. We should:

FaithJones commented 2 years ago

I'm closing this issue because we have moved on from the original task.