lizzieinvancouver / pmm

phylogenetic mixed models -- and we all learn together

start comparison of stan code #19

Closed lizzieinvancouver closed 1 year ago

lizzieinvancouver commented 2 years ago

@dinaramamatova

Compare new and old Stan code for the PMM:

https://github.com/lizzieinvancouver/pmm/blob/master/analyses/stan/ubermini_2.stan is the original code that I got running long ago. You can run it in https://github.com/lizzieinvancouver/pmm/blob/master/analyses/ubermini_2.R

And compare to:

https://github.com/lizzieinvancouver/pmm/blob/master/analyses/simulate_fit_oneslopeintercept_cholesky.R

which I think runs in https://github.com/lizzieinvancouver/pmm/blob/master/analyses/simulate_fit_oneslopeintercept_cholesky.R (but check, you can search on github)

Our eventual goal is to work on some simulations to compare the changes made to these models. One change is the priors (what they are and how they are fed in), which I am not too concerned about, but see what other differences you find (use a file diff program or similar -- on a Mac, FileMerge is a nice option, but there are lots) and report back on them.

dinaramamatova commented 2 years ago

Hello Lizzie,

Here are some differences between the two Stan files that I have found so far:

  1. For the data block, there is not much difference in how the number of observations, number of predictors, predictor/response vectors, slopes array, and phylogeny matrix are declared. However, in Deirdre's code, she added declarations for the variables used to set the model's prior values, like so:

    [screenshot: data block declarations for the prior hyperparameters]

    In the R code that runs the Stan file, she assigned the following values to these variables:

    [screenshot: R code assigning the prior values]

    These values, collected in the phypriors list object, are fed into the Stan model as part of the data argument of the R stan function, alongside the rest of the data values (see the sketch after this list). For example, here is how the Stan model is called in Deirdre's code:

    [screenshot: the stan() call in Deirdre's R code]
  2. For the parameters, Deirdre added some more parameters that we would like to estimate: the root value a_z, the intercept lambda lam_interceptsa, and the intercept rate of evolution sigma_interceptsa. The rest of the parameters are expected to be the same in both models:

    [screenshot: the added parameter declarations]
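
Here is a rough sketch (in R) of what item 1 above describes -- the prior hyperparameters kept in a list and appended to the data passed to Stan. The object names (y, sp, x1, Vphy, nspecies), the Stan file name, and the exact set of hyperparameters are placeholders, not copied from the repo:

library(rstan)

# Prior hyperparameters live in an ordinary R list
phypriors <- list(a_z_prior_mu = 4, a_z_prior_sigma = 5,
                  lam_interceptsa_prior_alpha = 2, lam_interceptsa_prior_beta = 2)

# Combine the simulated data and the prior values into one list for Stan
stan_data <- c(list(N = length(y), n_sp = nspecies, sp = sp,
                    x1 = x1, y = y, Vphy = Vphy),
               phypriors)

fit <- stan("deirdres_pmm_model.stan", data = stan_data, chains = 4)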

There were also some additions to Deirdre’s model block compared to your original model:

  1. First of all, Deirdre assigned normal priors to our overall sigma variable and to the root values of the intercepts and slopes, as well as to the rate-of-evolution variables for these intercepts and slopes, using the prior values passed to the model from the R code. She also put beta prior distributions on the lambda variables for the intercepts and slopes.

    [screenshot: the prior statements in Deirdre's model block]

    I do have a question regarding the sequence of variable assignments - shouldn’t the priors be set before any predictions are made in the model?

  2. From the previous picture, I also saw that Deirdre creates two phylogeny variance-covariance matrices, vcv_a and vcv_b, by calling the lambda_vcv function defined at the top of the Stan code, similar to the function with the same name in the original code. However, this time Deirdre uses quad_form_diag(local_vcv, rep_vector(sigma, rows(vcv))), which returns the product diag_matrix(rep_vector(sigma, rows(vcv))) * local_vcv * diag_matrix(rep_vector(sigma, rows(vcv))), as opposed to the product sigma_mat * lambda_mat * sigma_mat in the original code (see the sketch at the end of this comment). Here is the function definition in Deirdre's Stan code:

    [screenshot: lambda_vcv definition in Deirdre's Stan code]

    Vs. the function definition in your original Stan code:

    [screenshot: lambda_vcv definition in the original Stan code]

Both matrices are later factored with a Cholesky decomposition for more efficient numerical computation.

  3. The response variable prediction, saved in the yhat vector, looks more or less the same as before. The main difference is that the intercept a[sp[i]] is now added to b_force[sp[i]] * x1[i].
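
As a small aside on the function change in item 2, here is a toy numerical check in R (not taken from either Stan file) of the algebra involved. Stan's quad_form_diag(V, s) returns diag_matrix(s) * V * diag_matrix(s), so if sigma_mat in the original function is simply a diagonal matrix with sigma on the diagonal (as the name suggests), the two forms should produce the same matrix:

# Toy symmetric matrix standing in for the lambda-scaled phylogeny vcv
lambda_mat <- matrix(c(1.0, 0.5, 0.2,
                       0.5, 1.0, 0.4,
                       0.2, 0.4, 1.0), nrow = 3)
sigma <- 0.2
s <- rep(sigma, nrow(lambda_mat))

new_way <- diag(s) %*% lambda_mat %*% diag(s)  # what quad_form_diag(lambda_mat, s) computes
old_way <- diag(rep(sigma, 3)) %*% lambda_mat %*% diag(rep(sigma, 3))  # sigma_mat * lambda_mat * sigma_mat

all.equal(new_way, old_way)  # TRUE: the two forms give the same matrix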
dinaramamatova commented 2 years ago

I also summarized the differences between the R scripts used to run these Stan files that were not mentioned in my previous comment:

  1. The two R scripts use different numbers of cores, so there may be some discrepancies in performance between them -- but it is a relatively easy thing to change so that both use the same number of cores. Deirdre's model uses only 4 cores, while the original uses all the cores on the machine.

  2. Deirdre's code sets a seed twice -- once at the beginning of the R script and once in the call to the model itself -- while the original code does not set any seed at the beginning and only implicitly(?) uses a seed that gets generated while the model is running.

  3. Original code uses only 90 species, while Deirdre’s uses 200 (variable nspecies), therefore producing phylogenetic trees with different numbers of tips and internal nodes.

  4. Deirdre's R code uses both an intercept a and a slope b to estimate the response variable, while the original only uses slope values in its model. Here are the added intercept-related trait parameters (the slope-related trait parameters are the same as in the original code):

    [screenshot: intercept-related trait parameter values in the R code]

    Then, using the intercept trait parameters lam_interceptsa and sigma_interceptsa, Deirdre generates intercept simulations for the model's phylogeny:

    [screenshot: R code simulating intercepts on the phylogeny]
  5. From the previous pictures, we can also see that the two Stan models are invoked with different numbers of iterations and warmup steps, which also affects how long the two models run.

  6. Initial values are not passed when running the Stan model in Deirdre's R code, so I think the seed set at the beginning of her script is what makes the random draws reproducible. In the original code, initial values are generated by simu_inits, as we can see from the picture above.

  7. I am not sure how applicable this is, since Deirdre ended up not using the initial values generated by the simu_inits function, but here is the difference in how this function behaves between the two codes. When generating "good" initial values, the original code uses

    "b_force" = rnorm(n = nspecies, mean = b_z.temp, sd = sigma_interceptsb.temp)

    where the standard deviation is itself drawn earlier as

    sigma_interceptsb.temp <- runif(n = 1, min = 0, max = 1)

    If I understood it correctly, the standard deviation can therefore be skewed towards smaller values, whereas in Deirdre's code the standard deviation is fixed at 1 for both the intercept and b_force vectors:

    a_z.temp <- rnorm(n = nspecies, mean = param[["a_z"]], sd = 1)
    b_z.temp <- rnorm(n = nspecies, mean = param[["bzf"]], sd = 1)

  8. Finally, I tried running both models. It took 1.14559 minutes to run the original Stan code with your R file, without any changes to it. Using Deirdre's Stan model and R code, however, the whole run took 7.019836 hours in the background.

lizzieinvancouver commented 2 years ago

@dinaramamatova Amazing work on this!

Here are some differences between the two Stan files that I have found so far:

  1. For the priors, Deirdre and I do tend to declare them differently. I do it in the Stan code in the model block (only) and Deirdre does it so you can see them in the R code, which adds all the data block items. I don't think this should affect the speed of the code, but I guess it could not hurt to check as I never have. We'd want to make sure the models and test R code are otherwise identical.

  2. Great catch on this! We should compare the same model. I forgot that my model has phylogeny on the slope only; we should add it to the intercept. We can aim to do that together today.

  3. We should make sure the two sets of code have the same priors. Hopefully we can also look at those together today.

  4. Yes! This is one change I am interested in -- I would like to test that the changes to the function are not affecting the output or speed. Let's work on code where we can compare these differences with everything else the same.

  5. Same as 2 above; let's change this together today.

As I mentioned, the sequence of priors versus other model components in the model block should not matter.

lizzieinvancouver commented 2 years ago
  • The two R scripts use different numbers of cores, so there may be some discrepancies in performance between them -- but it is a relatively easy thing to change so that both use the same number of cores. Deirdre's model uses only 4 cores, while the original uses all the cores on the machine.

We should make sure all these parameters are the same across model runs.

  • Deirdre's code sets a seed twice -- once at the beginning of the R script and once in the call to the model itself -- while the original code does not set any seed at the beginning and only implicitly(?) uses a seed that gets generated while the model is running.

We should set seed in R and within Stan, but we should try out a suite of seeds. Let's discuss.

  • Original code uses only 90 species, while Deirdre’s uses 200 (variable nspecies), therefore producing phylogenetic trees with different numbers of tips and internal nodes.

We should use the same across models. I think 50 species is fine to get started with and we can test maybe 100 and 150 later.

  • Deirdre's R code uses both an intercept a and a slope b to estimate the response variable, while the original only uses slope values in its model. Here are the added intercept-related trait parameters (the slope-related trait parameters are the same as in the original code)

See 2 above -- we can try to fix all this today so the math of the models is the same.

  • From the previous pictures, we can also see that the two Stan models are invoked with different numbers of iterations and warmup steps, which also affects how long the two models run.

We should make sure all these model run parameters are the same across model runs.

We should also perhaps compare if the cholesky decomposition makes a difference?

lizzieinvancouver commented 2 years ago

@dinaramamatova Thanks for a great meeting today! I pushed the two files we worked on in commit a3b4b1d19643d5d4808f61fe07ba3f0c770c1078 but you'll need to update filenames and paths for them to run. Again, let me know how it is going. I am around all week and next.

We agreed to work on speed comparisons of:

  1. How the priors are given to the model.
  2. The two different functions (the f(x)s at the top of the Stan code).
  3. Whether or not we use the Cholesky decomposition.

We should start by deleting simu_inits and only include it if we need it. We should also make sure all the parameters, priors, and model run parameters (chains, cores, iterations, warmup) are the same, and set the seed in R and within the Stan command.

dinaramamatova commented 2 years ago

Hello @lizzieinvancouver,

Here is the overview of what I found so far:

The following values were the same each time I ran the model: I used 50 species to build the phylogenetic tree in the R code. To run the Stan model from R, I used 4 chains and 4000 iterations with the default 2000 warmup iterations. I also set the seed twice inside the code -- once at the beginning with the value 2021 and once in the stan function with 123456.
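
For reference, here is a minimal sketch of these fixed settings as they would appear in the R script (the Stan file name and data object names are placeholders, not the exact ones in the repo):

library(rstan)

set.seed(2021)        # R-side seed, used for the simulated data and tree
nspecies <- 50        # number of species used to build the phylogeny

runtime <- system.time(
    fit <- stan("speedtest_model.stan", data = stan_data,
                chains = 4, iter = 4000,   # warmup defaults to iter/2 = 2000
                seed = 123456)             # Stan-side seed
)
runtime[["elapsed"]] / 60   # runtime in minutes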

I adapted your model to generate the vector of intercepts a based on the following initial values:

a_z = 4 # root value intercept
lam_interceptsa = 0.4 # lambda intercept
sigma_interceptsa = 0.2 # rate of evolution intercept

I got these values from Deirdre’s code in https://github.com/lizzieinvancouver/pmm/blob/master/analyses/simulate_fit_oneslopeintercept_cholesky.R.

Priors:

Comparing the speed of the model without the priors fed in through the stan function, the model ran for 9.96039 minutes on the midge server, whereas the model where I include the priors as one of the arguments to the stan function in the R code took slightly less time, 8.087443 minutes, with the first set of priors that I tried:

  a_z_prior_mu = 4,
  a_z_prior_sigma = 5,
  lam_interceptsa_prior_alpha = 2,
  lam_interceptsa_prior_beta = 2,
  sigma_interceptsa_prior_mu = 0.2,
  sigma_interceptsa_prior_sigma = 1,
  b_z_prior_mu = 0.6,
  b_z_prior_sigma = 1,
  lam_interceptsb_prior_alpha = 2,
  lam_interceptsb_prior_beta = 2,
  sigma_interceptsb_prior_mu = 0.1,
  sigma_interceptsb_prior_sigma = 1,
  sigma_y_mu_prior = 0.01,
  sigma_y_mu_sigma = 0.5

I used some of the values from Deirdre's code and some from the model specification you used to generate the prior values in the current version of the uberspeed Stan code. After I ran the model, I got a warning about one divergent transition after warmup, so I tried changing the priors to the ones Deirdre used in her R code. This time it took 8.220557 minutes to run the model, and there were no warnings.

I also checked the results of the model with my original prior values above, and all of the other fit statistics seem to be fine -- R-hat values < 1.01 and ESS values > 100 * 4. So I am not exactly sure whether there is something wrong with the model, whether I need to pick better prior values, or whether this warning is small enough to ignore for now (https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup):

Even a small number of divergences after warm up cannot be safely ignored if completely reliable inference is desired. But if you get only a few divergences and you get good R-hat and ESS values, the resulting posterior is often good enough to move forward. There are also cases when a small number of divergences without any pattern in their locations can be verified to be unimportant, but this cannot be safely assumed without a careful investigation of the model.

In the end, I stuck with the prior values I originally chose, and I did not get the same warning about divergent transitions the next time I ran the model.

Cholesky: After I added the Cholesky decomposition to the Stan code, performance seemed to worsen when the priors were also fed into the Stan model: the runtime went up to 9.834572 minutes. Interestingly, this time I did not get any warning about divergent transitions, even though I used the same priors as above. What do you think could have caused the issue last time? When the priors are instead set with constant values inside the Stan code, the model with the Cholesky decomposition performed slightly better than without it -- 9.497299 minutes.

Function: To test the speed of the functions separately, I removed the Cholesky decomposition code and used plain multi_normal to generate a and b_force. I found that the function Deirdre used in her code performs better than the function in the original model. For example, the runtime of the model without priors passed in, but with the new function at the top of the Stan code, was 4.997349 minutes -- almost half the time it took to run the original model with the same priors, parameters, etc. Similarly, when the model gets priors fed into it, the new function performs better, though not to the same extent -- down to 5.574244 minutes.

Cholesky and function together: Adding the Cholesky decomposition to the model with the new function degraded the performance a bit, though it is still better than running the model without the new function at all -- 7.093775 minutes when no priors are given and 6.517073 minutes with the priors.

I don't know whether the divergent transition warning I encountered earlier is a big cause for concern, but I have checked some possible fixes online, and one quick solution is to increase the adapt_delta argument of the stan function in R. I raised it to 0.95, after which the warning seems to be gone, but the runtime of the simple model (no Cholesky decomposition, no function change, just the priors passed in) grew to 10.26628 minutes, since increasing adapt_delta increases sampling time. What would you like me to do regarding this issue?
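
For reference, a minimal sketch of how adapt_delta can be raised in the rstan call (the Stan file and data names are placeholders); a higher adapt_delta forces smaller step sizes, which usually removes divergences but always slows sampling:

fit <- stan("speedtest_model.stan", data = stan_data,
            chains = 4, iter = 4000, seed = 123456,
            control = list(adapt_delta = 0.95))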

dinaramamatova commented 2 years ago

Also I looked into @DeirdreLoughnan's code that you forwarded to me and tried to make changes based on it in the uberspeed.stan, but I had some runtime errors and I also had questions regarding some of the parameters, so I decided to wait until Deirdre comes back to ask her about it.

lizzieinvancouver commented 2 years ago

@dinaramamatova This is great work! I am fascinated that how the priors are fed in makes a difference. Also, it is great to confirm the updated function is speeding things up, and surprising that the cholesky does not speed things up more. We might see more of an effect if we ever scale up to more species (but we don't need to do that yet).

However, we should now scale up and see if these changes are consistent across different seeds. You'll want to do this fully scripted -- with help from me and @DeirdreLoughnan -- and push your scripts (in speedtests/) to the repo. Here's some pseudo-code of how I would do this:

  1. Figure out what output you need to extract from each model run, and add code to do that. I would extract the mean and the 25% and 75% quantiles, something like: summary(fit)$summary[names(param), c("25%", "mean", "75%")]. Next, you should extract the Rhat and the number of divergent transitions ... I will add a file (stan_utility.R) with a function that helps with this, though you'll still need to add a way to extract the relevant parts of the output from the function. You also want to extract the system time, of course! Get these all written to a tidy csv file at the end of a model run. I also strongly suggest the first column or row of the data (depending on how you format it) include the true parameters and the seeds you used for Stan and for R. This will be useful for comparisons later. (A rough sketch of steps 1 and 2 is below this list.)
  2. Now you want to set up your code to loop over various Stan seeds (maybe 10-20 of them?), run the model, and record the output. A few hints: first, you don't need to loop over the data generation step, just the model. Second, I would do this on a model with a small number of species so you can run and test this step quickly (maybe start with 10-20 species, then go back to 40-60 for your server tests). Eventually this will go on the server.
  3. Now you need to set this up to test the prior placement, the cholesky and the f(x) ... you can do this however you see best. An easy way would be to build several different R files (and Stan files called within) -- one for each different change -- making sure the csv files write out with different names.
  4. You can then put these all on the server (while you cannot speed up the Stan runs, you can send many jobs).
  5. Pull the output and compare with plots!
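
Here is a rough sketch (not the final script) of steps 1 and 2: loop over Stan seeds, refit the same simulated data, and collect summaries, divergences, and runtime into one csv. The objects stanmodel, simdata, and param, the Stan file path, and the csv name are placeholders for whatever the real scripts build:

library(rstan)

stanmodel <- stan_model("stan/speedtest_model.stan")  # compile once, outside the loop
stan_seeds <- 101:110                                 # maybe 10-20 Stan seeds

out <- list()
for (s in stan_seeds) {
    t0 <- Sys.time()
    fit <- sampling(stanmodel, data = simdata, chains = 4,
                    iter = 4000, warmup = 2000, seed = s)
    runtime <- as.numeric(difftime(Sys.time(), t0, units = "mins"))
    # Posterior summaries for the parameters we care about
    summ <- summary(fit)$summary[names(param), c("25%", "mean", "75%")]
    # Count divergent transitions across all chains (post-warmup)
    ndiv <- sum(sapply(get_sampler_params(fit, inc_warmup = FALSE),
                       function(x) sum(x[, "divergent__"])))
    out[[as.character(s)]] <- data.frame(stan_seed = s, parameter = names(param),
                                         true_value = unlist(param), summ,
                                         n_divergent = ndiv, runtime_mins = runtime)
}

write.csv(do.call(rbind, out), "speedtest_example.csv", row.names = FALSE)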

If all this goes well, we could then add some random R seeds also, but let's start here. If you have questions I can meet tomorrow -- just let me know.

lizzieinvancouver commented 2 years ago

I don't know whether the divergent transition warning I encountered earlier is a big cause for concern, but I have checked some possible fixes online, and one quick solution is to increase the adapt_delta argument of the stan function in R. I raised it to 0.95, after which the warning seems to be gone, but the runtime of the simple model (no Cholesky decomposition, no function change, just the priors passed in) grew to 10.26628 minutes, since increasing adapt_delta increases sampling time. What would you like me to do regarding this issue?

@dinaramamatova Divergent transitions are generally bad. You really don't want any for a model you will publish because they tell you the model did not completely explore the space well. In my experience, however, a few (<10 in models with 4000 sampling iterations) can be okay. In a case like this, getting a few seems okay. We could probably get rid of them by running more warmup and/or increasing adapt_delta -- but adapt_delta will ALWAYS slow down your code, so I don't think we need to do it here, yet.

Which is all to say -- for now, let's record divergent transition numbers, but not worry about them much.

You should worry more when other diagnostics look bad and/or you have a lot of them. When I run a model with 4000 sampling iterations and >1000 divergent transitions it usually means the model is simply coded wrong or something deeper.

lizzieinvancouver commented 2 years ago

To use the stan_utility ...

util <- new.env()
source('stan_utility.R', local = util)
# Then run your model, say the output is called fit ...
util$check_all_diagnostics(fit)
lizzieinvancouver commented 2 years ago

Functions in the utility we need here are: check_div and check_rhat
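
Assuming the same util environment as in the snippet above, those two checks can also be run on their own, e.g.:

util$check_div(fit)    # reports how many iterations ended in a divergence
util$check_rhat(fit)   # flags any parameters whose Rhat looks too high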

lizzieinvancouver commented 2 years ago

Plot ideas:

... assuming no major Rhat or divergent issues. Otherwise we'll have to summarize that too.

dinaramamatova commented 2 years ago

Hello @lizzieinvancouver,

I pushed my R and Stan files, each changing a different component of the model, to the repo in the https://github.com/lizzieinvancouver/pmm/tree/master/analyses/speedtests folder. Each R script uses its respective Stan file in the stan subfolder, and they write tidy csv files to https://github.com/lizzieinvancouver/pmm/tree/master/analyses/output/speedtests. It seems that all model runs have normal Rhat values, and the percentage of divergent transitions does not exceed 0.026% of all iterations.

As for the runtime comparison and visualization, I got the following:

SpeedVsModel_Boxplot.pdf

You can find the file and all other visualizations in https://github.com/lizzieinvancouver/pmm/tree/master/analyses/speedtests_viz. And here is the code that generates them: https://github.com/lizzieinvancouver/pmm/blob/master/analyses/uberspeed_test_viz.R.

It looks like swapping the old function for the new one does indeed shorten the runtime of the model a bit. Adding our own priors to the model also generally slows it down slightly (at least when looking at the mean runtime across all seeds). However, adding the Cholesky decomposition to the uberspeed model with 50 species did not bring any improvement in performance; it only worsened it.

When I looked at how far off the mean parameter values from the different model variations were from the actual parameters fed into them, I found that the most accurate values were recorded for sigma_y, where the mean was off by at most -0.000190 (model with the new function, with Cholesky, and no priors), while the least accurate values were recorded for lam_interceptsb, which was off by -0.162 to -0.168. You can find these visualizations in the https://github.com/lizzieinvancouver/pmm/tree/master/analyses/speedtests_viz/mean_vs_true_vals subfolder. Do any of the mean values seem weird to you?

I also calculated the number of model runs where the actual value of the parameter lay within the 25-75% interval, and it was the same number, 40, for all of the model types. So in the other 30 model runs the true parameter values do not lie within this interval: https://github.com/lizzieinvancouver/pmm/blob/master/analyses/speedtests_viz/TrueParamValWithinInterval.pdf

As we can see, these 40 model runs are evenly distributed across 4 parameters -- a_z, b_z, lam_interceptsa, sigma_y -- which consistently fall within the interval for all 10 seeds and all model types. The other parameters -- lam_interceptsb, sigma_interceptsa, sigma_interceptsb -- never fall within the 25-75% interval, for any seed or model type.
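
A rough sketch of how such a coverage check could be computed from one of the tidy csv files (the file name and column names here -- true_value, q25, q75 -- are placeholders, not necessarily the ones used in the actual output):

res <- read.csv("speedtest_example.csv")

# TRUE when the true parameter value sits inside the 25-75% posterior interval
res$covered <- res$true_value >= res$q25 & res$true_value <= res$q75

# Count of covered runs per parameter, across seeds and model types
aggregate(covered ~ parameter, data = res, FUN = sum)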

lizzieinvancouver commented 2 years ago

@dinaramamatova This is fantastic work -- thank you! I am very impressed, and hopefully Deirdre and I will follow up on this once we get a new version of the cholesky working (it is supposed to speed things up) and then also try it with more species.

Do any of the mean values seem weird to you?

Not really. It makes sense that sigma_y is well estimated -- it is estimated using ALL the data since it is error at the observation level (i.e., one row of data). The lam_intercepts and sigma_intercepts (both a and b) are estimated at the species level, so the model has far less data to learn them from. I think it's usually harder to estimate a slope (b) than an intercept (a), but I actually need to work out why that's true....

lizzieinvancouver commented 1 year ago

@dinaramamatova Thanks for all your help on this. I am closing it with an email note from Mike Betancourt (26 Oct 2022 email):

[Me] Anyway, I said I would send two things in case you were interested. I want to organize these more for you, but am not sure when the time for that will come or exactly how best to do it, so I will send them now and if you're really intrigued then I will get them ship-shape in some way useful to you: (1) Tests where we found that sending the priors in from R slowed down the model.

[Mike] Old/new function presumably refers to optimization of lambda_vcv? And cholesky/no cholesky refers to centering or non-centering the multivariate normals?

Do the box plots quantify variation in many runs with the same data and different seeds?

Passing in an external prior configuration seems to have a small effect, smaller than the variation in the box plot although systematic across all four of the other configurations. There might be some weird hardware optimizations that are blocked by having too many data variables but overall the difference seems small enough that I wouldn’t worry too much and focus on what is more productive for the analysis.