Closed paciorek closed 6 months ago
@paciorek Had a peek and overall it's looking great. I'll make some time tomorrow to give it a full check and play around with some examples. Great work getting it all cleaned up.
Good, thanks @paul-vdb. If you can pass your examples on to me once you're done, they may help me in setting up tests.
@paciorek I started checking this through. Looks like you started changing function names from 'get' to 'set' (better name!) but hadn't completed that. I pushed those changes.
I'll look through more thoroughly tomorrow. I've written an N-mixture model to test the stochastic binomial size. I see in the sandbox we have tests for a multivariate normal beta, standard logistic regression, and zero-inflation through occupancy. Have you run this with the MSOM as blocks, as previously discussed on the other thread?
Actually, I was torn about get vs. set. For the design matrix, set seemed right, while for the others get seemed more appropriate even though they are setting into member objects rather than returning the "gotten" info.
As far as the sandbox tests, thanks for reminding me of those. I will copy them over to the new nimble branch when I start working on tests.
As far as blocks, I actually decided that was a rabbit hole/red herring we shouldn't go down now. We'll expect that a user would assign PG as a block to all parameters involved in a given set of response values.
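For concreteness, something like the following is what I have in mind (parameter names made up; this is just the standard configureMCMC/addSampler usage):

conf$removeSamplers(c("beta0", "beta1", "beta2"))
## Assign the PG sampler jointly (as a block) to all coefficients that appear in the
## linear predictor for the given set of response values.
conf$addSampler(type = "sampler_polyagamma", target = c("beta0", "beta1", "beta2"))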
@paciorek I actually think 'set' is good because those functions cache results rather than returning anything for a caller to 'get'.
Ok, I'm fine with set.
@paciorek I've reinstalled the branch but the PG sampler won't compile. It runs fine in R. Was it compiling locally for you? I tried two different examples. I got this error message both times. Not really sure where it's falling apart. Any ideas?
Sorry, I didn't fully check things. Can you send me your example and I will debug?
Will push the changes I was just making now, regarding checking whether singleSize == TRUE. If the size is stochastic, then I think we don't need to check singleSize at all. We also only need to run that check one time.
library(nimble)   # provides expit() plus the model-building and MCMC functions used below

set.seed(1)
lambda <- 6
beta0 <- 0.5
beta1 <- 0.15
J <- 20    # visits per site
K <- 10    # sites
type <- rep(0:1, each = K/2)   # site-level covariate, length K to match type[k] below
N <- rpois(n = K, lambda = lambda)
y <- matrix(NA, nrow = K, ncol = J)
for(i in 1:K) {
  y[i, ] <- rbinom(n = J, size = N[i], prob = expit(beta0 + beta1*type[i]))
}

code <- nimbleCode({
  beta0 ~ dnorm(0, sd = 10000)
  beta1 ~ dnorm(0, sd = 10)
  lambda ~ dgamma(1, 1)
  for(k in 1:K) {
    p[k] <- expit(beta0 + beta1*type[k])
    N[k] ~ dpois(lambda)
    for(j in 1:J) y[k, j] ~ dbinom(size = N[k], prob = p[k])
  }
})

# Polya-gamma sampler:
mod <- nimbleModel(code = code, constants = list(J = J, K = K, type = type),
                   data = list(y = y, N = rep(NA, K)),
                   inits = list(beta0 = 0, beta1 = 0, lambda = 5, N = apply(y, 1, max) + 5))
conf <- configureMCMC(mod, monitors = c('beta0', 'beta1', 'lambda'))
conf$removeSamplers(c("beta0", "beta1"))
conf$addSampler(type = "sampler_polyagamma", target = c("beta0", "beta1"))
Rmcmc <- buildMCMC(conf)
Cmod <- compileNimble(mod)
Cmcmc <- compileNimble(Rmcmc, project = mod)
Ok, I've merged your changes with some of mine, including that we can't use setSize because that is a function name we use in the DSL and so it is causing a conflict.
As far as singleSize and stochSize, I agree it seems silly to check for single size if we have, say, n[i]. But if we have a single size that is stochastic, such as dbin(n, p[i]), it might save some computation. However, I don't know if that is a realistic use case, so I'm not inclined to do anything about that for now.
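For concreteness, a made-up fragment of the single-stochastic-size case I mean:

code <- nimbleCode({
  n ~ dpois(10)                           # a single size that is itself stochastic
  for(i in 1:N) {
    logit(p[i]) <- b0 + b1*x[i]
    y[i] ~ dbinom(size = n, prob = p[i])  # every response shares the same size n
  }
})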
@paciorek good point. I suppose you could have a single stochastic size. Is it slower to repeatedly loop to check size in that case, or to keep size as a vector? I think it's splitting some hairs there for sure.
My fault for breaking the code when renaming the methods. I'll have another look at it today. Thanks
@paciorek I ran a couple of models, read through what you've changed, and overall I think it looks really good. I assume that we will find efficiencies in the future, but it seems really usable as it is. I didn't see any documentation. The examples I ran were an N-mixture model (stochastic size) and a random effects logistic regression (stochastic design), with variations of the latter where I applied a random walk sampler to the random effects and the Polya-gamma sampler to the fixed effects. Mixing the samplers didn't break anything, which was good too! It will be interesting to see this on bigger data problems where any inefficiencies would be more visible.
@paul-vdb @perrydv @danielturek @kenkellner This adds an extensive set of tests as well as documentation for the Polya-gamma sampler.
A few comments:
- The documentation is in MCMC_samplers.R. More eyes would be great, particularly from Paul. I didn't add anything to the manual as we don't generally describe all our samplers, but I'm open to other thoughts on that.
- The user needs to provide control$nonTargetNodes if there are any stochastic nodes in the linear predictor that are not being sampled as part of the PG sampler; this would include stochastic indexes (see the sketch after this list). With more work, we could probably deal with this in a more automated fashion, but I think it's ok for now.
- Writing the zero inflation directly into the probability expression, as in z[i]*expit(b0+b1*x[i]), is not allowed. This is because of how I do the checking for the logit link. Again, with more work we could probably deal with this more elegantly, but I think it is ok for now.
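Sketch of what providing that might look like (the random-effect node name 'u' is made up; it's just passed through the usual addSampler control list):

## Random effects u[1:n] appear in the linear predictor but are sampled by their own
## samplers rather than by the PG sampler, so they are listed as non-target nodes.
conf$addSampler(type = "sampler_polyagamma", target = c("beta0", "beta1"),
                control = list(nonTargetNodes = "u"))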
@paciorek Looks great overall. I couldn't see anything wrong with it, and you've highlighted the sticky points above. The documentation looks clear (at least to me, who is already familiar), but comments from someone less familiar might be helpful.
Minor comment: it might be good to put the get_param functions in another file for more general use. I'll want to use them with Laplace for prior Hessian computation in the future.
I think on the checks you've given the user enough options to choose not to check and just run it on faith. For example, when they've written it as z[i]*expit(b0+b1*x[i]), it seems like they can just set check = FALSE, so we do actually allow it; we just can't flag it as okay right now.
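Something like this is what I'm picturing (assuming check is passed through the control list like the other options; I haven't double-checked exactly where that argument lives):

## Skip the functional-form checking and run on faith.
conf$addSampler(type = "sampler_polyagamma", target = c("beta0", "beta1"),
                control = list(check = FALSE))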
I read through the tests you provided, and they look well thought out and cover most of the key pieces.
@paciorek
The docs were very clear to me as someone less familiar with PG. In my experience, specifically from an occupancy perspective, I don't think I've seen anyone use a z[i]*expit(b0+b1*x[i]) style declaration, so I doubt that will be a major issue.
@paul-vdb I agree about moving the get_param functions. That said, they shouldn't go in the Laplace files because those will be moved out into nimbleQuad eventually. I'll think about where they might better belong.
@paciorek I've had a pretty good look at this.
- Setup efficiency: model$getDistribution(target) is done twice. Often those are slow steps in big use cases, so we want to retain the result from one and only one call. In this specific example it probably doesn't matter. Might be worth scouring for other repeats.
- It looks like the checking is done only on the first probNode. Is this reliable?
- More user control, e.g. over stochSize (example: maybe the sizes are stochastic in the model structure but the user knows they are not being sampled and so are not stochastic for the PG sampler's purposes) and zeroInflated (are there any use cases where it would be simpler or more general to just let a user indicate this?).
- Whether the final model$calculate is needed. As we've seen, that is a meaningful computational cost and often will not be necessary.

I realize some of these may not be worth time in the immediate crunch.
I also put in some timers for various steps, and here is what I came up with:
Regarding setup efficiency, I cleaned up the case you pointed out and another one I found. If you notice anything else specific, LMK.
Regarding reliability, the idea was to check only one because otherwise the checking could become quite burdensome for large models like our occupancy examples. I thought I had protected things, in terms of it being ok to just check the first case, by checking that all the response nodes are from the same declaration, but I realized that it's really rather more complicated. I've added a bunch more checking that aims to ensure that one can just use the first case for the checking. It's all getting rather involved, which feels dangerous/finicky, but I don't see another way of avoiding making the setup quite time-consuming.
Regarding more user control, given our time crunch, I'm inclined to leave them aside. In particular allowing users to override calculation feels dangerous as it depends on knowing the implications of the exact sampler ordering. We could make the same argument about conjugate samplers too...
Single iterations are done because those tests just check initialization of designMatrix and related things so we just need to make sure the initialization code runs.
As far as timing, that is consistent with the timing I got for our occupancy examples. I'd be curious about your run-time efficiency ideas as I put some effort into improving efficiency already.
Thanks. The reason I was thinking it might not be too dangerous to give an option to skip the calculation at the end is that in a typical case the "y[i]" will be data and there shouldn't be anything else that depends on any of the parameters (with the possible exception of posterior predictive nodes). But you make the final call for this release.
Here is maybe a more noticeable thing: Is the restriction to >= 2 parameters due to the vector/scalar hassles? If so I could probably insert the "one_time_fixes" approach and allow it to work for 1 parameter as well. How important?
Regarding skipping the calculate -- what happens when we then go to sample parameters in the prior of the PG target nodes? They won't access the correct logProb for their dependencies (the PG target nodes). I suppose you were thinking we could run calculate just on the target nodes at the end rather than on the full calcNodes? And what about if this is being done on occupancy linear predictor parameters? Then when the zero inflation ("z") is sampled via our binary sampler, it won't have the correct logProb values. I might be missing something.
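To be concrete about the alternative I'm picturing (using the node-vector names from the sampler's setup code):

## At the end, update logProbs only for the target nodes rather than for the
## full set of dependencies:
model$calculate(target)      # instead of model$calculate(calcNodes)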
Regarding >= 2 parameters, yes that is vector/scalar hassles, perhaps most importantly in the matrix/dmnorm calcs involved in the conjugate update for the target.
@perrydv I'm inclined to merge this and not deal with the scalar target case, as that's not just a one-time fixes thing; it means adding run code to do the math with dnorm.
I added handling of a scalar parameter. Good thing I did, as I also realized we weren't handling a scalar/vector issue related to a single (vector) target, in terms of nodeLengths and normTypes potentially being scalars.
Merging this as test failures appear to be unrelated.
Do not merge. This is for review. I need to add testing (both of sampling correctness and of the checking that model functional form is ok for PG).
This PG sampler allows for:
This takes the original sampler code from @paul-vdb and reworks some of the checking to try to catch more cases where PG is not appropriate as well as trying to improve efficiency.
In terms of checking, we use some of our conjugacy checking code to check that the response probabilities are the product of a probability times one or more zero-inflation binary values and that the probability is logit-linked to a linear predictor of the target elements.
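For illustration, the kind of declaration the checking is aimed at (a made-up fragment, not one of the actual tests):

code <- nimbleCode({
  psi ~ dunif(0, 1)
  b0 ~ dnorm(0, sd = 10)
  b1 ~ dnorm(0, sd = 10)
  for(i in 1:n) {
    z[i] ~ dbern(psi)                # zero-inflation binary value
    logit(p[i]) <- b0 + b1*x[i]      # probability logit-linked to the linear predictor
    y[i] ~ dbern(z[i] * p[i])        # response probability: binary value times probability
  }
})
## In contrast, zero inflation written directly as y[i] ~ dbern(z[i] * expit(b0 + b1*x[i]))
## is the form the checking does not handle.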
We manually fill in the X (design) matrix (in similar fashion to how we handle conjugacy in regression contexts) if not provided by the user, but we only do this once if the user tells us X is fixed (i.e., no missing values in X and no stochastic indexing in terms of which row of X is relevant for a given response).
@paul-vdb would be great if you could take a careful look, but I understand you may have time limitations.
@perrydv @danielturek @kenkellner just flagging this for you if of interest.