tomwallis opened 8 years ago
@tomwallis Glad you're interested in contributing. @bgoodri have you seen these models before? I'm not familiar with them but it would be nice to be able to fit them if they're useful. And it shouldn't be too hard to do if it's just a link function. It's not the simplest process, but adding just a new link function will be less work than adding a new modeling function. @bgoodri and I can answer any questions you have along the way.
I think the changes in PR #62 (which will be merged soonish but should already be usable) will make adding new link functions easier. We're adding an rstanarm_family function as an alternative to specifying family the traditional way, so you could have something like this:

stan_glm(..., family = rstanarm_family("binomial", link = "mafc_logit", lower = 0.5, upper = 0.95))

There's doc for rstanarm_family here.
As for the bernoulli.stan file, if really the only thing new is just the link function then I think all you'd need to do is add to the linkinv_bern, ll_bern_lp, and pw_bern functions defined at the top (the lower and upper bounds would also need to get declared in the data block, I guess). The link argument is an integer indicating the link function to use. You can just use one greater than the largest integer currently used for a link function. The "mafc_logit" link would then need to get mapped to that integer in stan_glm.fit (we can help you navigate stan_glm.fit or do that part for you).
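To make that concrete, here is a rough sketch of what the new branch in linkinv_bern could look like. The integer code 6, the function signature, and the way the bounds are passed in are all guesses for illustration; the actual function in exec/bernoulli.stan will differ:

vector linkinv_bern(vector eta, int link, real lb, real ub) {
  vector[rows(eta)] p;
  if (link == 6) { // hypothetical integer code for mafc_logit
    // squeeze inv_logit(eta) into (lb, ub) instead of (0, 1)
    for (n in 1:rows(eta)) p[n] = lb + (ub - lb) * inv_logit(eta[n]);
  }
  // ... existing branches for the other link integers go here ...
  return p;
}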
Also, the "Adding a new model" and "Contributing to development" pages on the wiki might be useful, although they're only partially written (still working on those). The "Adding a new model" page will eventually be a full example of adding an entirely new modeling function, but some of the content applies to your case too.
Hope that helps get you started. Let me know if I forgot to address any of your questions.
Yeah, squeezing the probabilities to some subinterval of (0,1) wouldn't be that hard, even if the bounds were unknown. It would be easier to implement once the rstanarm_family thing is merged. But anything that is added to exec/bernoulli.stan should also be added to exec/binomial.stan.
I realised yesterday that the brms package allows these bounds to be nonzero, and even (I think) estimated from data, using the nonlinear syntax in that package.

Since brms seems to cover my use case, I'm happy to leave it up to you whether you still think it's worthwhile having squeezed bounds implemented in rstanarm. I'm getting the sense that rstanarm is focused on providing pre-compiled common models, whereas brms requires compilation but offers more flexibility in model specification (with perhaps associated convergence problems) than is currently in rstanarm. Is that a fair assessment?
That's fair, but squeezing the probabilities wouldn't be too difficult or interfere with anything else we are doing with rstanarm.
You could start by putting something like this into the transformed data block:
int<lower=0,upper=1> has_lower;
int<lower=0,upper=1> has_upper;
And you can temporarily set those to 0 or 1 for testing (which unfortunately requires a recompile).
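For instance, a hard-coded version for initial testing could look like this (the values here are arbitrary, just for illustration):

transformed data {
  int<lower=0,upper=1> has_lower;
  int<lower=0,upper=1> has_upper;
  has_lower = 1; // hard-coded while testing; eventually passed from R as data
  has_upper = 1;
}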
Then in the parameters block, include
real<lower=0,upper=1> lower[has_lower];
real<lower=maybe_lower(has_lower, lower), upper=1> upper[has_upper];
where maybe_lower is a function in the functions block like

real maybe_lower(int has_lower, real[] lower) {
  if (has_lower == 1) return lower[1];
  else return 0;
}
Then make a function in the functions block that evaluates the log-likelihood both for the sample and for each point when those bounds exist. Once you get the Stan part of it going, we can figure out the best way to pass has_lower and has_upper from R to the data block rather than defining them in the transformed data block.
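As a rough sketch of the pointwise piece (the function name and signature here are invented for illustration; the real functions to extend are things like pw_bern and ll_bern_lp):

real pw_bern_squeezed(int y, real eta, int has_lower, real[] lower,
                      int has_upper, real[] upper) {
  real lb;
  real ub;
  real p;
  lb = 0;
  ub = 1;
  if (has_lower == 1) lb = lower[1];
  if (has_upper == 1) ub = upper[1];
  p = lb + (ub - lb) * inv_logit(eta); // squeezed inverse link
  return bernoulli_lpmf(y | p);
}

The sample-wide log-likelihood would just accumulate the same terms over all observations.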
Great, thanks for the pointers on implementation. However, given that my priority right now is to model my data (and brms appears to let me do that), I'm not sure when I will be able to spend time contributing the above to rstanarm. Thanks for the discussion, and I will monitor developments.
Summary:
I would like to add some custom binomial link functions for rstanarm; I'm seeking guidance to do that.
Description:
In psychophysics and neuroscience, it is common to use forced-choice experimental designs that result in a nonzero lower bound for binomially-distributed data. That is, the data will asymptote to something like 0.5 or 0.25 rather than 0 (see further example below). It would be useful if one could include custom versions of the binomial family link functions that show this behaviour. Specifically, the custom inverse link functions take the standard binomial inverse link functions (e.g. the logit) and squish them into a new range:
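For the logit link, for example, the squeezed inverse link would be p = lower + (upper - lower) * inv_logit(eta), so that fitted probabilities span (lower, upper) rather than (0, 1).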
In a task with two possible responses (see below), the lower bound would be fixed at 1/2. In a task with four possible responses, the lower bound would be 1/4. Sometimes it is even desirable to have the lower and upper bounds be free parameters (i.e. fitted to data), but I expect this would require a major recoding of the relevant rstanarm functions (because these free parameters would be outside the linear predictor of the GLM / GLMER). So being able to set the lower bound (and maybe the lapse rate) to nonzero values would fulfil the bulk of use cases here. Obviously this changes how one can interpret the linear predictor coefficients (e.g. they won't be log odds any more for the logit link). Nevertheless, this would be a very useful addition to rstanarm for people like me. I would envisage it being added to stan_glm and stan_glmer at this stage.

Steps to add new link functions?
In this related issue, @jgabry mentioned he could outline the steps needed to add a new link function. I've forked the repo and started browsing bernoulli.stan, but I'm a little lost in the internals of the package. Some high level guidance would be useful. Also, what tests / docs should I aim to provide for a successful pull request?
Envisaged user interface
The above example would use an mafc_logit link function (mAFC stands for "m alternative forced-choice") with lower set to 0.5 and upper set to 0.95, resulting in a logistic link function whose inverse link spans from 0.5 to 0.95. I'm happy to take suggested changes to this interface. I'm also not sure whether this envisaged use would require adjustment to the internal validate_family function.

More intuitive example:
Imagine I show you a screen that contains an image on one side and nothing (grey screen) on the other. Your task is to tell me which side the image is on ("left" or "right"). Over consecutive trials the image is presented randomly to the left or right. We can recode each response to be "correct" or "incorrect", and we usually assume that responses over many trials are independent and binomially distributed.
If I make the image very faint you won't be able to see it. On average, your performance (proportion correct) will be 0.5. These link functions allow one to fit a sigmoid that asymptotes to this theoretical chance performance level (0.5 in this case) rather than to zero (like a standard binomial link function). Though real data could fall below 0.5 (due to measurement noise, or the observer mistakenly flipping responses), new data would be expected to fall no lower than 0.5 on average. Knoblauch and Maloney have written a useful book describing approaches to fitting such data in R.