
Custom binomial link functions for rstanarm #87

Open · tomwallis opened this issue 8 years ago

tomwallis commented 8 years ago

Summary:

I would like to add some custom binomial link functions for rstanarm; I'm seeking guidance to do that.

Description:

In psychophysics and neuroscience, it is common to use forced-choice experimental designs that result in a nonzero lower bound for binomially-distributed data. That is, the data will asymptote to something like 0.5 or 0.25 rather than 0 (see further example below). It would be useful if one could include custom versions of the binomial family link functions that show this behaviour. Specifically, the custom inverse link functions take the standard binomial inverse link functions (e.g. the logit) and squish them into a new range:

yhat = lower_bound + (1 - lower_bound - lapse_rate) * inv_logit(eta)

In a task with two possible responses (see below), the lower bound would be fixed at 1/2; in a task with four possible responses, it would be 1/4. Sometimes it is even desirable to have the lower and upper bounds be free parameters (i.e. fitted to data), but I expect this would require a major recoding of the relevant rstanarm functions (because these free parameters would sit outside the linear predictor of the GLM / GLMER). So being able to set the lower bound (and maybe the lapse rate) to nonzero values would cover the bulk of use cases here. Obviously this changes how one can interpret the linear predictor coefficients (e.g. they won't be log odds any more for the logit link). Nevertheless, this would be a very useful addition to rstanarm for people like me. I would envisage it being added to stan_glm and stan_glmer at this stage.
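
In Stan pseudocode, the inverse link I have in mind would look something like this (mafc_inv_logit is just a placeholder name for illustration, not an existing identifier):

real mafc_inv_logit(real eta, real lower_bound, real lapse_rate) {
  // squish inv_logit(eta) from (0, 1) into (lower_bound, 1 - lapse_rate)
  return lower_bound + (1 - lower_bound - lapse_rate) * inv_logit(eta);
}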

Steps to add new link functions?

In this related issue, @jgabry mentioned he could outline the steps needed to add a new link function. I've forked the repo and started browsing bernoulli.stan, but I'm a little lost in the internals of the package. Some high level guidance would be useful. Also, what tests / docs should I aim to provide for a successful pull request?

Envisaged user interface

fit <- stan_glm(y ~ x1 + x2, data = dat, 
                family = binomial(link = "mafc_logit", lower = 0.5, upper = 0.95))

The above example would use an mafc_logit link function (mAFC stands for "m alternative forced-choice") with lower set to 0.5 and upper set to 0.95, resulting in a logistic link function whose inverse link spans from 0.5 to 0.95.

I'm happy to take suggestions on this interface. I'm also not sure whether this envisaged use would require adjustment to the internal validate_family function.

More intuitive example:

Imagine I show you a screen that contains an image on one side and nothing (grey screen) on the other. Your task is to tell me which side the image is on ("left" or "right"). Over consecutive trials the image is presented randomly to the left or right. We can recode each response to be "correct" or "incorrect", and we usually assume that responses over many trials are independent and binomially distributed.

If I make the image very faint you won't be able to see it, and on average your performance (proportion correct) will be 0.5. These link functions allow one to fit a sigmoid that asymptotes to this theoretical chance performance level (0.5 in this case) rather than to zero, as a standard binomial link function does. Though real data could fall below 0.5 (due to measurement noise, or the observer mistakenly flipping responses), new data would be expected to fall no lower than 0.5 on average. Knoblauch and Maloney have written a useful book describing approaches to fitting such data in R.

jgabry commented 8 years ago

@tomwallis Glad you're interested in contributing. @bgoodri, have you seen these models before? I'm not familiar with them, but it would be nice to be able to fit them if they're useful. And it shouldn't be too hard if all that's new is the link function: the process isn't trivial, but adding a link function is much less work than adding a new modeling function. @bgoodri and I can answer any questions you have along the way.

I think the changes in PR #62 (which will be merged soonish but should already be usable) will make adding new link functions easier. We're adding an rstanarm_family function as an alternative to specifying family the traditional way, so you could have something like this:

stan_glm(..., family = rstanarm_family("binomial", link = "mafc_logit", lower = 0.5, upper = 0.95))

There's documentation for rstanarm_family here.

As for the bernoulli.stan file, if the only new thing is the link function, then I think all you'd need to do is add to the linkinv_bern, ll_bern_lp, and pw_bern functions defined at the top (the lower and upper bounds would also need to get declared in the data block, I guess). The link argument is an integer indicating the link function to use; you can just use one greater than the largest integer currently used for a link function. The "mafc_logit" link would then need to get mapped to that integer in stan_glm.fit (we can help you navigate stan_glm.fit or do that part for you).
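
Very roughly, the new branch might look something like this (I'm making up the signature and the integer codes here, so check the actual file for the real ones):

vector linkinv_bern(vector eta, int link, real lb, real ub) {
  vector[rows(eta)] pi;
  if (link == 1) { // existing logit link (assumed code)
    for (n in 1:rows(eta)) pi[n] = inv_logit(eta[n]);
  }
  else if (link == 6) { // hypothetical new mafc_logit code
    for (n in 1:rows(eta)) pi[n] = lb + (ub - lb) * inv_logit(eta[n]);
  }
  // ... other existing link branches elided ...
  return pi;
}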

Also, the Adding a new model and Contributing to development pages on the wiki might be useful, although they're only partially written (still working on those). The Adding a new model page will eventually be a full example of adding an entirely new modeling function, but some of the content applies to your case too.

Hope that helps get you started. Let me know if I forgot to address any of your questions.

bgoodri commented 8 years ago

Yeah, squeezing the probabilities to some subinterval of (0,1) wouldn't be that hard, even if the bounds were unknown. It would be easier to implement once the rstanarm_family thing is merged. But anything that is added to exec/bernoulli.stan should also be added to exec/binomial.stan.
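
For binomial.stan, the squeezed log-likelihood would be something along these lines (a sketch with made-up names):

real ll_binom_squeezed(int[] y, int[] trials, vector eta, real lb, real ub) {
  // sum of binomial log-likelihoods with the success probability
  // squeezed into (lb, ub)
  real ll = 0;
  for (n in 1:rows(eta))
    ll += binomial_lpmf(y[n] | trials[n], lb + (ub - lb) * inv_logit(eta[n]));
  return ll;
}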

tomwallis commented 8 years ago

I realised yesterday that the brms package allows these bounds to be nonzero, and even (I think) estimated from data, using the nonlinear syntax in that package.

Since brms seems to cover my use case, I'm happy to leave it up to you whether you still think it's worthwhile having squeezed bounds implemented in rstanarm. I'm getting the sense that rstanarm is focused on providing pre-compiled common models, whereas brms requires compilation but offers more flexibility in model specification (with perhaps some associated convergence problems) than rstanarm currently does. Is that a fair assessment?

bgoodri commented 8 years ago

That's fair, but squeezing the probabilities wouldn't be too difficult or interfere with anything else we are doing with rstanarm.

You could start by putting something like this into the transformed data block:

int<lower=0,upper=1> has_lower;
int<lower=0,upper=1> has_upper;

And you can temporarily set those to 0 or 1 for testing (which unfortunately requires a recompile). Then in the parameters block, include

real<lower=0,upper=1> lower[has_lower];
real<lower=maybe_lower(has_lower, lower), upper=1> upper[has_upper];

where maybe_lower is a function in the functions block like

real maybe_lower(int has_lower, real[] lower) {
  if (has_lower == 1) return lower[1];
  else return 0;
}

Then make a function in the functions block that evaluates the log-likelihood, both for the whole sample and pointwise, when those bounds exist. Once you get the Stan part of it going, we can figure out the best way to pass has_lower and has_upper from R to the data block rather than defining them in the transformed data block.
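
For the pointwise part, something like this (again hypothetical names, and I'm passing the bounds directly for simplicity rather than via has_lower / has_upper):

vector pw_bern_squeezed(int[] y, vector eta, real lb, real ub) {
  vector[rows(eta)] ll;
  for (n in 1:rows(eta)) {
    real p = lb + (ub - lb) * inv_logit(eta[n]); // squeezed success probability
    ll[n] = bernoulli_lpmf(y[n] | p);
  }
  return ll;
}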

tomwallis commented 8 years ago

Great, thanks for the pointers on implementation. However, given that my priority right now is to model my data (and brms appears to let me do that), I'm not sure when I will be able to spend time contributing the above to rstanarm. Thanks for the discussion, and I will monitor developments.