stan-dev / example-models

Example models for Stan
http://mc-stan.org/
776 stars 478 forks source link

BPA Ch.13: Incorrect `occ_fs` #99

Open mikemeredith opened 7 years ago

mikemeredith commented 7 years ago

(I'm a Stan newbie, but I have played with occupancy models a lot, including the WinBUGS and JAGS code in BPA.)

These are occupancy studies, and occ_fs is the number of sites in the sample which are occupied. As such, it cannot be less than the number of sites observed to be occupied (occ_obs) or greater than the total number of sites in the sample (R). The example code allows these constraints to be violated; it's just (bad!) luck that this doesn't happen with the example data from BPA.

A key idea is the conditional probability of occupancy, psi_con, conditional on the data. So if the species is not detected, we know the site is in the yellow area in the diagram below. venn_bpa_ch13

Model site_occ.stan

The simple model has one value for psi and one for p, so all sites with no detections have the same value for psi_con, called psi_nd in the code below:

generated quantities {
  int<lower=occ_obs, upper=R> occ_fs;
  real psi_nd;            // prob present | not detected
  psi_nd = (psi * (1 - p)^T) / (psi * (1 - p)^T + (1 - psi));
  occ_fs = occ_obs + binomial_rng(R - occ_obs, psi_nd);
}

Model site_occ_cov.stan

Here each site has a different psi and p, and a good strategy is to calculate psi_con for each, which is in any case a result you may want to monitor.

generated quantities {
  int occ_fs;       // Number of occupied sites
  real psi_con[R];  // prob occupied | data
  int z[R];         // occupancy indicator, 0/1

  for (i in 1:R) {
    if (sum_y[i] == 0) {  // species not detected
      real      psi = inv_logit(logit_psi[i]);
      vector[T] q = inv_logit(-logit_p[i])';  // q = 1 - p
      real      qT = prod(q[]);
      psi_con[i] = (psi * qT) / (psi * qT + (1 - psi));
      z[i] = bernoulli_rng(psi_con[i]);
    } else {             // species detected at least once
      psi_con[i] = 1;
      z[i] = 1;
    }
  }
  occ_fs = sum(z);
}

Model bluebug.stan

The same strategy can be applied here, this time allowing for the variable number of visits to each site. So we replace

vector[T] q = inv_logit(-logit_p[i])';

with

vector[last[i]] q = inv_logit(-logit_p[i, 1:last[i]])';

Other models

I haven't looked closely at the dynamic models, but it appears that the reported z and n_occ values are also based on unconditional psi values. Of more concern are the closed-capture models in Ch.6, where the analogous value, the population size N, appears to have the same problem; this has a smaller numerical impact, as the denominator of the conditional inclusion is always large (ie, close to 1), but should be fixed.

bob-carpenter commented 7 years ago

@ito4303 Would you mind looking at this?

@mikemeredith Thanks for the carefully constructed issue. Is this a problem with the BUGS code in the original book or just an issue with the translation?

I have a complete replication of Dorazio and Royle's occupancy model here:

http://mc-stan.org/documentation/case-studies/dorazio-royle-occupancy.html

I hope that's a faithful translation of their cited paper, but if not, please let me know, as I understand that model better than the BPA models. In the Dorazio and Royle approach, the result gets bounded between observed and max (pseudo-sites) because you start with the occupied ones (they're observed) then add in expected occupance of other ones, which has to be between 0 and 1. So we don't need to declare constraints to enforce the constraint.

mikemeredith commented 7 years ago

Exceeding the constraints is a symptom, not the problem; enforcing constraints isn't the solution but is a check. The problem is that, in the call to binomial_rng, N is too big and theta is too small. That will often give a plausible result, but is still wrong.

In the BUGS code, they use a latent binary variable, z[i], for occupancy for each site, and then occ_fs = sum(z). They don't need to do the kind of posterior predictive thing. So it's an issue with the translation.

If I were really using Stan (instead of translating), I'd get the conditional (on the data) probability of occupancy for each site and sum those to get an expectation for occ_fs.

I'll look at the Dorazio et al stuff later, but what you describe is I think the right way to do it and what's needed for the BPA Ch.13 examples and the closed-capture examples in Ch.6.

ito4303 commented 7 years ago

@mikemeredith I greatly thank you for pointing out the problems. As you pointed out, they are incorrect translations.

I will contribute corrected models as soon as possible. Or, perhaps you can create pull requests with the corrected code.

mikemeredith commented 7 years ago

@ito4303 I'll do pull requests for the models I've looked at so far. Give me a day or 2.

bob-carpenter commented 7 years ago

Great! Thanks.

On Mar 17, 2017, at 10:50 PM, mikemeredith notifications@github.com wrote:

@ito4303 I'll do pull requests for the models I've looked at so far. Give me a day or 2.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

mikemeredith commented 7 years ago

@bob-carpenter There are issues with the write-up of the Dorazio & Royle analysis, but I think the math is all ok. Where would be a good forum to discuss this?

bob-carpenter commented 7 years ago

An issue in this repo---the source is in example-models/knitr/dorazio-royle-occupancy/

mikemeredith commented 7 years ago

@bob-carpenter I've done a pull request.