rBatt / trawl

Analysis of scientific trawl surveys of bottom-dwelling marine organisms

is K a replicate? "Closure" issue? #100

Closed rBatt closed 8 years ago

rBatt commented 8 years ago

@mtingley #99

I don't understand what is meant by "closure" in Morgan's comment on that issue. Presumably it describes whether it is a closed system or not, and thus whether repeated visits to the same site could show both presence and absence simply because a critter temporarily left.

I'm unsure if this is the same topic that I'm grappling with in the trawl data. It could be the same point, but the terminology doesn't resonate with me.

For me, it comes down to the somewhat philosophical issue of what an "independent" sample is, and what a "replicate" is. At some level, the distinction always becomes blurry. Which is why I can see Morgan saying that there may be a point here, but it's not necessarily cause for concern.

Judging by Morgan's tenor, I think he's suggesting that I aggregate environmental covariates, and treat the different K as true replicates.

As for the issue that raises for fitting the parameters, I'm not too worried. Even without "true replicates", I think it is still reasonable to parse detection and true presence/absence. I run into this quite a bit with time series modelling; in that context, one can leverage the fact that process errors propagate to the next time step, whereas observation errors do not. Here, we could say that if we're wrong about a true absence, then a sample at a similar (but not exactly the same) temperature should also show an absence, unless the observed absence is due to a 0 on the detection.


A related point: I don't have any site-specific parameters at the moment. Not even hierarchically. I use covariates to make distinctions among sites. I wonder if I should alter that.

rBatt commented 8 years ago

Past conversation:

Morgan:

@rBatt, your efficiency seems to be improving, which is great. Just to let you know, now that you've moved from JAGS to STAN, my ability to help on coding model structure is a bit limited. If you want to talk about actual model structure (i.e., statistical structure, not code structure), just let me know and I'd be happy to set aside some time.

Ryan:

I was thinking actual model structure. Sitting down to code it up has raised some questions for me. I need to play around a bit more and get actual results from the current model before I move into trying to create the final model structure, though.

Ryan:

Also, @mtingley , if you're following that thread on the Google Group for Stan Users, you'll see that Maxwell Joseph has raised an important question. His question is very much along the lines of what I wanted to discuss with you, and likely @mpinsky would have some thoughts as well (others, too, I'm sure).

It comes down to this: if there are different tows within a 1º grid cell during a given year, and we see a species in 1 of those tows, do we think that the species is truly present for all the other tows, suggesting that detection is imperfect? Or do we think that the species is not actually present for all tows, because the true process varies among tows?

Morgan:

Well @rBatt this comes down to the fundamental question of 'closure' assumptions in Occupancy models. It's an assumption that has gotten a lot of criticism, and since the open model of Dail & Madsen was originally written for STAN, I'm not surprised that this question came up in that context. [Note: although tempting, I don't know anyone who has actually gotten the Dail & Madsen model to converge.]

So, yes, I think closure is generally a fair assumption in most systems, and even if it is violated, I don't subscribe to the school of thought that it's the end of the world. With highly migratory pelagic species perhaps it's a worrisome assumption, but given that we're using occurrence data on highly abundant species of which we are taking very small random samples (i.e. tows), I think closure is fair.

Ryan:

@mtingley Opening a new issue for this discussion

mtingley commented 8 years ago

So, I don't want you to worry about it, but at some point you (@rBatt) might want to do some reading on the closure issue, because it's one of the major criticisms of occupancy models. Briefly: the closure assumption is the assumption in occupancy models that separates primary sampling periods (i.e., seasons or years) from secondary sampling periods. True occupancy state at a site is allowed to change across primary sampling periods, but within secondary sampling periods, occupancy is assumed to be closed.

Thus, in a typical model, say you survey 3 times in June 2015 and then you come back and survey 3 times in June 2016. Year would be your primary sampling period, and the 3 visits would be your secondary sampling periods. If you had a capture history of {0 1 1} for June 2015, the closure assumption means that you assume the true occurrence state was 1 for all three visits, and that the 0 was the product of a false absence. The closure assumption is necessary for MacKenzie-style occupancy models. More complicated "open" models (e.g., Dail & Madsen, and some newer non-repeat models by Lele and Solymos) do not make the closure assumption and thus allow the true occurrence status to change, even between secondary sampling periods. My comment before was that these models are incredibly hard to fit.
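To make that concrete, here's a minimal R sketch (not our actual model code; the psi and p values are made up) of how closure enters a MacKenzie-style likelihood for one site within one primary period:

# log-likelihood of one detection history under the closure assumption
hist_lp <- function(y, psi, p) {
  # y: 0/1 detections across secondary periods; psi: occupancy; p: detection
  if (any(y == 1)) {
    # detected at least once: closure says the site is occupied for ALL
    # visits, so every 0 is a false absence
    log(psi) + sum(dbinom(y, size = 1, prob = p, log = TRUE))
  } else {
    # never detected: either occupied but missed every time, or unoccupied
    log(psi * prod(1 - p) + (1 - psi))
  }
}
hist_lp(c(0, 1, 1), psi = 0.6, p = 0.7)  # the {0 1 1} June 2015 history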

So what happens if closure is violated? In short, if you have temporary emigration that you treat as non-detection, you deflate your estimates of p and inflate your estimates of psi, although I think most of that comes through as increased uncertainty, so I'm not entirely convinced it results in inferential bias.

If you know that you have closure problems (closure is never guaranteed with non-sessile organisms), then having detectability covariates that vary for each secondary sampling period can help by absorbing some of the detectability bias that creeps in from closure violations.

As for me, I'm a bit confused by some things you said:

For me, it comes down to the somewhat philosophical issue of what an "independent" sample is, and what a "replicate" is. At some level, the distinction always becomes blurry.

So, yes, that's more or less true. Any 'single-season model' with 2ndary periods could be turned into a 'multi-season model' with no 2ndary periods because the 2ndary periods are all treated as primary periods (that's basically what the Dail-Madsen model is). But, while this is somewhat 'philosophical' it has really important consequences for your choice of model and your ease of modeling. As I said, the Dail-Madsen model is notoriously difficult to fit.

Judging by Morgan's tenor, I think he's suggesting that I aggregate environmental covariates, and treat the different K as true replicates.

Not sure what you mean.

As for the issue that raises for fitting the parameters, I'm not too worried. Even without "true replicates", I think it is still reasonable to parse detection and true presence/absence. I run into this quite a bit with time series modelling; in that context, one can leverage the fact that process errors propagate to the next time step, whereas observation errors do not. Here, we could say that if we're wrong about a true absence, then a sample at a similar (but not exactly the same) temperature should also show an absence, unless the observed absence is due to a 0 on the detection.

Again, not entirely sure what you're saying, but I think we're in agreement?

A related point: I don't have any site-specific parameters at the moment. Not even hierarchically. I use covariates to make distinctions among sites. I wonder if I should alter that.

No clue what the difference between a covariate and a site-specific parameter is in your context. Can you be more specific?

rBatt commented 8 years ago

OK, I get what you mean here.

I think what I've written right now does not assume closure, as Psi is defined as a per-replicate derived parameter. It's derived from the product of covariates that vary among samples-sites-years and parameters that are specific to each covariate, which also vary hierarchically among species.

Thus, estimates of presence are explicitly made on a per-sample basis. Yet, because we know that samples within a site-year will share similar values for the environmental covariates, they almost begin to act like replicate observations of the site-year. E.g., say we look back at that {1,1,0} example. Say that temperature is {9.6, 9.8, 12.2}. In that case, Psi might be {0.91, 0.89, 0.65}. In this case, we're no longer making a hard distinction between K being replicates or not. The extent to which any K (or any J for that matter) are replicates is contingent upon the algebra of the covariates and parameters for those particular samples --- when Psi ends up being very similar for any set of samples, then those samples implicitly serve as replicates for that species.

I.e., rather than asserting a binary distinction that some samples are similar and can serve as replicates, we allow "similarity" to be defined on a continuous scale by Psi, and make the notion of a "replicate" something that is derived and defined by the data and fitted parameters, not by our assertion.


Call U a matrix of covariates. It has K rows and nU columns. Then say Psi = U %*% alpha (in R syntax). The parameter matrix alpha is nU by N, where N is the number of species. For each of the nU row vectors of alpha, we get the N species-specific parameters by doing alpha[i, 1:N] ~ normal(muAlpha[i], sdAlpha[i]) for i in 1:nU. So alpha is a parameter that is species-specific, but not site-specific. U has measurements for each sample (samples imply sites, because they are nested). So the covariate is site-specific, but the parameters in alpha are not. This results in a Psi that is site- and species-specific.
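A rough R rendering of that structure (the dimensions and hyperparameter values are invented, and plogis stands in for whatever link the model actually uses):

# U: K samples by nU covariates; alpha: nU by N species, drawn hierarchically
K <- 5; nU <- 2; N <- 3                              # hypothetical sizes
U <- cbind(1, rnorm(K, mean = 10, sd = 2))           # intercept + temperature per sample
muAlpha <- c(1, -0.3); sdAlpha <- c(0.5, 0.1)        # community-level hyperparameters
alpha <- sapply(1:N, function(spp) rnorm(nU, muAlpha, sdAlpha))  # nU x N
Psi <- plogis(U %*% alpha)   # K x N: presence probability per sample and species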

mtingley commented 8 years ago

Incorrect, @rBatt. You definitely have a closed model (as far as I'm aware!). Because otherwise, how are you actually modeling detectability?

I just looked at the JAGS code (and maybe it's too old? correct me), but e.g. in the file msom.cov.txt you estimate psi[j,i] but p[j,k,i] which means that you are assuming psi (i.e. occurrence) is constant for each secondary sampling period, k. Whenever you have a k-loop nested within a j-loop, it's telling you that you have a closed model.

mtingley commented 8 years ago

OK, @rBatt just commented on the STAN listserve:

All this is to say that you could argue that the K are different (not really replicates, have different covariate values [both for detection and presence]), but the samples within a site are still more similar than among sites.

So, this is worrisome. Or, at least, it's not a traditional model, and you can't have this set-up within MacKenzie-type occupancy models.

What ‘flavor’ of occupancy model are you working off of? As in, who did you base your code off of?

If you have occurrence covariates vary at the same time scales as detectability covariates, then occupancy and detection become non-identifiable, which is why the issue of closure even exists.

rBatt commented 8 years ago

Good news is that we're now on the same page, but we're reading that page differently.

Metaphor aside, the current Stan model is different than the old JAGS model. That should clear some things up.

My current Stan model was made from scratch. It's a draft model. I designed it to contain some essential elements of the model structure, but not with the intent that it be final. It was constructed to be moderately ecologically informative, but mainly to be a good way to learn Stan and to serve as a programming template for a better ecological model. That said, I don't think its ecology is bad at all.

I hear what you're saying about identifiability. Let's refer to our toy example again.

X = {1, 1, 0}; // observation
U = {9.6, 9.8, 12.2}; // process covariate, call it temperature (ignoring intercept)
V = {120, 122, 123}; // detection covariate, call it day of year
// alpha is a parameter
// beta is a parameter
Psi = inv_logit(U*alpha); // probability present (logit link keeps Psi in [0,1])
Theta = inv_logit(V*beta); // probability detected
Z ~ Bernoulli(Psi); // true presences
X ~ Bernoulli(Z*Theta); // detections, likelihood

In that example, intuition would say that the detection covariate was similar for all 3 samples, but the process covariate (U) was different for the third. In X, we observed one outcome for the first 2 samples, and a different outcome for the 3rd. In this case, I'd say we got a 0 in that third sample because the critter wasn't actually there, not because we failed to detect it (the temperature was different for the 3rd sample, so Psi was different; the day of year was similar, so Theta was similar, and detection was similar).
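For concreteness, here's a runnable R version of this first scenario (the alpha and beta values are hypothetical, chosen only so the probabilities come out roughly in line with the example above):

U <- c(9.6, 9.8, 12.2)                  # temperature per sample
V <- c(120, 122, 123)                   # day of year per sample
alpha <- c(6, -0.45)                    # hypothetical intercept + slope for presence
beta <- c(3, -0.015)                    # hypothetical intercept + slope for detection
Psi <- plogis(alpha[1] + alpha[2] * U)  # ~ {0.84, 0.83, 0.62}: 3rd sample differs
Theta <- plogis(beta[1] + beta[2] * V)  # ~ {0.77, 0.76, 0.76}: all similar
set.seed(1)
Z <- rbinom(3, 1, Psi)                  # simulate true presence
X <- rbinom(3, 1, Z * Theta)            # simulate detections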

Let's alter the toy example.

X = {1, 1, 0}; // observation
U = {9.6, 9.8, 9.9}; // process covariate, call it temperature (ignoring intercept)
V = {120, 122, 215}; // detection covariate, call it day of year

In this case, the process covariate is essentially unchanged among the 3 samples, but the 3rd value of the detection covariate is pretty different. So I'd say that the critter was actually present all 3 times, but the 3rd time we just failed to detect it.


This comes down to collinearity among the process and observation covariates. If those two matrices of covariates are highly correlated, we'll have the identifiability problem you mentioned.

Obviously, even when the predictors don't covary, we still need the predictors to be good. Further, it may require more data.


A compromise approach (I just mentioned this to @bselden ):

I can go back to calling K true replicates. In that case, the predictor covariates in U are only defined to the year-site level, and don't vary among the K samples. But how to do this when the raw temperature data come to the scientist on a per-sample level? We could take the average temperature among the K samples within a site-year, and use that average as the predictor. This is what I did in the old JAGS model, which was closed.

An improvement would be to define that covariate matrix as a stochastic node; i.e., a parameter. Specifically, it'd be a parameter with a highly informative prior. We could give it a normal prior, with the mean calculated from the data (just the sample mean from the K samples in that site-year), and a standard deviation equal to the sample standard deviation (when K=1, we could fix that element as a constant, or give it a small or 0 sd). That way, a set of 3 temperature samples {4, 5, 6} is a very different covariate than {1, 5, 9}, even though both have a mean of 5.
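A quick check of that intuition in R:

mean(c(4, 5, 6)); sd(c(4, 5, 6))   # 5 and 1 -> informative prior ~ normal(5, 1)
mean(c(1, 5, 9)); sd(c(1, 5, 9))   # 5 and 4 -> much wider prior ~ normal(5, 4)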

mpinsky commented 8 years ago

Ryan and Morgan,

Glad to see you're working through the guts of the model. Morgan, thanks immensely for your input on model structure.

Ryan, we can talk in person if it would be helpful. In your current model, theta and psi are equivalent and hence not separately identifiable. You're only (potentially) distinguishing them based on the degree to which the covariates are informative, and the covariates are not guaranteed to be correct (you may not have a good covariate for psi or for theta, and it would be better to test that than to assume it).

My recommendation is to start really simple: closed model with no covariates. Then add covariates in slowly. Average temperature seems like a reasonable place to start for psi once you get there. Jim may also have some ideas based on the seasonal work he's been doing (e.g., winter temperature), but it is likely not feasible with the data you have.

Regards, Malin


rBatt commented 8 years ago

Yes, the identifiability of the parameters is contingent upon the predictors not being collinear, just as in a standard regression. The model can definitely work; it's just challenging to fit because of the potential for identifiability problems (but that's not strictly an issue with the model structure, as it depends on the species and covariate data). I was aware of this.

Morgan and I had a Skype call where I think I was able to clear up my thinking.

For the record, this model is simply a structural stepping stone as I learn Stan. I did something that I thought was ecologically reasonable, that would produce Stan code that was structured in a way that would give me an idea of how fast the analysis would run, and that would also be easily adapted to more complex/ ecologically sound models.

I needed to fill in some dimensions to make sure that my code was structured in a way that was flexible --- I wanted that extra layer of complexity so that I could get another dimension on those variables.

Currently, I'm no longer specifying the true process down to the sample level.


I was thinking about modelling everything as a binomial, summing observations among samples (hauls). To incorporate the sample-level covariates, I was thinking of modelling the covariates as normal(mean, sd) of the covariate among samples. This puts both detection and presence as binomials at the site level, which is what Bob Carpenter did in his Stan rendition of the Dorazio model. But I don't want to bicker about identifiability because I can't afford to do another suite of simulations again, so I probably won't try it.

rBatt commented 8 years ago

Want to clarify a few things:

1) With regard to "starting simple and slowly adding covariates", I am actually starting simple here. This model doesn't include a lot of the dynamic components, which means I don't have endogenous covariates (~ no autoregressive-ish terms). I realize simpler models exist, but I wanted to include the extra complexity just to confirm a working model structure (I needed a matrix, not a column vector; this is important in Stan). And from a programming perspective, I think I made the right choice, because it took me about 2 minutes to debug my first draft of the model (my first model ever written in Stan, and written from scratch). Because I don't plan on making any ecological inferences from the model, I'm not really concerned with how complex it is to interpret the parameters.

2) More on the ecological side, I kinda stumbled onto modelling the presence at the sample level. I just did it that way b/c it's how the data were structured, so I figured I'd give it a try (it was easier to do it at the same level). I'm only defending the approach as a matter of principle (it's not a wise choice, but it's not intrinsically flawed), not because I think it's the way to go.

3) My previous comment about "not wanting to bicker" should be read more as "not wanting to fiddle". Sorry. And I'm honestly a little confused about the model that Bob Carpenter coded up. It doesn't have any covariates, yet both presence and detection are modelled at the site level, as a binomial. I think what's going on is that the presence part of the LP (log probability) is only incremented once per site and is handled as a Bernoulli, whereas the detection piece is modelled as a binomial (with K samples) at the same time. I think I'm figuring it out, but my point is that I'm honestly still trying to figure it all out. The representation is just a little more complicated than what I'm used to b/c of the marginalization.
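For what it's worth, here is how I understand that marginalization, sketched in R rather than Stan (a schematic of the idea, not Bob Carpenter's actual code):

# marginal log probability for one site: presence handled once (Bernoulli),
# detection as a binomial over the K samples; the latent Z is summed out
log_sum_exp <- function(a, b) { m <- max(a, b); m + log(exp(a - m) + exp(b - m)) }
site_lp <- function(y, K, psi, theta) {
  # y: total detections across the K samples at this site
  if (y > 0) {
    # at least one detection: the site must be occupied
    log(psi) + dbinom(y, size = K, prob = theta, log = TRUE)
  } else {
    # y = 0: present-but-never-detected, or truly absent
    log_sum_exp(log(psi) + K * log(1 - theta), log(1 - psi))
  }
}
site_lp(y = 2, K = 3, psi = 0.8, theta = 0.5)  # e.g., 2 detections in 3 hauls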

rBatt commented 8 years ago

I've figured out 3. And in so doing, I realized I made a mistake in my most recent model. The correction for that mistake basically means I have to re-arrange a bunch of loops and functions in the model. At that point, I might as well reformulate the detection process as a binomial. I'll put the detection covariates in as parameters, which have a normal prior whose hyperparameters are just the mean and sd among samples within a site.

Worst case scenario is that this will be a substantial speed-up that avoids problems with huge K (97 hauls per site-year in some cases), but it makes the covariates not very useful. That isn't so bad, because my old model didn't even have detection covariates.

Best case scenario, we get all of those performance boosts listed above, and the covariates still work pretty well.

If you're wondering "what in the hell is he doing making the covariates parameters?", I'd like to point out that this is kinda just like saying you only have detection covariates at the site level, and doing a model where the predictor variables are prone to measurement error.

Why not just take the mean of the predictors? Well, say we have a species that really only shows up if the survey is sampled between DoY 95 and DoY 105. Say that in one year sampling occurs on {96, 100, 104}, and in another it occurs on {80, 100, 120}. In the first case, the predictor would be modeled as ~normal(100, 4), and in the second as ~normal(100, 20). The first distribution has 79% of its mass between 95 and 105, whereas the second has 20% of its mass between 95 and 105. So this method indicates that the first set is much more favorable than the second for detection (obviously, the exact %'s are off b/c the sets aren't normally distributed and there are only 3 samples). So it's not as good as knowing the truth, but it's a lot better than just using the mean.
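(Quick check of those numbers in R:)

pnorm(105, 100, 4) - pnorm(95, 100, 4)    # ~0.79 of the mass in [95, 105]
pnorm(105, 100, 20) - pnorm(95, 100, 20)  # ~0.20 of the mass in [95, 105]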

rBatt commented 8 years ago

Closing for now, because I've addressed this. The solution I came up with (kind of a compromise) was, as mentioned above, to handle sample-specific process covariates (the U in Psi = U%*%alpha) as though they were random variables at the site level.

So each site-year might have K samples. Instead of having K values for a covariate in U during that site-year, I have a U_mu with the mean of the covariate in the site-year, and a U_sd with the sample (statistical sense) sd. I believe this is what Bob Carpenter refers to as "sufficient statistics" (not the random-variable part, just the specifying by the sample statistics). Anyway, I then specify U ~ normal(U_mu, U_sd). That way I capture some of the sample-specific variability without the computational or inferential/conceptual costs of having the process modelled at the same level as the detection.
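In R terms, the per-site-year sufficient statistics would come out of something like this (a sketch; dat is a hypothetical data.frame with columns site, year, and temp):

# mean and sd of the covariate within each site-year
stats <- aggregate(temp ~ site + year, data = dat,
                   FUN = function(x) c(mu = mean(x), sd = sd(x)))
U_mu <- stats$temp[, "mu"]
U_sd <- stats$temp[, "sd"]
U_sd[is.na(U_sd)] <- 0   # K = 1: no spread, so treat as (nearly) constant
# then, in the model: U ~ normal(U_mu, U_sd), with U_sd = 0 handled as a constant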

Further, all my code for this is now extremely flexible, and it is no problem to switch a covariate from being a RV to a 'constant'. There's only 1 Stan model for this, even!