rBatt / trawl

Analysis of scientific trawl surveys of bottom-dwelling marine organisms

Implement dynamic MSOM #83

Open rBatt opened 8 years ago

rBatt commented 8 years ago

Suggested by Tingley as a good idea.

rBatt commented 8 years ago

Also, related to #27

rBatt commented 8 years ago

Sent @mtingley an email to see if he had suggestions (code, papers, models, etc.). I've seen some, e.g. I think Dorazio et al. 2010 in Ecology is an example of this. However, that model doesn't have a spatial component the way our data set does (that paper has k replicates, which are different transects, but it doesn't have the j sites [strata]).

rBatt commented 8 years ago

@mtingley @mpinsky https://groups.google.com/forum/#!topic/stan-users/PmGMnqXRD28

I brought up that topic because I wanted to make sure I could do a normal MAR model correctly (basically the phi*psi piece) before I went to a fully dynamic occupancy model.

But dang, these models are tough! Morgan, is it weird to do Psi_t = phi*Psi_t-1 + gamma*(-1 * Psi_t-1) + alpha*U? Note Psi is on logit scale, and alpha*U are all occupancy covariates and the species-specific coefficients I'm using in the non-dynamic MSOM. It just seems weird to me that you can get both the phi and the gamma terms at the same time.

These are nice ppl in this forum, from a variety of backgrounds. Feel free to post in there if you think you have something to add.

Also, Morgan, that formulation I'm using up there is based off some R/JAGS code you sent me a while back. I can dig up that email and paste it in here for reference if you want to see where I'm getting that from.

mtingley commented 8 years ago

If I'm thinking of things correctly, then no, that's not right. Psi should not be in there at all, it should all be Z. e.g.:

Psi[t] = Phi * Z[t-1] + gamma * (1 - Z[t-1]) + anything else

Z[t] ~ dbern(Psi[t])

rBatt commented 8 years ago

When the species is known to exist,

Z ~ bern(psi)

In these models, Z is completely marginalized out, because Stan can't do integer parameters.

In this formulation, you get pieces of phi and gamma each time step. How much of each depends on Psi.
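To spell out what I mean by marginalizing (a minimal R sketch of the likelihood bookkeeping, not the actual trawl model; `psi`, `theta`, and `y` here are just illustrative): instead of sampling Z, you sum the likelihood over both possible states of Z.

```r
# Marginal likelihood of one site's detection history, with Z integrated out.
# psi = occupancy probability, theta = per-visit detection probability,
# y   = vector of 0/1 detections across the visits at the site.
marg_lik <- function(y, psi, theta) {
  # Data probability if the site is occupied (Z = 1):
  lik_occ <- psi * prod(theta^y * (1 - theta)^(1 - y))
  # Data probability if the site is unoccupied (Z = 0): only possible when all y are 0
  lik_unocc <- (1 - psi) * as.numeric(all(y == 0))
  lik_occ + lik_unocc
}

marg_lik(c(0, 1, 0), psi = 0.6, theta = 0.3)  # detected at least once, so only the Z = 1 term counts
marg_lik(c(0, 0, 0), psi = 0.6, theta = 0.3)  # all zeros: both states contribute
```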


rBatt commented 8 years ago

Also, I had bad algebra there anyway: https://groups.google.com/d/msg/stan-users/PmGMnqXRD28/7oVePe8ZAgAJ

Psi_t = phi * Psi_t-1 + gamma * (-1 * Psi_t-1) + .....

Yeah, that's wrong. I guess it needs to be: Psi_t = logit(phi * inv_logit(Psi_t-1) + gamma * (1 - inv_logit(Psi_t-1)) + ...). Or maybe I convert all of the Psi to the 0-1 scale first. Either way, sorry for the bad algebra.
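For my own sanity, here's the bookkeeping in plain R (toy values, not the fitted model), keeping everything on the 0-1 scale and only converting to the logit scale at the end:

```r
# One step of the dynamic occupancy update on the probability (0-1) scale.
# phi = persistence probability, gamma = colonization probability.
psi_prev <- 0.4                                    # occupancy probability at t-1
phi      <- 0.8
gamma    <- 0.1

psi_t <- phi * psi_prev + gamma * (1 - psi_prev)   # expected occupancy at t (0.38)

# If the rest of the linear predictor lives on the logit scale, convert afterwards:
logit_psi_t <- qlogis(psi_t)
```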

One guy in this group seems to think this type of model stands little chance of being fit well :(

We'll see though.

mtingley commented 8 years ago

Phi and gamma should be probabilities just like psi. If so, why inv-logit psi?

And yes, definitely don't multiply psi by -1!

Dynamic multi-spp models fit just fine. I'm just not sure if they fit well when you marginalize everything? (But I guess you have to? Still not sure how Stan works)

rBatt commented 8 years ago

Stan isn't doing anything magic. You just avoid defining Z by representing all the probabilities behind it. So, even when Z is part of a longer expression, you have to re-express everything that went into Z, basically.

I wasn't thinking of phi and gamma as probabilities. Maybe gamma. But not phi.

That said, one would expect phi to take on a value between 0 and 1. I was thinking of phi as a first-order autocorrelation coefficient. It would be negative if it tends to bounce above and below some mean value (you first subtract out the mean of the time series [the time series of Psi] so that multiplying by a negative coefficient causes it to flip sign, but when you add the mean back in at the end, it's just bouncing above and below the mean, not 0).

Assuming occupancy probability doesn't "bounce", phi would be > 0. If phi > 1, occupancy grows over time (unless other terms tend to 'drain' the occupancy probability). If, in fact, it's growing, then the time series isn't stationary, and there's no equilibrium solution. But what we do is model out the nonstationarity by, e.g., adding in "year" as a linear continuous covariate. So we can take away the long-term trend, and then that phi > 1 should drop back to 0 < phi < 1.

So yeah, 0 < phi < 1, probably. But I haven't been thinking of it as a probability, at least not in the past week or so.

Hmm. I wonder how much it matters/ if the two conceptualizations are really that different.


rBatt commented 8 years ago

Currently have a dynamic msom that is a multispecies + covariate + spatial heterogeneity extension of this: http://mbjoseph.github.io/2014/11/14/stan_occ.html

Of that flavor. Running it on some real data right now.

I'd have to alter the simOcc package more to get a simulation that was suitable for verifying this model. I took a look at it earlier tonight, and it will definitely require some tuning.

rBatt commented 8 years ago

@mpinsky and @mtingley and @bselden and @JWMorley --- what do you think of the following quick summary figures from a test run of the dynamic model for 100 EBS species:

psi_theta_responsecurves_full

psi_theta_problem_full

I still don't have a solid simulation for verifying the new form of the model. But here's my impression of those figures:

  1. the model ran, and it fit things
  2. I'm not impressed by the flat-line for presence probability across bottom temperatures
  3. in this model, detection was theta ~ year+doy, where year is a categorical predictor, and doy is continuous (modelled as a random variable that's aggregated among the hauls in a site-year). There appears to be substantial variation in the doy fit, so I am adding in a doy^2 term.
  4. from the bottom two panels of the histograms we have something new: phi is "persistence", gamma is "colonization". Both have pretty moderate values, and not much spread (maybe I need a broader prior? I had it a little constrained).

Also, here are some more figures. Here's alpha. Alpha are the coefficients for presence probability. The formula for presence is psi ~ btemp + btemp^2. So there are 3 coefficients: the intercept and the two bottom temperature coeffs. Here are boxplots of the posteriors of the species-specific values for those coefficients (variability among species is modelled hierarchically, but ultimately each species still gets its own parameter value):

alpha_est_boxplots

Those look reasonable. Note that these are real species, I just haven't bothered to change the integer IDs to actual species names.
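(If it helps to see the hierarchical piece spelled out, here's a toy R sketch with invented hyper-parameter values, not the actual estimates: each species gets its own coefficient, but the coefficients come from a shared community-level distribution.)

```r
# Toy version of the hierarchical species-specific coefficients:
# each species' bottom-temperature coefficient is drawn from a
# community-level distribution (values invented for illustration).
set.seed(1)
n_spp       <- 100
mu_alpha    <- 0.5    # community-mean effect of btemp
sigma_alpha <- 0.3    # among-species spread

alpha_btemp <- rnorm(n_spp, mean = mu_alpha, sd = sigma_alpha)  # one coefficient per species

# Each species then has its own presence response to bottom temperature:
btemp <- seq(-2, 10, by = 0.5)
logit_psi_sp1 <- -1 + alpha_btemp[1] * btemp   # intercept of -1 also invented
```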

Here's a similar set of boxplots, except for Beta. The betas are like alpha, except for detectability. As mentioned before, the detection probability is theta ~ year + doy. So there are 31 parameters (30 years, so an intercept, 29 year-specific deviations from that intercept, and then the effect of doy). Let's just show the intercept, a deviation (the year 6 value, I believe), and then doy:

beta_est_boxplots

Finally, we can look at boxplots for phi (persistence) and gamma (colonization). They look uglier, but that could in part be because they're bounded at 0 or 1, and many of the values hug these extremes. Note that the earlier histogram was focused on the among-species distribution of the means of each species's posterior. This shows the posterior for each species:

phi_est_boxplots_bluedottrue

mtingley commented 8 years ago

First impressions, the model isn't fitting well for the majority of species. You're having identifiability problems where psi is going to 1 as p (Theta?) goes to 0. That's a problem...

rBatt commented 8 years ago

Yeah, phi and gamma don't seem very meaningful. The remaining question is whether their fitting badly has a negative effect on Psi and Theta.

rBatt commented 8 years ago

But I tend to agree, I don't like it. I don't really like the general concept of splitting "persistence" and "colonization" into separate processes either. I'm sure it makes a lot of sense in some instances, but in the model the way I have it, I don't think it does. And I'm not sure how to improve it either.

I'm trying to figure out if it's better to work on this model, or add more covariates to the simpler model. I'm leaning towards simpler.

mtingley commented 8 years ago

So, I don't think anything is particularly wrong with your Phi/Gamma set-up per se. When you have identifiability problems with p and psi, it's usually indicative of something else.

What's odd in the multi-species set-up is that you're not having this problem for all species, but only for some.

Looking at your plots again, what seems to be happening is that it is over-fitting the relationship between theta and day of year. You're getting theta_resp curves that go from 0 to 1 with a relatively sharp transition. By over-fitting detectability, and suggesting that there are times of year in which the species is entirely undetectable, you're making it impossible to actually know what psi is, which ends up leading to wacky estimates of psi (generally, if p is almost 0, then psi tends to become 1).
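To see why near-zero detection drags psi toward 1, here's a tiny R illustration (made-up numbers, not from the fit): for a site where the species is never detected, the data can't distinguish "absent" from "present but undetected" once p gets small.

```r
# Likelihood of an all-zero detection history across K visits,
# as a function of occupancy (psi) and detection (p).
all_zero_lik <- function(psi, p, K = 4) {
  psi * (1 - p)^K + (1 - psi)
}

all_zero_lik(psi = 0.2, p = 0.5)    # ~0.81: modest psi explains the zeros
all_zero_lik(psi = 0.99, p = 0.01)  # ~0.96: psi near 1 with p near 0 explains them even better
```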

So the question you should be asking yourself is: why is it over-fitting day of year? Is it a closure problem? Is it a sampling problem?

Also, although you didn't plot it, I'd be interested to see the posterior histograms of your hyper-parameters, since hyper-parameters are generally far more informative than species-specific results. What's the posterior on the relationship between detectability and day of year? I imagine it's wacky wacky.

rBatt commented 8 years ago

Very interesting comments Morgan. With the day of year response curves, the blue is a density of temperature observations. One of my initial reactions was that the detectability was changing most right on the fringe of where all the samplings are happening --- it doesn't have a lot of data past 2, yet detection allegedly changes a lot past there.

I also have a year-specific intercept for detectability. Although there isn't an interaction, it might be really hard to fill in the effect of day of year, because I'm sure you need several years to start getting that full spread. Furthermore, like I said, this was still a bit of a test run in the sense that I was using 20% of the species and 20% of the strata (but almost all of the years).

I would make those plots, but I don't have the data at my fingertips. I've been trying to get this model to run on our Linux workstation for the better part of the day. I'm about to give up ...

As I'm sure you know, the big problem here is that it takes so long to test any one of those ideas. For now I'll try dropping the day of year covariate.

Basically, the model I want to run right now is ~ the model I had previously, except all the years are together, and I've added a "year" categorical predictor for detection and a "yr" (year) continuous predictor for occurrence probability. The reasoning for the former is to deal with methods changes; the latter is to soak up some potential autocorrelation in the residuals.

So I've moved back away from the 'dynamic' model, at least for now, but I also dropped day of year.

mpinsky commented 8 years ago

Another thought: day of year might be collinear with location, if the survey has similar timing and tracks each year. The whole thing takes a couple months, as I remember. Makes sense to drop it.


rBatt commented 8 years ago

Yeah, timing is definitely related to location. But any given location has several timings, possibly months apart.

Of course, all the model sees right now is the mean DoY ± sd.

mtingley commented 8 years ago

I mean, the model as a whole is looking pretty good!

I think you're near the step where you run a full dynamic model on all data, but that you keep it relatively simple (1-2 detection covariates tops, no fancy random effects). Make it a model that is realistic enough that you could live with it but simple enough that it won't give you any big surprises.

My plan would be to run that model and see if you can get 1 paper out of it, then come back and refine the model to tell you something more complex about the dynamics.

rBatt commented 8 years ago

Run which one --- the one with or without phi+gamma?

rBatt commented 8 years ago

Right now I can't get stan to accept my full data set ... R crashing on both the workstation and on my home computer. Both have plenty of resources, but are running different operating systems. Argh!! I'm asking for help everywhere. Killin' me!

mtingley commented 8 years ago

Well..... I would say run the dynamic one (phi + gamma), but if you think you can get a paper out of just the stacked one, then go stacked. Depends on the questions... (and I forget where we stand on that).

As for not running on the full data set, that's very frustrating. Have you titrated it? As in, what % of the data have you worked it up to accepting? I would incrementally increase your data if you're having problems. If it can't take the whole dataset, give it as much of the 'best' data as you can and work with that.

rBatt commented 8 years ago

Alright, I'll give the dynamic one a whirl at the same time.

The question: is diversity stable during the past 4 decades, despite large shifts in environmental conditions? If richness is stable, to what extent does the community turn over, if at all, particularly during periods of warming or cooling bottom temperatures? If richness is not stable, how does it change (direction ±, variability, etc.), and what species are responsible for those changes?

It'd be good to see if changes in diversity happen to be associated with estimated thermal preferences. But more generally, I want to know if changes in richness are being driven by the same collection of organisms, or if there are new players coming in (or if old denizens are heading out and not coming back).

I need to work on articulating all of that.

The data thing is infuriating. It looks like it doesn't have a problem with the full data set, but I can't run chains in parallel. It's the parallel part that keeps breaking, I guess. I decided to run 1 with the full data set but just 1 chain; I'm scared because it still hasn't finished that first iteration, and as far as I can tell, the system isn't doing anything with it.

I'm about to give up on using Stan. We'll see if I get any feedback on the GitHub issue I opened with them. But even that issue doesn't address why nothing is happening with the single chain right now.

rBatt commented 8 years ago

@mtingley When you say "no fancy random effects" --- what exactly do you mean? I have a random effect in the form of species-specific coefficients. So if I have "year" as a categorical predictor in the detection level, are you saying I should drop the hierarchical component of that? That seems pretty extreme. Or maybe you meant to refer to the 'random variable' I have going on with the predictors, where the covariates are modelled as parameters with highly informative priors.
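(For anyone else reading: the "covariates as parameters" bit boils down to something like this toy R sketch, with invented numbers; in the real model this happens inside Stan, where the site-year DoY enters as a parameter with an informative prior centered on the observed mean ± sd.)

```r
# Toy illustration of treating a covariate as a random variable rather than a
# fixed number (numbers invented).
doy_mean <- 170    # observed mean day-of-year for one site-year
doy_sd   <- 12     # spread of the hauls' DoY within that site-year

# Instead of plugging in doy_mean directly, treat the "true" DoY as uncertain,
# with an informative prior around the observed summary:
doy_draws <- rnorm(1000, mean = doy_mean, sd = doy_sd)

# Each draw implies a slightly different detection probability:
beta_doy <- 0.05                          # invented coefficient
theta    <- plogis(-8 + beta_doy * doy_draws)
summary(theta)   # the covariate uncertainty propagates into theta
```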

rBatt commented 8 years ago

Out of desperation, I've also posted here: https://groups.google.com/forum/#!topic/stan-users/6tqw-aJSGS0

mtingley commented 8 years ago

Yeah, I think I got a bit confused.

That all sounds fine.

rBatt commented 8 years ago

I don't think the problem was Stan/JAGS. It's related to the parallelization. I'm going to just go back to analyzing each year separately, to see if I can resolve the issue that way.