rBatt / trawl

Analysis of scientific trawl surveys of bottom-dwelling marine organisms
1 stars 0 forks source link

Omega loves 1? #101

Closed rBatt closed 8 years ago

rBatt commented 8 years ago

For some reason the posteriors of Omega seem to be rather loyal to 1. Even if I have 30 species, and 100 "zero" species. I had an Omega for each year with a beta(1.5, 1.5) prior for them all. I'm going to try only having 1 Omega, and see if that helps. I don't think I really expect it to change, anyway.

I'm only worried about this because it seems like a red flag, not because the model isn't giving "good" results. I.e., I'm unsure if it's generally bad model structure, or a bug.

mtingley commented 8 years ago

Yup. Definitely a red flag. Double check code and how w is being used.

On Saturday, November 28, 2015, Ryan Batt notifications@github.com wrote:

For some reason the posteriors of Omega seem to be rather loyal to 1. Even if I have 30 species, and 100 "zero" species. I had an Omega for each year with a beta(1.5, 1.5) prior for them all. I'm going to try only having 1 Omega, and see if that helps. I don't think I really expect it to change, anyway.

I'm only worried about this because it seems like a red flag, not because the model isn't giving "good" results. I.e., I'm unsure if it's generally bad model structure, or a bug.

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101.

rBatt commented 8 years ago

There isn't a w or a Z. It's because Stan can't have discrete parameters. So you have to 'marginalize' over them ... basically do the algebra that makes them drop out. Kind of a pain, to be honest.

rBatt commented 8 years ago

I'm pretty stumped on this one.

There's here: https://github.com/rBatt/trawl/blob/stan/trawlDiversity/inst/stan/msomStatic.stan#L40-L48, which is a function called here: https://github.com/rBatt/trawl/blob/stan/trawlDiversity/inst/stan/msomStatic.stan#L215-L222

And then there's here: https://github.com/rBatt/trawl/blob/stan/trawlDiversity/inst/stan/msomStatic.stan#L161

I'm modelling much of this off of a much simpler occupancy model. If I change both of the bernoulli values in that function to 0's, Omega becomes the fraction of species that are not "zero" species. So I know it's responding to both sets of Bernoulli calls related to Omega. However, for some reason the model only seems to be sensitive to the lp_available bernoulli in that function.

Even if I only have 20 real species, and pad with 500 0's, it still spots Omega at 1.00.

I played around with using different priors, too (check commits that I haven't pushed yet).

This is driving me nuts. Something this wrong has to be pretty simple.

rBatt commented 8 years ago

The problem may not be with Omega. Apparently the model is estimating Psi as being extremely low. Given that Psi is always very near 0, it could be that the model finds it very likely that all of these species are real, just very rare.

Oddly, the detection parameter seems to be more reasonable.

Below is an example histogram of Psi and Theta. I've seen a variety of posterior distributions for Theta, but the histogram for Psi almost always looks like this. Note also that I took the mean of the chain iterations, and these are the histograms pooling all the site-year-spp mean(chain).

psi_theta_problem

rBatt commented 8 years ago

I should also point out that the above histograms aren't actually on the logit scale. I used plogis() on those, but I just forgot to change the object name.

mtingley commented 8 years ago

Is this then evidence of indentifiability problems?

On Sun, Nov 29, 2015 at 11:49 AM, Ryan Batt notifications@github.com wrote:

The problem may not be with Omega. Apparently the model is estimating Psi as being extremely low. Given that Psi is always very near 0, it could be that the model finds it very likely that all of these species are real, just very rare.

Oddly, the detection parameter seems to be more reasonable.

Below is an example histogram of Psi and Theta. I've seen a variety of posterior distributions for Theta, but the histogram for Psi almost always looks like this. Note also that I took the mean of the chain iterations, and these are the histograms pooling all the site-year-spp mean(chain).

[image: psi_theta_problem] https://cloud.githubusercontent.com/assets/5630847/11458483/14e6eb16-968f-11e5-9d3b-ccf71107828f.png

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160431546.

rBatt commented 8 years ago

I'm not sure. I'm not doing the per-K estimate of Psi like before, though.

On Sun, Nov 29, 2015 at 12:07 PM, Morgan Tingley notifications@github.com wrote:

Is this then evidence of indentifiability problems?

On Sun, Nov 29, 2015 at 11:49 AM, Ryan Batt notifications@github.com wrote:

The problem may not be with Omega. Apparently the model is estimating Psi as being extremely low. Given that Psi is always very near 0, it could be that the model finds it very likely that all of these species are real, just very rare.

Oddly, the detection parameter seems to be more reasonable.

Below is an example histogram of Psi and Theta. I've seen a variety of posterior distributions for Theta, but the histogram for Psi almost always looks like this. Note also that I took the mean of the chain iterations, and these are the histograms pooling all the site-year-spp mean(chain).

[image: psi_theta_problem] < https://cloud.githubusercontent.com/assets/5630847/11458483/14e6eb16-968f-11e5-9d3b-ccf71107828f.png

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160431546.

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160432402.

rBatt commented 8 years ago

I'm playing around with much smaller data sets (5 species [2 0's], 3 sites, 2 years), and I'm getting better psi histograms. Reasonable, even. Omega around 0.9. So kinda believable. And Psi is much lower for the 0 species.

I'm working through what the might mean. It's tempting to say that the model itself might be fine (programmatically, at least), but that the model structure is not very good (statistical problem). Or it could mean that it's not being fit well (not enough iterations, or need to adjust other fitting parameters for Stan).

On Sun, Nov 29, 2015 at 12:14 PM, Ryan Batt battrd@gmail.com wrote:

I'm not sure. I'm not doing the per-K estimate of Psi like before, though.

On Sun, Nov 29, 2015 at 12:07 PM, Morgan Tingley <notifications@github.com

wrote:

Is this then evidence of indentifiability problems?

On Sun, Nov 29, 2015 at 11:49 AM, Ryan Batt notifications@github.com wrote:

The problem may not be with Omega. Apparently the model is estimating Psi as being extremely low. Given that Psi is always very near 0, it could be that the model finds it very likely that all of these species are real, just very rare.

Oddly, the detection parameter seems to be more reasonable.

Below is an example histogram of Psi and Theta. I've seen a variety of posterior distributions for Theta, but the histogram for Psi almost always looks like this. Note also that I took the mean of the chain iterations, and these are the histograms pooling all the site-year-spp mean(chain).

[image: psi_theta_problem] < https://cloud.githubusercontent.com/assets/5630847/11458483/14e6eb16-968f-11e5-9d3b-ccf71107828f.png

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160431546.

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160432402.

rBatt commented 8 years ago

OK, 30 sites for 2 years; 30 species, 50 "zero" species. Changed the Stan model to loop through some likelihoods iteratively, instead of using functions that I thought vectorized over the loop (looping through species, btw).

Now I get reasonable response curves for Psi and Theta (responding to temperature and day of year, respectively).

However, Omega is still hugging 1. I'm going to un-do some of the Omega optimization I tried (multiplying out by N and J a 1 ~ bernoulli(1, Omega) term; instead, I'll just do the incrementation down in the same loops where I changed psi and theta).

See below for updated histograms of Psi and Theta (across all species, years, sites; averaged among chain iterations [read: histogram and posterior mean]). psi_theta_problem

Below you can see the response curves for Psi (probability of being present) and Theta (probability of being detected) across bottom temperature and day of year (scaled); the blue lines are density plots of the X axis, thus indicating the distribution of observed environmental conditions that are used to predict Psi and Theta.

psi_theta_responsecurves

Note: Both figures include the 50 "zero" species, as well as the 30 "real" species.

rBatt commented 8 years ago

I'm pretty sure that I have this resolved. But I'm going to wait for 1 more analysis to finish before I push up the commits that fix it.

rBatt commented 8 years ago

Here's some of the figure results for a random sample of 10 strata, 5 years, and 25 species in those strata-years. Padded with 25 zeroes.

omega_full

psi_theta_problem_full

psi_theta_responsecurves_full

rBatt commented 8 years ago

In that first graph above, the blue line is the prior on Omega, and in black is the posterior.

mtingley commented 8 years ago

That's actually a pretty decent posterior on Omega. No, @rbatt?

On Mon, Nov 30, 2015 at 8:05 PM, Ryan Batt notifications@github.com wrote:

In that first graph above, the blue line is the prior on Omega.

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160813692.

rBatt commented 8 years ago

Yeah, looks reasonable. But I'm sad because the code is slow again. I can't figure out how to vectorize certain parts of the code. I'm going to post an update on the Google group and hope for the best. But I think I've used up most of my requests for help :disappointed:

mtingley commented 8 years ago

Slow = slow, or slow = impossibly slow?

On Mon, Nov 30, 2015 at 8:19 PM, Ryan Batt notifications@github.com wrote:

Yeah, looks reasonable. But I'm sad because the code is slow again. I can't figure out how to vectorize certain parts of the code. I'm going to post an update on the Google group and hope for the best. But I think I've used up most of my requests for help [image: :disappointed:]

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160816132.

rBatt commented 8 years ago

35 minutes for 100 species, 5 years, 30 sites. If I do 800 species (including 0's), 90 sites (there's really 99) and 30 years (there's 32), that multiplies out to 85.5 hours or so. (I did 100 chain iterations, but it's actually mixing and converging for the most part).

Do you think I should go back to JAGS? I'm considering it. The big question is how long it would take the model to converge in JAGS. Or if it would converge.

mtingley commented 8 years ago

85.5 hours is totally fine. I have an inefficient JAGS model that currently is running that will probably end up running for about 2 weeks total. If you can do it all in 85.5 hours, then you're doing great!!

On Mon, Nov 30, 2015 at 9:03 PM, Ryan Batt notifications@github.com wrote:

35 minutes for 100 species, 5 years, 30 sites. If I do 800 species (including 0's), 90 sites (there's really 99) and 30 years (there's 32), that multiplies out to 85.5 hours or so. (I did 100 chain iterations, but it's actually mixing and converging for the most part).

Do you think I should go back to JAGS? I'm considering it. The big question is how long it would take the model to converge in JAGS. Or if it would converge.

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160824565.

rBatt commented 8 years ago

I tried running it for the region last night, and it looked like it got stuck at 10%. I don't know why.

I think I need to verify that the slow model works, too.

On Mon, Nov 30, 2015 at 9:16 PM, Morgan Tingley notifications@github.com wrote:

85.5 hours is totally fine. I have an inefficient JAGS model that currently is running that will probably end up running for about 2 weeks total. If you can do it all in 85.5 hours, then you're doing great!!

On Mon, Nov 30, 2015 at 9:03 PM, Ryan Batt notifications@github.com wrote:

35 minutes for 100 species, 5 years, 30 sites. If I do 800 species (including 0's), 90 sites (there's really 99) and 30 years (there's 32), that multiplies out to 85.5 hours or so. (I did 100 chain iterations, but it's actually mixing and converging for the most part).

Do you think I should go back to JAGS? I'm considering it. The big question is how long it would take the model to converge in JAGS. Or if it would converge.

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160824565.

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/101#issuecomment-160826352.