rBatt / trawl

Analysis of scientific trawl surveys of bottom-dwelling marine organisms
1 stars 0 forks source link

Poor structure of `psi` equation in MSOM?? #92

Closed rBatt closed 8 years ago

rBatt commented 8 years ago

I can't believe this just dawned on me, but the equations of the for psi look something like this:

logit(psi[i]) = a0[i] + a1[i]*X[j] + a2[i]*X[j]^2

As a result, if the data set spans X enough to actually show "hotspots" of species occurrence (i.e., you can see the peak [or, if it's odd, the valley] of each parabola), this equation is going to suck at identifying species occupancy probabilities, since the model will force each species will reach its extremum (minimum for concave distributions, maximum for convex distributions) at the same value of X (X=0)!

For an example, see this code

t.func <- function(b0, b3, b4){x <- seq(-15,15, length.out=50); cbind(x=x, psi=plogis(b0+b3*x+b4*x^2))}

# number of species
ns <- 50

# parent means
mu.u.a0 <- logit(0.50)
mua3 <- 0
mua4 <- 0

# precisions (all species share a precision)
# I'm additionally assuming all of these parameters have the same
# precision, but that might not be true
# (this constraint does not exist in the msom)
tau <- (1/30) #0.01
tau.u.a0 <- 1/tau^2 #rgamma(1, 0.1, 0.1) #~ dgamma(0.1,0.1)
tau.a3 <- 1/tau^2 #rgamma(1, 0.1, 0.1) #~ dgamma(0.1,0.1)
tau.a4 <- 1/tau^2 #rgamma(1, 0.1, 0.1) #~ dgamma(0.1,0.1)

# species-sp
u.a0 <- rnorm(ns, mu.u.a0, sqrt(1/tau.u.a0))
a3 <- rnorm(ns, mua3, sqrt(1/tau.a3)) #~ dnorm(0, 0.001)
a4 <- rnorm(ns, mua4, sqrt(1/tau.a4)) #~ dnorm(0, 0.001)

plot(t.func(u.a0[1], a3[1], a4[1]), ylim=0:1, type="l")
for(i in 2:ns){
    lines(t.func(u.a0[i], a3[i], a4[i]), ylim=0:1, type="l")
}

Which produces a plot like this: dumbpsi

To me, that sucks. I think it would make much more sense to have an equation of the following form:

logit(psi[i]) = a0[i] + a1[i]*(X[j]-h[i) + a2[i]*(X[j]-h[i)^2

where h[i] is a species-specific parameter drawn from a parent distribution (like the other parameters).

@mpinsky @mtingley -- do you guys think the current parameterization of the MSOM (see here for the code) is weird? Should species be able to have "hotspots"/ peaks at certain values of an environmental variable?

Sal confirms my parameterization

rBatt commented 8 years ago

I might be a little wrong about the inflexibility of the above – here is a plot that is part of the summary/ status update I'm working on, but which shows the response curves estimated by the MSOM:

download

Clearly the species have different peaks (don't worry too much about the details of that figure right now).

I'm probably missing something here.

mtingley commented 8 years ago

Indeed, you're thinking about it from a polynomial perspective, not a logit perspective. A basic logit (e.g., logit(x) = a1 + a2*x) is not defining a line, it is defining a logit. The basic parameters of such a logit are defining: (1) where the transition point from 0 to 1 occurs, and (2) how fast that transition happens. Basically, where on x does it go from 0 -> 1, or 1 -> 0, and how fast does this occur.

When you have a logit-linear equation that includes a squared term, the squared term functions within this logit-transformation to create a "U"-shaped curve, but the parameters are still defining where on X this occurs.

Not quite sure what's going on in your simulation (first post) to result in such a limited range, but something is definitely funky, as your "basic" forms should be ranging from [0,1] not [0, 0.5] or [0.5, 1].

rBatt commented 8 years ago

Here are the MSOM estimates of the parameters pertaining to the above figure:

data <- structure(c(-19.2116519078113, -28.872892346392, -26.0159524899793, 
-19.3117838594931, -29.4912184283168, -16.7783052246822, -4.93871006159867, 
-19.8245845109941, -30.2731108646262, 3.3029800253413, -12.4854188240645, 
-30.1914514351973, -7.48522678448362, -4.85146615953838, -17.6825465394896, 
4.41894513931318, 1.47162979810875, -22.9049082521204, -31.953399782684, 
-6.45548352856289, -19.9042390486116, -6.96292914125016, -26.0488426318203, 
-6.86653475903166, -26.8474958652574, -7.25518132271425, -22.3528765264873, 
-6.08321427939244, -20.3065478511734, -18.3313399768043, -29.7400369689561, 
-18.6809201689981, -24.2510496014565, -24.0309082861132, -7.01600969837805, 
-26.1535903023711, -0.953887789010343, -17.8387513573409, -7.34578456244306, 
-23.5354726029655, -19.7220823534181, -22.1987334716902, -21.4722340554737, 
-19.7881368476475, -34.3341915618182, -8.30662601455929, -19.9938199792551, 
-9.4358890234851, -7.00499680495028, -18.9050969105359, 0.681870481559163, 
-2.44586894164051, 2.70870344782961, -4.66484954113801, 3.37561152132748, 
-3.65373162918475, -4.18410243139302, 0.624862397878237, 4.47425254237667, 
0.134909445347055, -3.40213691130827, 3.18176887731477, 3.79292625081778, 
-4.26496203983222, -3.25947075250068, -0.58851945738959, 0.0425978742605297, 
5.67120794888968, 5.09032304991323, -3.22926509792139, 0.849479906133968, 
4.0258757406155, 5.17369440114801, 3.77906920769196, 2.24205016034212, 
-3.36874241755876, 2.2999776283939, 4.62974017742124, -5.06837777029504, 
0.938610020511172, 4.65985424650344, 1.00553958055763, 6.24621231724498, 
3.45354298334696, 4.2085404209801, 3.97064279116027, -0.0203182635284996, 
3.44611304744004, 4.19966345180755, -6.10636254703139, 0.510365219988168, 
4.03450866407214, 5.25711926027517, 5.57863040222819, 5.534084696996, 
3.50698127728034, 4.95357227912073, 3.40745586987446, 4.07112265265578, 
0.743891173995625, -0.371807303646559, 0.00830380441832725, -0.0763680589784552, 
-0.278517017753682, -0.0961540650017136, -0.219672890533875, 
-0.48444092246383, -0.370032694934955, -0.156202901929227, -0.536133566786421, 
-0.231224056963212, -0.0812174207119295, -0.432010619719181, 
-0.489676933046021, -0.14244966047017, -0.475334353089282, -0.490779542799467, 
-0.332220953763913, -0.194002376111587, -0.384950337865744, -0.366758804388014, 
-0.458894381679883, -0.248600748811724, -0.428109987043335, -0.0486729591845517, 
-0.309837098039391, -0.0683044081769, -0.516066341225602, -0.290661334001578, 
-0.374424491607807, -0.176499903797647, -0.377234841662434, -0.357769592975686, 
-0.132644900718499, -0.476176997583575, -0.150624063632671, -0.43461325087277, 
-0.16029735878313, -0.461944674733988, -0.344337378034507, -0.382135723365485, 
-0.181445607611325, -0.305068836235906, -0.334778444310216, -0.209274137705028, 
-0.411079783831052, -0.298431883733099, -0.411571813559587, -0.462939310081013, 
-0.3527422644788), .Dim = c(50L, 3L))

And here's a histogram of each parameter:

png("~/Desktop/aHist.png");par(mfrow=c(2,2));apply(data, 2, hist); dev.off()

ahist

I can sweep out some summary statistics to figure out how to do those simulations:

apply(data, 2, mean)
> [1] -16.8240395   1.5649733  -0.3073367
apply(data, 2, sd)
> [1] 9.9286906 3.4481760 0.1430678

And after using similar parameters in the simulation that I coded in my first post, I get response curves like this:

dumbpsi

OK, so definitely possible to get peaks in different places. But because it seems like a weird way to do it, and because that's not the type of response curves shown in Zipkin et al 2010 (see Figure 3 on page 4), I'm thinking that if "hotspots" are an important/ common feature, that maybe this isn't the best parameterization (the plogis transform warped my intuition about what this model structure could achieve). Also, notice that those histograms aren't normally distributed --- not sure, but that might be related to the data augmentation.

rBatt commented 8 years ago

@mtingley Yeah – good point. I was writing up my other comments before I saw your reply. You're right that I forgot I'm on logit (:p); well, I forgot that I was on logit when building intuition.

As you can see, mimicking the summary statistics of those histograms while doing the simulation helps (those are the histograms of the top-right panel of my colorful figure in post 2).

I think the main difference is to have the intercept parameter be plogis(a0) ~ 0.

Although it still strikes me as odd, perhaps it really isn't.

mtingley commented 8 years ago

Well, the Zipkin paper figure is showing detectability on the y-axis, not occupancy. Although both modeled with logit-transforms, occupancy is generally more likely to go between [0,1] on a given range of x than p. Of course, this all depends on what your covariates are. All sorts of forms are possible. I could show you some other examples from 'real' MSOMs that show response curves like those in Zipkin, or curves like those in your above plot, depending on the covariate being tested and whether it's psi or p (if you'd like to see those, let me know). But it seems like the main issue is how you're perceiving this "hotspot" issue and what that means. I think I'm a bit behind on that, so you might have to explain it to me in more detail (e.g., skype).

rBatt commented 8 years ago

@mtingley Fair point about the realism of that example; I only pointed to that Zipkin figure b/c they had a figure with response curves for p (which was parameterized as a quadratic, as you point out), but not for psi.

It sounds like you're saying "this parameterization is pretty flexible, and seems good to me". And after playing and thinking, I think you're right.

I initially called this into question b/c I've been having some trouble getting the MSOM to recover the psi parameters. But there are several potential reasons for this, and I should do more work before I start getting worried.

Thanks for the quick response. Hopefully some more interesting/ enjoyable things will follow soon.

rBatt commented 8 years ago
rBatt commented 8 years ago

@mtingley This issue is taken care of, but I just realized I haven't shown you any of the empirical results yet.

I don't know if you remember me telling you this, but we've already analyzed a lot of empirical data. There are known imperfections, but it's a solid first whack. We started this simulation to get an idea of how much we should trust ours results, and how to improve them.

Anyway, HERE you can find response curves from those model runs, as well as some figures for observed and estimated richness over time. If you're interested in knowing more, we could Skype and I could provide a bit of a rundown.

mtingley commented 8 years ago

@rBatt Thanks for that. I don't necessarily what these things mean, but I can check them a bit for normalcy. Take for instance, this plot https://github.com/rBatt/trawl/blob/master/Figures/Diversity/msomCov/shelf_tempResponse.png. I'm assuming that each line is a species, and presumably the colored lines are either focal species or group means. What's interesting to me is that each year appears to have a very specific mean parameter, and then all species-specific parameters are year-specific, and are derived from the year-specific model.

I'm assuming that this was done with a single MSOM for each year, right? To me, this illustrates why it'll make more sense (hopefully!) when you convert this to a dynamic MSOM, because it makes more sense for each species' relationship to be more similar to its relationship in other years, then it is to its relationship to other species in a given year. In this current model, each species in 1 year has minimal species-specific information on the temperature relationship (particularly if the realized environmental space is shifting a lot from year to year) [and is the red bar, the realized environmental space for a year?]. By combining all years into a single model, you get a broader set of realized environments, you get a lot more data per species, and you can actually do things like fix temperature-psi relationships across years (i.e. assuming that thermal niches are fixed) or allow them to unidirectionally shift (i.e., thermal niches are adapting) or just allow them to be random per year, but based around a species-specific mean parameter (i.e., there's some year-to-year flexibility or wiggle). But regardless, once doing that, you'd want to see (and expect to see) species-specific relationships pop out. Currently, the heavy concentration of so many species around the mean annual trend indicates that there's not enough species-specific data to define the relationship so the majority of species are just being pulled into the hyper-parameters.

rBatt commented 8 years ago

Very good guesses! You could have written the caption for that figure.

The colored lines are quantiles. Black lines individual species. Red bar is the realized environmental space (temperature) in that particular year/ region. Each year was fit separately.

Your point about the hyperparameters dominating is interesting. On one hand, I definitely agree that what you're seeing is likely due to species being pulled to the parent mean, although I suppose it's possible that they're just similar. However, there are 2 reasons why some of these species being pulled in might make more sense: 1) I think (but am not sure) that these lines include the augmented species (the extra 0's); 2) Most of these taxa are quite rare, which makes means combining years would help, but it's hard to know how much.

On Thu, Aug 20, 2015 at 1:59 PM, Morgan Tingley notifications@github.com wrote:

@rBatt Thanks for that. I don't necessarily what these things mean, but I can check them a bit for normalcy. Take for instance, this plot < https://github.com/rBatt/trawl/blob/master/Figures/Diversity/msomCov/shelf_tempResponse.png

. I'm assuming that each line is a species, and presumably the colored lines are either focal species or group means. What's interesting to me is that each year appears to have a very specific mean parameter, and then all species-specific parameters are year-specific, and are derived from the year-specific model.

I'm assuming that this was done with a single MSOM for each year, right? To me, this illustrates why it'll make more sense (hopefully!) when you convert this to a dynamic MSOM, because it makes more sense for each species' relationship to be more similar to its relationship in other years, then it is to its relationship to other species in a given year. In this current model, each species in 1 year has minimal species-specific information on the temperature relationship (particularly if the realized environmental space is shifting a lot from year to year) [and is the red bar, the realized environmental space for a year?]. By combining all years into a single model, you get a broader set of realized environments, you get a lot more data per species, and you can actually do things like fix temperature-psi relationships across years (i.e. assuming that thermal niches are fixed) or allow them to unidirectionally shift (i.e., thermal niches are adapting) or just allow them to be random per year, but based around a species-specific mean parameter (i.e., there's some year-to-year flexibility or wiggle). But regardless, once doing that, you'd want to see (and expect to see) species-specific relationships pop out. Currently, the heavy concentration of so many species around the mean annual trend indicates that there's not enough species-specific data to define the relationship so the majority of species are just being pulled into the hyper-parameters.

— Reply to this email directly or view it on GitHub https://github.com/rBatt/trawl/issues/92#issuecomment-133098916.

rBatt commented 8 years ago

Eqns 3 in this Ecological Modelling paper Are pretty handy for gaining intuition about how these curves should go.

In R, here are some handy function to calculate the optimum environmental value (psi.opt), the width of the niche (psi.tol), and the maximum probability of occupancy (psi.max):

    psi.opt <- function(b1,b2){-b1/(2*b2)}
    psi.tol <- function(b2){1/sqrt(-2*b2)}
    psi.max <- function(b0,b1,b2){1/(1+exp((b1^2)/(4*b2)-b0))}

@mtingley @mpinsky @JWMorley Some of you might find that handy at some point.