Covariate selection - Githubissues

kaitlyngaynor commented 1 year ago

Update on single species covariate exploration:

genet: we said water, termite, tree cover. none of them individually or in pairs improve upon the null model, but termite mound has the lowest AIC
civet: we said water, termite, tree cover. all three improve upon null model (but barely for termite). water does the best among all three individually or in pairs (distance to lake does better than distance to water)
honey badger: we said termite mound density. that model yields a singular hessian and does not improve upon the null. distance to water is by far the best fit and doesn't throw any warnings. I think it's reasonable to think that water matters for all four of these species, even if we didn't expect it especially for the honey badgers.
marsh mongoose: we said distance to water, which does not improve upon the null (only termite mound does)

to try:

genet: termite
civet: water
honey badger: water
marsh mongoose: water

When I run this as a Type 1 (no interactions model), it beats null by 1.7 AIC points. I feel like I've read (or you've told me) that delta 2 is enough to consider models sufficiently different. What do you think?

Originally posted by @klg-2016 in https://github.com/kaitlyngaynor/gorongosa-mesocarnivores/issues/136#issuecomment-1666065710

kaitlyngaynor commented 1 year ago

Interesting!

If none of the hypothesized covariates influence genet or marsh mongoose occupancy in the single species models, I wouldn't include them in the multispecies model? No need to overparameterize.

There isn't a hard-and-fast rule, but yes, generally we say that any models within 2 AIC of the top model have strong support.

klg-2016 commented 1 year ago

Thank you for keeping this more organized!

Another update is that running the above covariates for each spp as a Type 2 model (constant for all pairwise interactions) throws a warning and produces NaNs for marsh mongoose and marsh mongoose: civet intercepts.

klg-2016 commented 1 year ago

I can't articulate why I think it wouldn't be okay, but is it okay not to include covariates for some of the species?

klg-2016 commented 1 year ago

you're proposing

genet: ~1
civet: water
honey badger: water
marsh mongoose: ~1 ?

I think I'm getting stuck on what you would talk about in the discussion if this ends up as your top model.

kaitlyngaynor commented 1 year ago

Hmmm. Maybe our best approach is to just remove all occupancy covariates entirely. (It would be okay to only include covariates for some, though, in response to your question.) We can say in the methods that the inclusion of occupancy predictors did not greatly improve model fit, either in the single- or multi-species models, and resulted in an overparameterized model—so the focus of our analysis was instead on examining whether occupancy covaried among species (rather than as a function of environmental features)?

klg-2016 commented 1 year ago

Okay so this would be comparing null type 1 (no interactions), null type 2 (constant interactions, constant occ) and then the type 3 models we ran but without the occupancy covariates? (so ~1 for all single species and all species pairs except one per model). Is that what you're thinking?

I'm not saying good or bad, but that would lead to a fair amount of change in the paper as it stands right now because there's a lot of discussion of how different environmental covariates affect occupancy for these species (even though it's not a key point in the results, it's still there) that would no longer be relevant.

klg-2016 commented 1 year ago

If you're around to chat early next week, I'm wondering whether it makes sense to continue this conversation when we can talk about it live?

kaitlyngaynor commented 1 year ago

yep that was the thought! But you've spent more time with this than I have, and it's just an idea.

I think the only big change to the paper would be to remove a couple of paragraphs, and basically everything else could stay the same... I am about to actually dive into the text (haven't spent much time with it yet) and that will give me some more perspective. I agree that chatting would make the most sense.

klg-2016 commented 1 year ago

Okay, sounds good! Charlie (puppy) is telling me it's time to go out, so I'm going to take him for a walk and step away from my laptop for a bit. I'll be interested to hear your thoughts after you've read this draft and look forward to chatting on Tues

klg-2016 commented 1 year ago

how did you imagine lion presence fitting in to this new set of models? still as an occupancy covariate or also moving to detection? right now it's only included as a covariate affecting the interaction between species pairs, which, as we determined earlier, is only an occupancy thing. I think for the question we've presented (whether apex predator presence affects meso species interactions), it stays where it is, what do you think?

also initial checking for genets shows that including the covariates in the detection part of the model has an effect, I'll update on here once I've run for all 4 species

kaitlyngaynor commented 1 year ago

I think it stays where it is, yes!

klg-2016 commented 1 year ago

this is part of what we'll talk about tomorrow, but I want it spelled out to refer to:

genet: tree + termite
civet: water + termite
honey badger: termite
marsh mongoose: water

I also kept in detect.obscured and cover.ground.

kaitlyngaynor commented 1 year ago

is this what emerged as the top detection probability covariates? if so, makes a ton of sense...

klg-2016 commented 1 year ago

yes! exactly. for both civet and genet, using all three of the covariates we thought of a priori was close to the top, but this is the actual best AIC for all 4 spp

kaitlyngaynor commented 1 year ago

hooray! this makes me very happy, hopefully you as well, hah. feels like the light at the end of the tunnel

klg-2016 commented 1 year ago

it absolutely does! it's incredibly satisfying when relationships that we expect to be there actually are. I'm working on running the full type 1, 2, 3 models now

kaitlyngaynor commented 1 year ago

and does the direction of the relationship make sense? (positive association of genet and tree, negative association of marsh mongoose and distance-to-water, etc)

klg-2016 commented 1 year ago

These are the results from the top model (type 1, null occupancy, best covariates for detection + detect.obscured + cover.ground)

genet: positive with tree and termite, as expected
civet: negative with distance to water, as expected, negative with termite (surprising)
honey badger: negative with termite (surprising)
marsh mongoose: negative with distance to water, as expected

Am I reading that correctly? That's from the estimate and SE columns. What are z and P(>|z|), do you know? If termite is the only one yielding "weird" responses, am I interpreting that correctly?

kaitlyngaynor commented 1 year ago

Am I reading that correctly? That's from the estimate and SE columns.

Yes, you're reading that correctly.

What are z and P(>|z|), do you know?

The Z converts your estimate to a Z-score (standard normal distribution) so that you can calculate an associated p-value. So then P(>|z|) is the p-value, or the probability of getting a value as or more extreme if there were truly no effect.

If termite is the only one yielding "weird" responses, am I interpreting that correctly?

I'm pretty sure you are—it's the number of termite mounds in the vicinity of the cameras... not totally sure what to make of that. I just checked this handy file I found randomly on my computer and it doesn't seem to be highly correlated with other things that could be confounding it: metadata_correlations.csv

klg-2016 commented 1 year ago

Is the p like a normal stats p? So we want it <0.05 for significance? If so, then genet: termite is not significant

kaitlyngaynor commented 1 year ago

Normal stats p, but when using AIC you usually don't put too much stock in p-values. They're kind of different approaches. There's lots to be read on the subject, but section 5 of this recent paper is short and sweet: https://royalsocietypublishing.org/doi/10.1098/rsbl.2019.0174#d1e986

klg-2016 commented 1 year ago

Thanks for that link!

We chose the termite variable intentionally, so I think it's probably the one we want to stick with. I just looked through the correlation sheet you shared and the termite count variables are correlated with each other, but not necessarily with the termites.1km (percent of area in a given radius covered by termite mounds, created by Ryan Long and co), for example. do you think I should consider trying one of the percent termite variables?

kaitlyngaynor commented 1 year ago

I was also thinking about using Ryan's—but I actually trust my measurements on the ground more, at least at the 100m scale. But maybe a larger radius is more appropriate in terms of what influences their activity/density. Curious—what happens if you try the larger radii?

FYI here's another good paper on model selection in ecology: https://esajournals.onlinelibrary.wiley.com/doi/10.1002/ecy.3336 I think our goal is "inference."

More broadly, I've been scratching my head about hypothesis-driven model selection when the models actually indicate support in the opposite direction. Do we just drop the variable and treat it like a one-tailed test, or do we accept that there's some unknown explanation for the phenomenon (collinear variable, statistical oddities, etc)? This is a bigger question than we probably have room to explore here!

klg-2016 commented 1 year ago

I will test the larger radii before we meet tomorrow, and would appreciate talking about this more when we chat! I am also unsure what to do with a result in the opposite direction to what was expected.

I'll check out that paper - I'm about to head out for a run now.

sorry, clearly never pressed send, so everything but the bit about the run is still true

klg-2016 commented 1 year ago

putting this here for easier reference later:

I checked single species for termites_250m and termites_500m, the larger radius had the expected relationship for each species we landed on it for. so I ran it in the full type 1 model, and it "fixes" the termite relationship but changes the civet relationship with water.

klg-2016 commented 1 year ago

playing with the new termite layer:

civets: they stay negative with termite at 500m_count, 500m, 250m, and 1km_count (all similar AICs, though the counts do a bit better)
genet: tested termite at 500m_count, 500m and 250m, 1km_count, it no longer improves upon the null in the single species model
honey badger: also stays negative across the board (250m, 500m, 500m_count, 1km, 1km_count) (again, similar AICs, except 500m for whatever reason, but the counts do a bit better)

my thoughts:

we should use this new layer
termite no longer gets included in the genet part of the model
it stays in for civet and honey badger, and we discuss potential explanations for the result differing from expected
absolute best AIC for single species/single covariate (among those I checked) is the 1km_count for both honey badger and civet, so I'm leaning toward changing to that variable

What do you think?

klg-2016 commented 1 year ago

Just checked and the larger radii count variables improve upon the 100m count variable we were using, AIC-wise, though I did not check the 100m level with the new layer

kaitlyngaynor commented 1 year ago

Interesting! And a bit surprising, but I feel good about the quality of this layer, so we should go with this over the older one.

I think that is sound logic—though I'm inclined to choose the radius based on home range size rather than AIC. Maybe ping Erin and see if she can quickly tell you whether (ballpark) civet home range is closer to 0.78 km2 (500m radius), or 3.14 km2 (1km radius), or something else? If she can't, or if she doesn't get back to you soon, then proceeding with 1km seems fine.

kaitlyngaynor commented 1 year ago

And I like the idea of using count over density. Hallie somewhat roughly drew polygons around the mounds so I wouldn't put a ton of stock in their size.

klg-2016 commented 1 year ago

All sounds good! I just emailed her, I'll see what she says.

klg-2016 commented 1 year ago

Erin is still at GNP and has poor service tonight, but she thinks that the 1km radius is closer to civet home range. With that, I'm working toward what I think may be our final set of detection covariates.

genet: tree_hansen (termite no longer improves upon the null, the model with only tree just beats tree + termite)
civet: tree_hansen + water_dist + termites_1km_count (this model just beats termite+water)
honey badger: termites_1km_count
marsh mongoose: water_dist

I'm going to start working with the full models now

kaitlyngaynor commented 1 year ago

excellent

klg-2016 commented 1 year ago

I ran all the full models with these covariates, "null_occ_table_draft" from this spreadsheet is a summary of them and their AIC values.

The only one that doesn't run nicely is when I have lion as a covariate affecting the civet:honey badger interaction for occupancy. it throws NaNs for civet, genet, and civet: genet occupancy SE, z, and p(z) values (it does provide an estimate though). For my thesis we didn't include type 3 models that didn't run, but we didn't have any in the final set that threw NaNs.

Next step with model stuff will be editing the text to reflect this final set and making the table prettier. What do you think about the cranky model?

kaitlyngaynor commented 1 year ago

Did you do any googling about why these NaNs may crop up? That would be my next step. You don't get any warning messages?

klg-2016 commented 1 year ago

This is the warning that comes with the NaNs: Warning message: In sqrt(diag(vcov(obj, fixedOnly = fixedOnly))) : NaNs produced

The first suggestion online is to standardize the covariates, which I've done except for detect.obscured because that's binary. we discussed that covariate a bit in issue #127 .

Notes from responses about this issue to others online:

The bigger issue is that you have very little variation in occupancy to model.
What that might mean is that you don't have enough data, or that all your site counts may be too similar for the model to get a stable estimate of the effect of vegetation type.
from Ken Kellner to a particular query: In all cases the estimated occupancy intercept is very large (theirs were ~13-14, the ones from this model for civet and genet are large enough that the occupancy probability is basically 1, as we knew), which means the model is estimating occupancy probability to be 1 (plogis(some large number) = 1). Given the information about your dataset that I have, I am not surprised. You have a very small number of detections relative to the number of observations (~5%) (our detections/(detections+non-detections) is about 13%) which means detection probability will likely be very small (which the model results do show) and thus occupancy will be very high. This is especially true because you detected the species at least once at most of the sites (~60%).

My conclusion is that you are unlikely to get useful occupancy results from this dataset, at least using unmarked. The data are unfortunately just too sparse. It is not a good idea to use parameter estimates from unmarked when occupancy is estimated at 1 and/or there are issues getting the standard errors. I am guessing the occupancy intercepts are very large even for the other models where the SEs are not NaN.

I'm not sure what your observations represent (individual days?), but you could try collapsing your y matrix into a smaller number of longer observation periods e.g. weeks instead of days taking care that you still meet the assumptions of occupancy modeling in your study system such as population closure.

You may have more success fitting the model in a Bayesian framework, although I would expect the results in this case to be strongly influenced by the prior distributions for better or worse.

from Jeffrey Royle: Sparse data , small data sets, could cause it (e.g., either all sites occupied or not occupied or the right mix with respect to the covariates).
from Marc Kery: If ALL of your sites in your sample have at least one detection, then the occupancy estimate is bound to be 1 and then with the usual methods one cannot obtain an SE.

I think it comes down to civets and genets having basically 100% occupancy, but I'm not sure 1) why it only produces an issue with the one model or 2) what the best response is. Our detection estimates are also pretty low (negative values, like in the issue Ken Kellner responded to). I wonder if there is too little civet:honey badger overlap, so adding the lion covariate is making the model too complex? I'm also not sure how that connectss to the civet/genet 100% occupancy, though it relates to Royle's response. One option I see is just not including that model.

Keeping this link for reference: https://groups.google.com/g/unmarked/c/uWTMkNNDxRw/m/D6bKAEk_FQAJ

kaitlyngaynor commented 1 year ago

Good digging! Hmm. I do think we are asking these models to do quite a lot with a little data. It's probably overfit, once you put in that particular combination of covariates (which is why you aren't always seeing this problem).

I think that your "just not including that model" option is probably the most straightforward/appropriate—what are you thinking?

klg-2016 commented 1 year ago

I think I'm happy with that. From what I found/shared, it sounds like we shouldn't trust a model that doesn't produce all values, so that particular model should not be included for sure. Out of curiosity, I just ran that model (type 3, lion affecting civet:hb interaction) without detect.obscured, and it runs happily. I don't see my notes on it, but I do remember one of the first model building steps for my thesis was testing which detection covariates to include, and including both detect.obscured and cover.ground was the best option (so it's there for an AIC-based reason). At this point, the only other option I see would be excluding detect.obscured across the board. What do you think? Pros: probably lets us report all potential models we're interested in, excludes a variable you've said you don't love, lets us avoid having to explain why we're excluding a single species pair (which is not quite as straightforward as it was for my thesis, when the models simply didn't run). Cons: it takes time to rerun the models without detect.obscured (but doable if we decide today), detect.obscured does improve the model fit

kaitlyngaynor commented 1 year ago

Question about detect.obscured: does it do what we think it does? (In other words, is there consistently a negative effect of detect.obscured on detection probability?)

If so, then I think we keep it and justify the exclusion of the model as we did in your thesis (I think it's fine to just replace "model didn't run" with "model didn't produce Standard Errors so we excluded it from consideration")

If not, maybe we drop detect.obscured consistently. Could open new cans of worms by changing other parameter values, though...

klg-2016 commented 1 year ago

Yes, there is a consistent negative effect of detect.obscured on detection probability (for all models we're considering, the estimate is negative for detect.obscured for every species)

Sounds good, I'll keep going with it in.

kaitlyngaynor commented 1 year ago

Ok. Phew. Keep going with it in!

klg-2016 commented 1 year ago

Thank you!

klg-2016 commented 1 year ago

adding lion to the mix, picking the best model within the top 2 AIC with the fewest variables:

genet: tree
civet: termite, water, lion
honey badger: termite
marsh mongoose: water

klg-2016 commented 1 year ago

plot twist: only 2 of 6 type 3 models run well with the final list of covariates. the other 4 all produce NaNs for marsh mongoose and marsh mongoose:civet occupancy SE, z and p values. from what I read the other day, I don't think we should trust a model that produces NaNs like that

kaitlyngaynor commented 1 year ago

Aha. Probably not the most surprising plot twist as we are now asking lion to explain two different things in the model.

I might actually just remove all of the type 3 models from the results. Say you tried them, keep the research question in, but that most of them failed to produce reliable models likely because they were over parameterized so you can't draw any conclusions about lions. And keep the discussion about the general lack of lion effect on mesos but add something about civet effect. And mention that future work is needed to actually test your hypotheses about the effects of apex predators on interactions. Does that make sense?

klg-2016 commented 1 year ago

yep, that makes sense and sounds good! will work on adding/editing today and tomorrow

klg-2016 commented 1 year ago

I'm not sure how this happened, but I'm running the models again in the code for publication (to make sure everything runs with stuff cut out/renamed for clarity), and two of the models that had NaN warnings no longer have them tonight. the only thing I can think is that I'm pretty sure I had double scaled the lion covariate (it was already scaled in the spreadsheet I had, and I scaled it again last time), which I caught and fixed. it does not have a significant impact on either the type 1 or type 2 models, but it eliminates two NaN warnings like I said and slightly changes the AIC value of a third type 3 model. does that change your thoughts on whether to report them? now 4 of 6 seem to be running just fine. also does that make sense as an explanation for the difference? otherwise the code and inputs are exactly the same, I'm just renaming things/deleting unused columns from spreadsheets/etc.

kaitlyngaynor commented 1 year ago

Scaling an already-scaled variable should have no effect on their values. It's like multiplying by one.

BUT it is possible that the variable in the spreadsheet was scaled based on the mean and SD calculated from the values in the entire study area (however that was defined). Then you scaled it AGAIN just for the 60 values at the camera sites. In that case, you'd end up with slightly different values, and you probably SHOULD scale it again based on the mean & SD for just the 60 values, since that's what you did for all of the other variables.

Can you try plotting the values from the spreadsheet against the values after you scale them again, and see what it looks like? Also consider plotting the histograms of each of those values to see their distribution?

klg-2016 commented 1 year ago

In my head scaling would reduce the spread of the variable, but it sounds like I'm wrong there.

this is single and double scaled:

single scaled hist:

double scaled hist:

kaitlyngaynor commented 1 year ago

ok so they are indeed different! I'd re-scale using the mean and SD from the 60 cameras, then.

Scaling using the scale() function essentially just converts values of a variable to a Z-score, in which a value of 1 represents 1 SD above the mean, a value of 0 represents the mean, etc. So your values change slightly depending on what sample you are using (since they will have a different mean and SD). It's not really "double scaled." In this case you are using the sample of 60 cameras (rather than every single pixel in the study area). Does that make sense?

If you scale lion_latedry_scaled, it should produce the exact same thing that you input. Double-check to convince yourself.

klg-2016 commented 1 year ago

yep, that makes a lot more sense now - thank you! so I will scale the lion_latedry covariate, which should revert to 4/6 type 3 models not running, and brings us back to including the research question but not the models

kaitlyngaynor / gorongosa-mesocarnivores

Covariate selection #137