rbchan / unmarked

R package for hierarchical models in ecological research
https://rbchan.github.io/unmarked/
36 stars 25 forks source link

C++ engine for gdistsamp #130

Closed kenkellner closed 5 years ago

kenkellner commented 5 years ago

Supports all combinations of key function, sampling type, and mixture. I added a bunch of tests to try to cover as many of the combinations as possible.

A possible future CRAN issue is that there are now some C++ compiler warnings apparently as a result of some namespace overlaps between RcppArmadillo and R_ext (used for integration). It does ultimately compile fine though, at least on my computer, and R CMD check is clean.

@rbchan I also found a possible bug in the R engine when there are missing y values. Can you check the commit comment in the patch to see if you agree with the change I made?

rbchan commented 5 years ago

@kenkellner This looks great. Regarding gdistsamp, it's been a long time, but isn't it the case that the vector y[i,,t] must be all NA or have no NAs? In other words, you can't have a count in the one distance interval without having counts in all the distance intervals (for a given site and a given primary period). Perhaps that is why I didn't try to handle NA's in that case. Or perhaps it was an oversight.

kenkellner commented 5 years ago

That makes sense to me. However, that's not how it seems to work currently - you can have individual NAs in a y vector and the algorithm drops just those values and not the whole primary period/site, based on my tests.

I commented the relevant section of code (before my edits) below to try to clarify the issue:

#naflag is a boolean array indicating when the corresponding y value is NA.

cp[J+1] <- 1 - sum(cp) # cp values for all Ys, even Ys that are NAs, included in sum

mn[, t] <- lfac.k - lfac.kmyt[i, t,] +
                    sum(y[i, t, !naflag[i,t,]] *  #NAs dropped here leaving a shorter vector
                    log(cp[which(!naflag[i,t,])])) +  #NAs dropped from cp here as well to match
                    kmyt[i, t,] * log(cp[J+1])  #but not accounted for here since cp[J+1] sums all cp

I think I agree with you though that if there are any missing values the whole site/primary period should be dropped from the likelihood. It's harder to imagine a common situation where you have missing distance interval counts vs. missing replicated counts (as with gmultmix). It would be pretty easy to change this in the code if you think that's how it should work.

rbchan commented 5 years ago

Yes, I think that is how it should work. And I thought that is how I had coded it, but it sounds like I was mistaken. If you're up for changing it, that would be great. Thanks again.