rbchan / unmarked

R package for hierarchical models in ecological research
https://rbchan.github.io/unmarked/

Transition to `reshape2` and allow pass-through arguments to `unmarkedFrame*` constructors if `formatLong`; add addn'l unit tests #114

Closed adamdsmith closed 5 years ago

adamdsmith commented 5 years ago

This PR started simply to allow additional arguments to pass through the formatLong function to the specified unmarkedFrame* constructor functions. It soon became clear, however, that it would be easier to accomplish this by transitioning from older functions in the reshape package (e.g., melt, cast, and recast) to their updated versions in the reshape2 package (e.g., melt, acast, dcast). The two commits in this PR separate those tasks (edit: four commits; I caught a couple of problems in the DESCRIPTION file).
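As a rough illustration of the kind of change involved (not the code from this PR), a long/wide round trip with reshape2 might look like the sketch below. The data frame and its column names are made up for the example, and the block only runs if reshape2 is installed.

```r
# Minimal sketch, assuming reshape2 is installed; the data are hypothetical.
if (requireNamespace("reshape2", quietly = TRUE)) {
  wide <- data.frame(site = c("A", "B"), y.1 = c(1, 0), y.2 = c(0, 2))

  # melt() exists in both reshape and reshape2: wide -> long.
  long <- reshape2::melt(wide, id.vars = "site",
                         variable.name = "visit", value.name = "count")

  # reshape2 replaces reshape's cast() with dcast() (returns a data frame)
  # and acast() (returns a matrix/array).
  wide_df  <- reshape2::dcast(long, site ~ visit, value.var = "count")
  wide_mat <- reshape2::acast(long, site ~ visit, value.var = "count")
}
```

`acast` is the natural choice when the result is meant to be a y matrix, since it returns a matrix with sites as row names.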

All relevant unit tests have been updated and pass (excepting an unrelated parboot test to be fixed in a follow-up PR), and I've added a test comparing a manually created unmarkedFramePCO (from the example in the help file) with the corresponding object created by formatLong.

Nonetheless, it's probably worth waiting to confirm the initial prompt for this PR passes, and throwing some additional tests for other unmarkedFrame* types at the formatLong, formatMult and csvToUMF functions.

rbchan commented 5 years ago

@adamdsmith are you planning on adding additional tests, or should I go ahead and merge this pull request?

Also, I'd like to add you to the list of package authors if that's okay by you.

adamdsmith commented 5 years ago

@rbchan Let's wait at least until Geraldine confirms that it works for her problem. I can also try to add a few more tests to this PR. In fact, towards that end I've discovered that formatMult is pretty buggy for automated creation of unmarkedMultFrames... Fixing that is likely a separate PR that I don't have time for right now. But, if you think it's helpful towards this PR, I can try to put together some basic tests for some of the other unmarkedFrame* objects handled by formatLong...

And if you think adding me to the list of package authors is warranted, I won't argue too much. :)

adamdsmith commented 5 years ago

@rbchan I added a unit test for a fairly basic unmarkedFramePCount; it checks out fine. I modified formatLong to handle the no-obsCovs restriction on data intended for distsamp; a basic unmarkedFrameDS checks out fine. And I've precluded the use of formatLong and formatWide for multinomial data. In the former case, I can't think of a way any reasonable person would enter multinomial data in a long format. And in the latter, there are competing type arguments that are not worth dealing with. We now advise using the appropriate constructor function directly.

rbchan commented 5 years ago

@adamdsmith I went ahead and merged because it looks like you've done plenty of testing. We can modify things later if Geraldine finds problems. Thanks.

gklarenberg commented 5 years ago

Hi @adamdsmith and @rbchan, thanks for working on this! I'm still struggling with something. I am trying to use formatLong to make an unmarkedFramePCO. Does formatLong work for cases where you do not have the same number of observations for each site within a year, and where the numbers also differ between years? In my case the sampling is uneven: site A might have 3 observations, site B 6 observations and site C 4 observations for year 1, but then for year 2 it's e.g. 5, 4 and 1. From the way I figure the function works, you'd need to take the max number of observations for a year (any year), and then populate it and leave NAs if there are fewer observations than the max. So it'd look like this, with y.1 to y.6 being the observations from year 1 and y.7 to y.12 the observations from year 2:

     y.1  y.2  y.3  y.4  y.5  y.6  y.7  y.8  y.9  y.10  y.11  y.12
A    1    0    1    NA   NA   NA   1    0    0    0     1     NA
B    0    0    0    1    1    0    0    0    1    1     NA    NA
C    1    1    1    0    NA   NA   1    NA   NA   NA    NA    NA

This way, when you tell the function how many years there are, it can count the columns and divide by the number of years (in this case 12/2) and you get all your observations in the correct year (or season). Am I correct in this understanding? I've tried going through the package code but wasn't quite able to figure it out. With my own data, I have 18 years, and when I made the unmarkedFramePCO manually, I created a matrix of nsites x (years * max_count) filled with NAs, as above. In my case that is 53 x (18 * 19), because there is one site that has 19 observations in a particular year. I populate this with the data. So then I get an unmarkedFrame object occu_file with

53 sites
Maximum number of observations per site: 342 
Mean number of observations per site: 115.66 
Number of primary survey periods: 18 
Number of secondary survey periods: 19 
Sites with at least one detection: 20

However when I use formatLong, I get

53 sites
Maximum number of observations per site: 243 
Mean number of observations per site: 115.66 
Number of primary survey periods: 18 
Number of secondary survey periods: 13.5 
Sites with at least one detection: 20 

Now, it is correct that the site with the most observations has 243 observations. But my sense is that when I provide the number of years, the function did 243/18 to get 13.5 secondary survey periods. Looking at my 'occu_file@y' it also looks like all the observations are at the beginning and NAs at the end. So I guess in short, my question is: how does the function "know" which observations belong to which year (when sampling is uneven)?
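For reference, the padded layout in the small three-site example table above can be reproduced with a short base R sketch. The `obs` list and the `pad` helper are hypothetical names used only for illustration (they are not part of unmarked), and the counts are copied from that table.

```r
# Observations per site and year, copied from the example table
# (site A: 3 then 5 observations, B: 6 then 4, C: 4 then 1).
obs <- list(
  A = list(c(1, 0, 1),          c(1, 0, 0, 0, 1)),
  B = list(c(0, 0, 0, 1, 1, 0), c(0, 0, 1, 1)),
  C = list(c(1, 1, 1, 0),       c(1))
)

n_years <- 2
# The largest number of observations in any site-year sets the block width.
max_obs <- max(sapply(obs, function(site) max(lengths(site))))

# Hypothetical helper: pad a vector with trailing NAs up to length n.
pad <- function(x, n) c(x, rep(NA, n - length(x)))

# Pad each year's block, lay the years side by side, one row per site.
y <- t(sapply(obs, function(site) unlist(lapply(site, pad, n = max_obs))))
colnames(y) <- paste0("y.", seq_len(ncol(y)))
y
```

With this construction every year occupies exactly `max_obs` columns, so `ncol(y) / n_years` is a whole number (here 12 / 2 = 6).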

adamdsmith commented 5 years ago

> how does the function "know" which observations belong to which year (when sampling is uneven)?

I expect it can only "know" based on the information you give it in the primaryPeriod matrix. See ?unmarkedFramePCO. If the primaryPeriod matrix isn't specified, it simply assumes balanced sampling, hence the 243/18 = 13.5 secondary survey periods.
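The arithmetic behind the two summaries can be checked directly. This is only a sketch of the division described above, using the column counts reported in this thread; the variable names are made up and nothing here calls unmarked itself.

```r
num_primary <- 18          # number of years (primary periods)

ncol_manual <- 18 * 19     # manually padded matrix: 342 columns
ncol_long   <- 243         # columns in the y matrix produced by formatLong

ncol_manual / num_primary  # 19 secondary periods
ncol_long / num_primary    # 13.5, not a whole number

# A quick sanity check before building the frame: the number of columns
# should be an exact multiple of the number of primary periods.
manual_ok <- ncol_manual %% num_primary == 0   # TRUE
long_ok   <- ncol_long %% num_primary == 0     # FALSE
```

A non-integer result like 13.5 is a sign that the columns cannot be split evenly into 18 equally sized blocks.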

Unfortunately, I don't have much experience constructing primaryPeriod matrices. In particular, I don't have any advice on constructing the matrix for unbalanced sampling and I find the examples in ?unmarkedFramePCO confusing...

gklarenberg commented 5 years ago

I added yearlySiteCovs as yearlySiteCovs=list(year=year) with year <- matrix(as.character(1:year_count), nrow(y_all), year_count, byrow=TRUE) (with year_count being 18 in my case) as per the examples - but yes, I have also been confused on whether that's sufficient. I also set numPrimary=year_count
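To show what that `matrix(...)` call produces, here is a scaled-down sketch with 4 sites and 3 years. `n_sites` stands in for `nrow(y_all)` and the sizes are made up for the example.

```r
year_count <- 3   # 18 in the real data
n_sites <- 4      # stands in for nrow(y_all)

# byrow = TRUE recycles the labels "1", "2", "3" along each row,
# so every site gets the same sequence of year labels.
year <- matrix(as.character(1:year_count), n_sites, year_count, byrow = TRUE)
year
```

The result has one row per site and one column per year, which only labels the primary periods; it carries no information about how many observations each site-year has.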

gklarenberg commented 5 years ago

And I apologize, maybe I didn't make my question clear. Obviously the algorithm does some sort of counting of columns and then takes the number of primary periods to figure out which columns (observations) belong to which year. So I am trying to understand if it takes observations sequentially (like my example), or whether it takes yr1_obs1, yr2_obs1, yr1_obs2, yr2_obs2, yr1_obs3, yr2_obs3 (because that's what the output from formatLong() looks like in my case). Essentially I am trying to figure out whether my self-constructed example of y is correct, or the one I get from using formatLong(). From looking at the y component when using formatMult() (and assuming that pcountOpen() needs a similar matrix), my self-constructed example is correct, i.e. y.1 through y.6 are the observations for the first year, y.7 through y.12 are for the second year, etc. So I deduce that when the number of primary periods is given to unmarkedFramePCO(), the algorithm takes the number of columns, divides it by the number of primary periods, and then it "knows" which observations belong to which year. So for now I am going ahead with the manual construction, but if anyone knows whether I'm completely off-base here, please let me know!
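The two candidate column orderings can be written out as index vectors. This is a sketch under the assumption stated above (year-major blocks of equal width); it does not inspect unmarked's internals, and the variable names are made up.

```r
num_primary <- 2   # years
J <- 6             # columns per year (max observations in any site-year)

# Sequential (year-major) layout: columns 1..6 are year 1, 7..12 are year 2.
year_of_col_sequential <- rep(seq_len(num_primary), each = J)

# Interleaved layout: yr1_obs1, yr2_obs1, yr1_obs2, yr2_obs2, ...
year_of_col_interleaved <- rep(seq_len(num_primary), times = J)

# Columns belonging to year t in the sequential layout.
cols_for_year <- function(t) ((t - 1) * J + 1):(t * J)

# Column permutation that turns an interleaved matrix into a sequential one;
# order() keeps ties in their original order, so observations stay in sequence.
reorder_idx <- order(year_of_col_interleaved)
```

If a y matrix turned out to be interleaved, `y[, reorder_idx]` would rearrange it into year-major blocks, provided every year really has `J` columns.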