mdscheuerell opened 5 years ago
If you guys have any scripts for model fits and selection with covariates from other projects can you send those along? I'd like to start playing with those.
Hi @DavidTallmon, I'd be happy to do that, but I was planning on adding some plots etc. to the script / RMarkdown in this repo. I imagine that would be simpler than asking you to modify existing code from other projects. If you can give me a day or two I'll try to carve out some time to work on this.
Sounds great. Thanks.
> If you can give me a day or two I'll try to carve out some time to work on this.
I'm trying to wrap up some other things so I can focus some time on this as well.
I'm also happy to modify any existing code if you guys can't find time and want to send some along.
I'm running the Auke coho models today...standby...
Things seem to be working fairly well now. The Beverton-Holt model still has some unpleasantness when it comes to `tau_M` and `tau_S`:
and the energy diagnostics for all three models suggest some thick tails that are difficult to sample from:
All in all, though, regularizing away the left tail of the obs error SDs makes these models much better-behaved, and faster: at least an order of magnitude speedup in wall time per iteration, and even more than that per effective sample.
Here's what the regularization looks like for the Ricker model:
The LOO comparison is still hopeless; 100% of the pointwise likelihood posteriors are in the "bad" or "very bad" categories. For the time being I'm proceeding with the Ricker, since the productivity estimate is more reasonable than the B-H. As you can see, there's still basically no information about the slope at the origin:
That said, the fits to the abundance data don't look bad at all:
<img src="https://user-images.githubusercontent.com/22503560/88860660-17fb8e80-d1b1-11ea-9080-b23680c5d04b.png" height="50%" width="50%">
I have some more ideas for plots to show some of the interesting life-history patterns we can infer from the parameter estimates -- for example, annual SAR and ocean-age proportions for 2- and 3-year-old smolts -- but for now the existing plots (and other code chunks in the script) are incorporated into the RMarkdown via `knitr::read_chunk()`.
Here are some (arguably) nifty plots showing life-history variation through time.
For the freshwater stage, we have the smolt recruitment process error residuals (i.e., cohort strength after adjusting for the expected density dependence) and the proportion of each cohort that outmigrated as 2- vs. 3-year-old smolts. Age-2 is usually dominant, but not always.
And for the marine phase, SAR and the propensity to return as a jack given smolt age. SAR for 2- and 3-year-old smolts is remarkably uncorrelated, which is also reflected in the estimates of `R_MS`. Older smolts usually survive better, but not always (what was going on in the early 2000s?) and are more likely than age-2 smolts to return as jacks.
I added some more plots comparing model fit (states) to observed data, in this case smolt and spawner age composition. Together with the smolt and spawner abundance time series shown above, these constitute all the likelihood components for this model.
It's a little surprising to see such lack of fit to the adult ocean age and Gilbert-Rich age data, considering the (perhaps dubious) precision implied by a multinomial likelihood with such large sample sizes (indicated by the sample binomial CIs). I suspect this is a, if not the, major contributor to the too-skewed-to-use pointwise likelihood posteriors and thus bad / very bad Pareto ks. (I can see @mdscheuerell nodding.)
On that note, @DavidTallmon, can you remind me of the provenance of these age-composition samples? It looks like every spawner over the weir is classified by ocean age (jack vs. 1-salt), a subset of them are sampled for full fresh-salt age, and a subset of outmigrating smolts are aged. Do I have that right?
I also considered whether we ought to exclude jacks from the "spawners" in the spawner-recruit function as we've done in the OR coho analysis, on the grounds that it gets closer to a female-only model and removes noise associated with variable jack proportions. However, it turns out there's no discernible relationship between the proportion of spawners that are jacks (either observed or estimated) and the recruitment process errors, so...meh. On the other hand, that doesn't mean variable sex ratio might not explain some of the process error. I can't remember, do we have sex-composition data?
Different but related: the model currently assumes jacks are not subject to harvest. Is that a good assumption?
These are definitely nifty figures to study. Scott is looking into the early 00s marine survival/age structure anomaly to see if there's a reasonable biological or sampling explanation. Maybe the covars will help with that, though I wish we had some early marine predator info to include. You're correct on age comp. All spawners are ocean aged and sexed at the weir. Approx 5-10% of smolts and 20% of spawners are aged from scales each year for at least many years and probably many, many years back. The zero harvest of jacks seems a reasonable assumption. I can't think of any fishery in SEAK that would harvest jacks in any appreciable numbers.
Thanks @DavidTallmon. I guess it might be worth looking at the sex ratios to see if they're associated with the smolt recruitment anomalies.
Re spawner age structure, the plot (LOL) thickens. It dawned on me that the weir data and the scale samples might imply different ocean-age proportions, which would explain why the model struggles to fit either one. Sure enough:
I'm on the fence about whether it's even statistically valid to include both of these as likelihood components, since the scale samples are a subset of the same individuals aged at the weir, hence obviously nonindependent. Technically the answer is clearly no, although there is some research suggesting nonindependent data in an IPM can sometimes be innocuous. @mdscheuerell, any thoughts on this? We could try leaving out the weir data on ocean age and just using the scale samples (the reverse won't work, as the full Gilbert-Rich age is needed to identify the smolt age-specific SAR and ocean age probabilities). Or I suppose we could remove the individuals with scale samples from the weir samples so they're independent, which wouldn't require any additional info. That would only increase the conflict b/w the two data sets, though.
I think we could really benefit from bringing in Scott's insights on daily operations at the weir. Perhaps there is/was historically some scale sampling weirdness occurring -- it seems like it would be really difficult to misassign ocean ages at the weir given the limited classes and obvious differences b/w classes. (I've emailed Scott w/ the fig above w/ a request for any insights.)
Yeah, @svulstek's input would be helpful. He's a collaborator on this repo, so would be best if he weighed in here to keep the discussion in one place.
Re: the use of both visual age data and scale samples, am I correct in the that the scale samples were taken from fish that were also aged visually, or are they a separate group?
This also strikes me as an opportunity to show the comparison between the information you could expect from the two different sample types by trying @ebuhle's suggestion to only use one or the other data type.
Yes, I think all spawners over the weir are aged visually. `n_MSage0_obs + n_MSage1_obs == S_obs`.
(Edited for insufficient caffeine.)
Yes, to my limited knowledge, there are only jacks and 1 SW coho in all of SEAK. Ages and sexes are assigned (if I can say that) at the weir.
The decline in age-3 smolt survival is really interesting. Does the smolt-to-adult survival include jacks as adults (all spawners), or are spawners separated into adults and jacks? I think it's the former, but want to be sure. Also, is there a cheat sheet for the model parameter abbreviations for IPM_SMaS_np? Some parameters are easy to infer from the abbreviations (a, Rmax, etc.), but others take a bit of time to suss out. I couldn't find this in the few places I looked.
> The decline in age-3 smolt survival is really interesting. Does the smolt-to-adult survival include jacks as adults (all spawners), or are spawners separated into adults and jacks? I think it's the former, but want to be sure.
That caught my eye too. And yes, you are correct. `s_MS[t,j]` (SAR, or survival from smolts `M` to spawners `S`, for smolts of age class `j` in ocean entry year `t`) applies to returning spawners of both ages, and `p_MS[t,j,k]` is the conditional probability of returning at ocean age class `k` given that you survive. The notation `p_x` refers to conditional age probabilities for life stage `x` indexed by cohort, and `q_x` is the age distribution indexed by calendar year. So, for example, the number of 2-year-old smolts (smolt age class 1) that return as jacks (ocean age class 1, not subject to harvest) in the same calendar year they outmigrated is `M[t]*q_M[t,1]*s_MS[t,1]*p_MS[t,1,1]`.
When I have some time I'll write out the math in the RMarkdown, which will hopefully clarify things. The motivation for using this SAR / conditional-age-at-return parameterization is that it rearranges the otherwise nonidentifiable annual marine survival and maturation probabilities into identifiable functionals.
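To make the bookkeeping concrete, here's a toy version of that last calculation in R. All numbers are made up for illustration; they are not estimates from the Auke Creek fit.

```r
# Hypothetical values for a single ocean-entry year t (illustration only)
M_t    <- 50000                 # total smolts outmigrating in year t
q_M_t  <- c(0.7, 0.3)           # smolt age distribution: age-2, age-3
s_MS_t <- c(0.02, 0.03)         # SAR by smolt age class
p_MS_t <- rbind(c(0.15, 0.85),  # age-2 smolts: P(jack), P(1SW) given survival
                c(0.05, 0.95))  # age-3 smolts: P(jack), P(1SW) given survival

# Expected number of 2-year-old smolts (age class 1) returning as jacks
# (ocean age class 1): M[t] * q_M[t,1] * s_MS[t,1] * p_MS[t,1,1]
jacks_age2 <- M_t * q_M_t[1] * s_MS_t[1] * p_MS_t[1, 1]
jacks_age2  # 105
```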
> Also, is there a cheat sheet for the model parameter abbreviations for IPM_SMaS_np? Some parameters are easy to infer from the abbreviations (a, Rmax, etc.), but others take a bit of time to suss out. I couldn't find this in the few places I looked.
`par_defs("IPM_SMaS_np")` will give you a cheat sheet. The `Dimensions` column is a little off in some places, due to a combination of transcription errors from the Stan code and the fact that multiple models may contain parameters with the same name but different dimensions (we have an open issue in salmonIPM to address the latter), and the order of appearance is a bit scattershot, but the definitions are right.
```
> par_defs("IPM_SMaS_np")[,-2]
      Parameter/state Definition
 [1,] "alpha"         "Intrinsic productivity"
 [2,] "Rmax"          "Asymptotic recruitment"
 [3,] "p_HOS"         "True prop of hatchery-origin spawners (HOS) in years with H fish present"
 [4,] "B_rate_all"    "True broodstock take rate in all years"
 [5,] "S"             "True total spawner abundance"
 [6,] "beta_M"        "Regression coefs for spawner-smolt productivity"
 [7,] "rho_M"         "AR(1) coefs for spawner-smolt productivity"
 [8,] "sigma_M"       "SD of spawner-smolt process error"
 [9,] "mu_MS"         "Mean smolt-to-adult return (SAR)"
[10,] "beta_MS"       "Regression coefs for SAR"
[11,] "rho_MS"        "AR(1) coefs for SAR"
[12,] "sigma_MS"      "SD of SAR process error"
[13,] "s_MS"          "True SAR by outmigration year"
[14,] "tau_M"         "SD of smolt observation error"
[15,] "tau_S"         "SD of spawner observation error"
[16,] "M"             "True smolt abundance (not density) by outmigration year"
[17,] "mu_p_M"        "Popn mean smolt age distributions"
[18,] "sigma_p_M"     "SDs of log-ratio cohort smolt age distribution"
[19,] "R_p_M"         "Correlation matrices of log-ratio smolt age distns"
[20,] "p_M"           "True smolt age distributions by brood year"
[21,] "q_M"           "True smolt age distributions by calendar year"
[22,] "R_MS"          "Correlation matrices of logit SAR by smolt age"
[23,] "mu_p_MS"       "Popn mean ocean age distributions for each smolt age"
[24,] "sigma_p_MS"    "SDs of log-ratio ocean age for each smolt age"
[25,] "R_p_MS"        "Correlation matrices of log-ratio ocean age distns"
[26,] "p_MS"          "True ocean age distns by outmigration year"
[27,] "q_MS"          "True ocean age distns of spawners"
[28,] "q_GR"          "True Gilbert-Rich age distns of spawners"
```
Perfect. Thanks. (I'm sure you told me about par_defs a couple years ago, but I forgot. My bad.)
No problem, I sometimes forget how some of this stuff in salmonIPM works myself! (Exhibit A: the fact that I had to edit my last post after realizing I left out a key part of the "equation".)
This is an opportune time to mention that over the weekend, I finally tackled the "new and improved" priors on initial states in `IPM_SMaS_np`. Briefly, what I had done before was to put diffuse lognormal priors on smolt abundance `M` and spawner abundance `S`, and simplex-uniform priors on the smolt age distribution `q_M` and the Gilbert-Rich spawner age distribution `q_GR`, in the early years of the time series -- that is, the years before all age classes could be generated from previous years' states using the process model. That's an inefficient use of information, though, because some age classes during that initial window can in fact be generated from the process model. It's only the "orphan" age classes -- smolts that were born or spawners that outmigrated, respectively, before year 1 -- that need priors. By not using all the available information, I was depriving the process model of "data points" (really states) that could be used to help estimate the hyperparameters. This matters especially in short time series.
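(For anyone following along: "simplex-uniform" means a Dirichlet(1, ..., 1) prior. A minimal base-R sketch of drawing from such a prior -- purely illustrative, not the salmonIPM internals -- over, say, the 4 Gilbert-Rich age classes:

```r
# A uniform prior over the K-simplex is Dirichlet(1, ..., 1).
# One standard way to draw from it: normalize K iid Gamma(1) (= Exponential) draws.
rdirichlet_unif <- function(K) {
  g <- rgamma(K, shape = 1)
  g / sum(g)
}

set.seed(1)
q_GR_init <- rdirichlet_unif(4)  # one draw over the 4 Gilbert-Rich age classes
sum(q_GR_init)                   # sums to 1 by construction
```
)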
It took me an annoyingly long time to work out a general solution to this problem that, crucially, preserves the prior simplex-uniformity over the orphan age classes, and even longer to implement it in this particular model because the whole smolt age / ocean age bookkeeping is such a headache. After some teeth-gnashing, I'm pretty sure I got it right. The updates are in the `master` branch of salmonIPM now and available to use if you uninstall and reinstall from GitHub. And the code and results in this repo are updated in the `develop` branch, and will be in `master` pending approval of a PR.
Most of the changes are pretty subtle, but this rather striking one is relevant to the above discussion. The uncertainty in the early years of SAR and ocean age probabilities really tightens up, and it turns out that SAR of age-3 smolts has been relatively low in the past too:
I'm having no luck uninstalling the master version and installing from the develop branch. I tried:
`devtools::uninstall_packages("ebuhle/salmonIPM")`, which I thought would uninstall the current salmonIPM, but I got: `Error: 'uninstall_packages' is not an exported object from 'namespace:devtools'`
I had planned to follow with `devtools::install_packages("ebuhle/salmonIPM@develop")`. I'm assuming this will give me an error, too, but again, I'm not sure what is wrong w/ this syntax.
My devtools package is up to date. Can you help w/ proper un/install from the master and develop branches?
Hey @DavidTallmon, I think you'll have to use
`remove.packages("salmonIPM")`
Weird. That didn't work at first. But this did:

`install.packages("salmonIPM")`
`remove.packages("salmonIPM")`

I don't know how the package could've been removed before that; I couldn't find any code showing that I had removed it previously. Could it be related to a previous "pull"? But, continuing on, this did not work:

`devtools::install_packages("ebuhle/salmonIPM@develop")`
`Error: 'install_packages' is not an exported object from 'namespace:devtools'`

I would like to install salmonIPM from the develop branch. Suggestions?
Yeah, the lingo is a bit confusing among the different install options. Here you want
devtools::install_github("ebuhle/salmonIPM@develop")
Of course. I should've seen that. Thanks.
Everything seemed to work, but got this failure message (at bottom). Is it really a failure, or can we just call it a very stern warning?
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo ebuhle/salmonIPM@develop
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?
1: All
2: CRAN packages only
3: None
4: backports (1.1.7 -> 1.1.8 ) [CRAN]
5: processx (3.4.2 -> 3.4.3 ) [CRAN]
6: rlang (0.4.6 -> 0.4.7 ) [CRAN]
7: vctrs (0.3.1 -> 0.3.2 ) [CRAN]
8: pillar (1.4.4 -> 1.4.6 ) [CRAN]
9: tibble (3.0.1 -> 3.0.3 ) [CRAN]
10: pkgbuild (1.0.8 -> 1.1.0 ) [CRAN]
11: loo (2.3.0 -> 2.3.1 ) [CRAN]
12: sys (3.3 -> 3.4 ) [CRAN]
13: xfun (0.15 -> 0.16 ) [CRAN]
14: dplyr (1.0.0 -> 1.0.1 ) [CRAN]
15: rstan (2.19.3 -> 2.21.2) [CRAN]
16: DT (0.14 -> 0.15 ) [CRAN]
17: rstantools (2.0.0 -> 2.1.1 ) [CRAN]
Enter one or more numbers, or an empty line to skip updates: 1

backports  (1.1.7  -> 1.1.8 ) [CRAN]
processx   (3.4.2  -> 3.4.3 ) [CRAN]
rlang      (0.4.6  -> 0.4.7 ) [CRAN]
vctrs      (0.3.1  -> 0.3.2 ) [CRAN]
pillar     (1.4.4  -> 1.4.6 ) [CRAN]
tibble     (3.0.1  -> 3.0.3 ) [CRAN]
V8         (NA     -> 3.2.0 ) [CRAN]
pkgbuild   (1.0.8  -> 1.1.0 ) [CRAN]
loo        (2.3.0  -> 2.3.1 ) [CRAN]
sys        (3.3    -> 3.4   ) [CRAN]
xfun       (0.15   -> 0.16  ) [CRAN]
dplyr      (1.0.0  -> 1.0.1 ) [CRAN]
rstan      (2.19.3 -> 2.21.2) [CRAN]
DT         (0.14   -> 0.15  ) [CRAN]
rstantools (2.0.0  -> 2.1.1 ) [CRAN]

Installing 15 packages: backports, processx, rlang, vctrs, pillar, tibble, V8, pkgbuild, loo, sys, xfun, dplyr, rstan, DT, rstantools
Installing packages into ‘C:/Users/datallmon/R_LIBS’ (as ‘lib’ is unspecified)
There are binary versions available but the source versions are later:

          binary source needs_compilation
backports  1.1.7  1.1.8              TRUE
DT          0.14   0.15             FALSE
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/processx_3.4.3.zip' Content type 'application/zip' length 1147681 bytes (1.1 MB) downloaded 1.1 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/rlang_0.4.7.zip' Content type 'application/zip' length 1135164 bytes (1.1 MB) downloaded 1.1 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/vctrs_0.3.2.zip' Content type 'application/zip' length 1160471 bytes (1.1 MB) downloaded 1.1 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/pillar_1.4.6.zip' Content type 'application/zip' length 181902 bytes (177 KB) downloaded 177 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/tibble_3.0.3.zip' Content type 'application/zip' length 416247 bytes (406 KB) downloaded 406 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/V8_3.2.0.zip' Content type 'application/zip' length 9193571 bytes (8.8 MB) downloaded 8.8 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/pkgbuild_1.1.0.zip' Content type 'application/zip' length 141218 bytes (137 KB) downloaded 137 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/loo_2.3.1.zip' Content type 'application/zip' length 843911 bytes (824 KB) downloaded 824 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/sys_3.4.zip' Content type 'application/zip' length 59680 bytes (58 KB) downloaded 58 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/xfun_0.16.zip' Content type 'application/zip' length 232731 bytes (227 KB) downloaded 227 KB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/dplyr_1.0.1.zip' Content type 'application/zip' length 1309742 bytes (1.2 MB) downloaded 1.2 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/rstan_2.21.2.zip' Content type 'application/zip' length 5339661 bytes (5.1 MB) downloaded 5.1 MB
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.0/rstantools_2.1.1.zip' Content type 'application/zip' length 160718 bytes (156 KB) downloaded 156 KB
package ‘processx’ successfully unpacked and MD5 sums checked
Error: Failed to install 'salmonIPM' from GitHub: (converted from warning) cannot remove prior installation of package ‘processx’
So, first off, there's no need to install the `develop` branch of salmonIPM, because

> The updates are in the `master` branch of salmonIPM now and available to use if you uninstall and reinstall from GitHub.

The updated code in the `develop` branch of this repo is still awaiting approval of a PR, which I might just go ahead and approve myself as long as you (or @mdscheuerell) don't have any uncommitted changes sitting in your local `master`.
As for the error, try this: from a fresh R session, do `update.packages(checkBuilt = TRUE, ask = FALSE)`. Then start a new session, install salmonIPM, and decline any further updates (`3: None`). It's not clear to me that this is actually the reason for the error, but at least we can rule it out.
> The updated code in the develop branch of this repo is still awaiting approval of a PR
Done. Sorry I missed that previously.
> which I might just go ahead and approve myself
In the future, @ebuhle should just go ahead and approve PR's.
Cool, thanks. Not trying to make extra work for you, just to avoid the multiple-cooks-in-the-kitchen-at-the-same-time problem. :grin:
So, back to the proportions of jacks/age structure of spawners and the disparity between datasets (at weir vs scales)..... I talked to Josh today. The scale samples are not random or systematic samples....they are taken from about 20% of individuals as the opportunity presents itself. In contrast, the visual calls of sex/age at the weir are of all individuals as they are passed upstream. Given that, the visual calls seem the better data for likelihood construction and interpretation. Also, Josh confirmed that < 20 CWT jacks have been reported as harvested by ADFG since 1980, and the ADFG expansion is that < 34 jacks have been harvested over the time series. Seems like we can safely say not subject to harvest.
> So, back to the proportions of jacks/age structure of spawners and the disparity between datasets (at weir vs scales)..... I talked to Josh today. The scale samples are not random or systematic samples....they are taken from about 20% of individuals as the opportunity presents itself. In contrast, the visual calls of sex/age at the weir are of all individuals as they are passed upstream. Given that, the visual calls seem the better data for likelihood construction and interpretation.
Oh dear. That's too bad. Without the full FW/SW age, there's no way to identify the SAR and ocean age-proportions specific to each smolt age, which renders a big chunk of this model moot, or at least underdetermined.
I mean, I guess technically I should say that other likelihood components (total spawner abundance and ocean age composition) might partially identify or bound those parameters / states -- in a very general sense, that's part of the rationale for IPMs -- but I strongly suspect that the information will be weak at best. Of course we can try it and find out, but this seems like a major wrench in the works.
On another note, were you able to (re)install salmonIPM?
Uh-oh. Is there not a way to use full FW/SW age info from the scales from the sampled 20% of spawners to infer the smolt ages of the 100% of spawners IDd to SW age visually at the weir?
On the other note, I am still encountering some frustrations. For example:
update.packages(checkBuilt = TRUE, ask = FALSE)
Warning in install.packages(update[instlib == l, "Package"], l, repos = repos, :
  'lib = "C:/Program Files/R/R-4.0.2/library"' is not writable
Error in install.packages(update[instlib == l, "Package"], l, repos = repos, :
  unable to install packages
I have admin privileges, so I'm not sure why /library is not writable. I also noticed that many of the recently installed packages have installed to
C:\Users\datallmon\R_LIBS
I'm guessing this is all related to the synch weirdness from my uni that I'm not clever enough to outsmart. This is a super-frustrating indicator of that:
devtools::install_github("ebuhle/salmonIPM")
Error in loadNamespace(name) : there is no package called ‘devtools’

install.packages("devtools")
Installing package into ‘\sync-al/files$/JD-880061-A2/datallmon/My Documents/R/win-library/4.0’ (as ‘lib’ is unspecified)
also installing the dependencies ‘utf8’, ‘lifecycle’, ‘pillar’, ‘pkgconfig’, ‘vctrs’, ‘BH’, ‘sys’, ‘ini’, ‘tibble’, ‘backports’, ‘ps’, ‘lazyeval’, ‘base64enc’, ‘later’, ‘askpass’, ‘highr’, ‘markdown’, ‘xfun’, ‘clipr’, ‘crayon’, ‘curl’, ‘fs’, ‘gh’, ‘git2r’, ‘glue’, ‘purrr’, ‘rematch2’, ‘rprojroot’, ‘whisker’, ‘yaml’, ‘processx’, ‘R6’, ‘assertthat’, ‘fansi’, ‘digest’, ‘rex’, ‘htmltools’, ‘htmlwidgets’, ‘magrittr’, ‘crosstalk’, ‘promises’, ‘mime’, ‘openssl’, ‘prettyunits’, ‘xopen’, ‘brew’, ‘commonmark’, ‘knitr’, ‘Rcpp’, ‘stringi’, ‘stringr’, ‘xml2’, ‘evaluate’, ‘praise’, ‘usethis’, ‘callr’, ‘cli’, ‘covr’, ‘desc’, ‘DT’, ‘ellipsis’, ‘httr’, ‘jsonlite’, ‘memoise’, ‘pkgbuild’, ‘pkgload’, ‘rcmdcheck’, ‘remotes’, ‘rlang’, ‘roxygen2’, ‘rstudioapi’, ‘rversions’, ‘sessioninfo’, ‘testthat’, ‘withr’
> Uh-oh. Is there not a way to use full FW/SW age info from the scales from the sampled 20% of spawners to infer the smolt ages of the 100% of spawners IDd to SW age visually at the weir?
I'm not sure I follow. What we've been doing is using the scale samples of FW/SW age to inform the estimates of SAR and conditional age at maturity for each smolt age, for the entire population (so, yes, the 100% of visually aged spawners at the weir). You need observations of full FW/SW age to do this -- ocean age alone can't distinguish the fates of 2- vs. 3-year-old smolts -- so in that sense, the scale samples are essential and the visual observations of SW age are a nonessential bonus that provides some additional information.
The problem is that those two data types are in conflict, which we can see by comparing the piece of information (SW age) they have in common. That's why the model is having a hard time fitting either one, despite the high precision implied by the multinomial likelihood with large sample sizes. And now we have an explanation: the scale samples aren't representative of the population as a whole. Even if we were to ignore the SW age from the scales and only use the FW age (maybe what you were suggesting?) it wouldn't change this fact, just hide it. Anyway, I don't think the FW age alone would be enough to identify the states / params in question -- you need the full 2 x 2 contingency table, as it were, not just the margins. Maybe there'd be some sort of indirect partial identifiability like I alluded to earlier, but again, it's a moot point if the scales are a severely biased sample from the population.
Does that make sense? Or am I missing something?
> On the other note, I am still encountering some frustrations. For example:
Hrrrmm. Yeah, this does seem like some kind of Big Brother IT situation. (Who needs funding cuts when you can just prevent researchers from doing research?) And yet you've obviously managed to install and use packages before, so that's weird. What is `.libPaths()`? It may be that either (A) you have somehow changed the default library directory, or (B) you will need to change the default library directory to evade Big Brother. I never, ever futz with the default library and I always get confused by this stuff, so I would need to consult the interwebz for instructions, but let's see what it is currently.
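For reference, a quick way to check (and, if needed, redirect) where R installs packages. The Windows path in the commented line is just the one that appeared in the install log above, not a recommendation:

```r
# Show the library search path; packages install to the first writable entry
.libPaths()

# If the system library isn't writable, prepend a user-writable directory
# (path copied from the install log above; adjust as needed):
# .libPaths(c("C:/Users/datallmon/R_LIBS", .libPaths()))
```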
OK, so thinking about this as a multi-way contingency table analysis gave me an idea for a possible Hail Mary. Tell me if this makes sense.
We know that jacks are generally underrepresented in the opportunistically collected scale samples. I can think of two reasons why this might be the case (maybe there are others?):
If it's (1), then there's no immediately obvious reason why scale samples within each ocean age would be biased w.r.t. FW age. If it's (2), then who knows -- I could imagine that 2FW and 3FW fish come in at different times. If the date of each sample was recorded, then I suppose we could test this empirically.
Anyway, if (big, crucial IF) we can safely assume the scale samples within each SW age are unbiased w.r.t. FW age, then we could combine the visual SW age observations with the FW age from the scales to estimate what we need. Consider a toy example of age-frequency data for a given year (year indices are omitted for clarity), formulated as a 2 x 2 contingency table:
| FW \ SW | 0 | 1 |
|---|---|---|
| 2 | `n_GRage2_2_obs` | `n_GRage3_2_obs` |
| 3 | `n_GRage3_3_obs` | `n_GRage4_3_obs` |
The cell frequencies are the usual Gilbert-Rich age data as currently defined in `IPM_SMaS_np`, which we get from scale samples. We model them using a multinomial likelihood with the following cell probabilities (again, I'm dropping the year index `t` from `q_GR[t,a]` to focus on age class `a`):
| FW \ SW | 0 | 1 |
|---|---|---|
| 2 | `q_GR[1]` | `q_GR[2]` |
| 3 | `q_GR[3]` | `q_GR[4]` |
These cell probabilities are a function of several underlying states, including the smolt age-specific SAR and conditional age-at-maturity probabilities, but the important point is that they contain all the information about spawner age that we need to identify those states (and the hyperparameters that govern them).
The problem we've just discovered is that the observed cell frequencies `n_GRage_obs` are not generated by (even pseudo-) random sampling from the population, so the multinomial likelihood is invalid; we cannot use it to estimate the cell probabilities `q_GR`. But suppose the nonrandomness is only w.r.t. SW age, and that FW age is randomly sampled within each SW age. In contingency table parlance, we have fixed column margins, as would be the case in a designed experiment. It's as if we deliberately sampled some arbitrary number of scales from jacks `n0_obs` and 1SW fish `n1_obs`:
| FW \ SW | 0 | 1 |
|---|---|---|
| 2 | `n_GRage2_2_obs` | `n_GRage3_2_obs` |
| 3 | `n_GRage3_3_obs` | `n_GRage4_3_obs` |
| **Total** | `n0_obs` | `n1_obs` |
The column totals are meaningless, but the row frequencies within each column allow us to estimate the conditional FW age probabilities `q_FW`:
| FW \ SW | 0 | 1 |
|---|---|---|
| 2 | `q_FW[1]` | `q_FW[2]` |
| 3 | `q_FW[3]` | `q_FW[4]` |
where, e.g., `q_FW[1] = q_GR[1] / (q_GR[1] + q_GR[3])` and `q_FW[4] = q_GR[4] / (q_GR[2] + q_GR[4])`. (Yes, I know the notation is tortured, but I'm trying to keep the variable names in tables 1 & 2 as they currently exist, rather than starting from scratch with what makes sense mathematically in this toy example.) So the multinomial likelihood becomes two multinomials (binomials). In cartoon form:
```
[n_GRage2_2_obs, n_GRage3_3_obs] ~ multinomial([q_FW[1], q_FW[3]], n0_obs)  # P(FW age | SW age = 0)
[n_GRage3_2_obs, n_GRage4_3_obs] ~ multinomial([q_FW[2], q_FW[4]], n1_obs)  # P(FW age | SW age = 1)
```
Now we need to estimate the column-marginal ocean age probabilities in the population:
| FW \ SW | 0 | 1 |
|---|---|---|
| 2 | `q_GR[1]` | `q_GR[2]` |
| 3 | `q_GR[3]` | `q_GR[4]` |
| **Total** | `q_MS[1]` | `q_MS[2]` |
where `q_MS` is defined just as it currently is in the model. We use the visual ocean-age frequencies `n_MSage_obs` to do this, exactly as we've been doing all along:
```
[n_MSage0_obs, n_MSage1_obs] ~ multinomial([q_MS[1], q_MS[2]], n_MSage0_obs + n_MSage1_obs)
```
Conceptually, the conditional FW age probabilities and the marginal SW age probabilities are equivalent to the full Gilbert-Rich age probabilities that we need. E.g., `q_GR[1] = q_FW[1] * q_MS[1]`, and so on.
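A quick numerical sanity check of that identity in R, using made-up probabilities (not estimates from the model):

```r
# Toy Gilbert-Rich age probs, ordered (FW2.SW0, FW2.SW1, FW3.SW0, FW3.SW1)
q_GR <- c(0.15, 0.45, 0.10, 0.30)

# Marginal SW (ocean) age probs: the column totals
q_MS <- c(q_GR[1] + q_GR[3],  # P(SW = 0)
          q_GR[2] + q_GR[4])  # P(SW = 1)

# Conditional FW age probs within each SW age
q_FW <- c(q_GR[1] / q_MS[1],  # P(FW = 2 | SW = 0)
          q_GR[2] / q_MS[2],  # P(FW = 2 | SW = 1)
          q_GR[3] / q_MS[1],  # P(FW = 3 | SW = 0)
          q_GR[4] / q_MS[2])  # P(FW = 3 | SW = 1)

# Reconstruct the joint: q_GR[a] = q_FW[a] * q_MS[sw(a)]
q_GR_check <- q_FW * q_MS[c(1, 2, 1, 2)]
all.equal(q_GR, q_GR_check)  # TRUE
```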
Again, this is ONLY valid if the scale samples within each ocean age are unbiased w.r.t. smolt age.
Thoughts?
Sorry, I've been radio silent while going through the stages of grief. I think you've just saved me a couple final stages and articulated what I was trying and failing horribly to describe.....the SW scales should still be useful samples of their FW age composition even if SW scales are way more likely to be aged from 1 SW. The tables make sense after a walk-through. This could be fantastic. But, let me pass this by Scott and Josh for b.s. detection......I may be ignoring a fundamental sampling problem because I want so desperately for this to work. It seems like a brilliant work around.
Well, why didn't you just say so?!? :rofl: :laughing: :cry: :smile: :facepalm:
No, just kidding. This stuff is so much easier to hash out with a whiteboard. Another casualty of COVID. Glad I could save you depression and acceptance, at least. FWIW, I was feeling pretty bummed too, after all the work we've put into this and having just started making some real headway. I was hoping that somehow the visual age data would turn out to be bogus and the scales would be legit, because we could just drop the former and be fine. Was pretty excited when I hit upon this idea. Even better, I think it would be dead simple to implement. You could just pass in a logical flag `conditionGRonMS` that, if `TRUE`, would cause `q_GR` to be conditioned on the column totals as shown above instead of the usual unconditional probabilities. Then (I'm pretty sure, although I need to double-check this) the actual likelihood statement wouldn't even change.
I'm tempted to go code it up right now, but yes, we should definitely run this by @svulstek and Josh first. The thing that gives me pause is that I don't see how we could validate the assumption of random (-ish) sampling of FW age conditional on SW age; it'd be a leap of faith. But hopefully those guys have a sense of whether it's well-justified. Fingers crossed.
> Thoughts?
This sounds like a clever and opportunistic way of addressing this problem.
"There's such a fine line between stupid and clever."
It's on again. I just talked with the guys at the weir and they're confident that at least in the last decade there is no reason to think there's biased sampling of FW ages w/i SW ages.
OK, I got the new version with `conditionGRonMS = TRUE` working.
Most of the substantive results are...surprisingly not that different. See for yourself in the rendered HTML. The one major exception, of course, is that the fits to the spawner age-frequency data look a lot better and more sensible:
Alas, this doesn't do anything to cure the bad-to-very-bad Pareto ks that make the LOO estimates unusable, but you can't win 'em all. I'm just happy we were able to salvage the scale data and continue to fit the same underlying models!
Yes. Nice work. Those age data fits definitely look better. A relief. So, do we move on to model comparisons w/ covariates, or is that tenuous w/o a properly functioning LOO?
We can certainly go ahead and fit models with covariates, even if we don't yet have reliable machinery for doing model comparison based on out-of-sample predictive performance. The posteriors of the regression coefficients are also a valid way to look at effects; it's just asking a different question.
I may have to take a couple days off to get some "official" work done, though. :smile:
I'm hoping to make a push on this over the next few weeks. I'm feeling inspired and have a drop in my teaching load for awhile. My plan is to nuke everything and re-install R-studio, etc., with the hopes of evading UAS' synching and correlated issues for a few weeks. That seemed to work for awhile this summer. I could definitely use help formally describing the obs and process model likelihoods if anybody else feels similarly inspired (and flush with free time).
I can help with the model descriptions next week.
Hey Mark, did you have any luck finding time to work on the model descriptions? I hate to nag, but am feeling the pinch from NPRB. I think I have enough in the way of results to make them happy, but my model description attempts are pretty comical and likely dangerous.
Hey David. No, I didn't yet, but I can do so this weekend if you can wait that long.
Sounds great. Thanks, Mark.
Hey @DavidTallmon, I said this already via email, but I suspect I'm going to have to run point on translating `IPM_SMaS_np` from Stan to mathematical English. I might be able to devote some time to that later this week/end. (I mean, it's not like there's anything else going on, right?)
When I asked about NPRB, I didn't realize this work was already part of an existing proposal. We should talk about what that means, and the possibilities for dedicated funding; feel free to do that over email if that's better.
> My plan is to nuke everything and re-install R-studio, etc., with the hopes of evading UAS' synching and correlated issues for a few weeks.
Any success (or failure) on this front?
As @ebuhle demonstrated on 10/26, `shinystan()` is a convenient way to examine some of the diagnostics and summaries of model fits. However, we will likely want some additional R scripts for customizing our output. Eric and I have several options from other projects that we can draw upon.