This Issue has been edited extensively, in order to detail the process currently utilized to prepare for a release. You may wish to review it, since I don't believe direct edits of previously posted Issues send out update emails.
Release Doug Foxtrot was released on 3/30/2016. I think maybe it sent an email to all watchers, confirming its upload.
(from Doug)
The attached file provides my initial review comments on the 03 29 2016 R code and how it affects the Production by Life Stage and Run report. Start with the Readme work and focus on the text in red. I have some concerns regarding a few of the imputed catch numbers. After you have time to digest my spreadsheet, let's set up a phone call.
Production by Life Stage and Run report.zip
I need some feedback from Jason: I have not attempted to validate (1) the trap efficiency data or (2) the catch data on days when catch was not imputed. Doing that takes a LOT of effort. When I got the 06 17 2015 R code, it was processing those two kinds of data the way I expected. Because I have limited knowledge of how the code was modified in 2016, I don't know if Jason adjusted the code for those two items or if it has been static. Please give me a recommendation: do I need to go back and validate the non-imputed catch values and trap efficiency data, or can I safely assume the code for those items has not changed, so that I don't need to validate those kinds of data?
OVERVIEW: I have spent a decent amount of time investigating Doug's spreadsheet. His observations encompass three separate areas of investigation/commentary, comparing American River passage estimates for 2013, 2014, and 2015, from Jan 1 through Aug 1.
RESOLUTION: The easy answers to Doug's three areas of concern are
DETAILS:
Even though Doug tested three separate years' worth of data, I focused on 2013, if only because
The problem with the catch.png had been previously observed, and fixed. I think it bled back into the working set of R code because I was using two separate branches in GitHub at the same time: work focused on one set of updates was stored in one location, while work focused on a separate set of updates was stored in another. Together, the two branches held two complete (but slightly different) versions of the same exact files (the R programs). Storing work in this way is how GitHub operates. However, I have found that sometimes the work I intended for one version ends up in the other, and vice versa. This creates problems when I go to merge it all together and keep the wrong sections of code from each program. I'm sorry I didn't catch this before it went out the door. It has already been fixed, and I have resolved never to store programs in two branches (the GitHub term) in this way again.
I stress this was a stupid mistake on my part, and all the data necessary for plotting make it to this point in the code. I'm just plotting the wrong variable.
We have made a lot of changes to the data and process on which imputation depends. These include Issues #15, #48, #70, #71, #73 (not applicable to the American), #74, #76, and #77.
So generally, we should expect at least some type of change in the reported imputation numbers, when comparing the old versus the new.
To winnow down the possibilities (which include a possible error), I ran the 2013 production runs on both the old and the new sets of code. This emulates what Doug did in his worksheet "issue #1 code not ok" in the workbook attached above.
Keep in mind that models are fit on a per-trap basis, which compartmentalizes the imputation investigation to each independent `trapPositionID`. So, I focused first on trap 57003, whose data are in rows 211-234. This is the trap with the shortest duration; in theory, I assumed that finding the reason for the alleged discrepancy on March 4th and March 5th would be easiest for this trap.
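As an aside on that per-trap structure, here is a minimal R sketch of what "one model per `trapPositionID`" looks like; the data frame, column names, and the trivial intercept-only stand-in for the catch model are illustrative, not the production code.

```r
# Illustrative only: models are fit independently for each trapPositionID.
# Toy data and a trivial "model" stand in for the real catch data and spline fit.
set.seed(1)
catch.df <- data.frame(
  trapPositionID = rep(c(57001, 57003), each = 5),
  batchDate      = rep(seq(as.Date("2013-03-01"), by = "day", length.out = 5), 2),
  totalCatch     = rpois(10, lambda = 50)
)

fits <- lapply(split(catch.df, catch.df$trapPositionID), function(one.trap) {
  glm(totalCatch ~ 1, family = poisson, data = one.trap)  # each trap gets its own fit
})

names(fits)  # "57001" "57003" -- imputation questions can be examined trap by trap
```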
In the old code, as highlighted in red in cells F229 and F230, it appears that the spline is decreasing over this time period. In the new code, as highlighted in red in cells V229 and V230, it appears the spline is increasing over this time period. In both cases, the imputed numbers are of the same general order of magnitude as this trap's observed catch. Further, the change in the imputed catch over this one day is, in both cases, of approximately the same magnitude (a few hundred fish or so), even though the changes go in opposite directions. It is worth pointing out that for this trap, there are no strings of preceding or subsequent zeros. So, in this case, the modification made to fix the exploding confidence interval changes nothing in the data for this trap (and hence the statistics).
Given the sign change, I did the practical thing: I looked at the spline used to estimate the imputation. In the old code, I discovered that a quadratic (i.e., a parabola) is being fit to the catch data for trap 57003. In the new code, a line with slope is being used. (I say a "line with slope" to differentiate from a horizontal line, which has no slope.) So, a simpler model is being used to fit the data in the new code.
Recall that the process used to determine the complexity of the spline is iterative. It first (and always) fits an intercept-only model (which is the same thing as a line with no slope). Next, it fits a line with trend and, assuming success, calculates the so-called Akaike Information Criterion (AIC). Having also calculated the AIC for the model with no slope, it then compares the two. If the AIC of the more complex model (the linear model with slope) is lower than that of its comparatively simpler counterpart (the linear model with no slope) by more than 2 (the generally accepted value promoted in the literature), we conclude that the more complex model is worth keeping. Note that only the difference in AIC between two models matters: models dealing with great distances in space may have individually huge AICs, while models involving the organelles of cells may have minuscule AICs. Only the difference matters.
Practically, in fitting a model to the temporal trend of catch, we start with a horizontal intercept-only model. If the data support it, we add a linear trend. If the data support more, we add a quadratic term to the model. If the data support even more, we add a cubic trend to the model. Note that up to this point, we're just building a third-degree polynomial, term-by-term. If the data support even more than this, we start busting out cubic splines. In many cases (such as this example that covers a short time span), we never need to use splines.
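To illustrate the rule (and only the rule), here is a minimal R sketch of this forward selection on made-up daily catch counts. The Poisson GLM, the variable names, and the fake data are stand-ins; the production spline code differs in its model family, offsets, and basis functions.

```r
# Illustrative only: forward selection by AIC on made-up daily catch counts.
# This just demonstrates "keep the bigger model only if AIC drops by more than 2".
set.seed(1)
dat <- data.frame(day = 1:22, catch = rpois(22, lambda = 40 + 2 * (1:22)))

fits <- list(
  glm(catch ~ 1,            family = poisson, data = dat),  # intercept-only
  glm(catch ~ poly(day, 1), family = poisson, data = dat),  # linear trend
  glm(catch ~ poly(day, 2), family = poisson, data = dat),  # quadratic
  glm(catch ~ poly(day, 3), family = poisson, data = dat)   # cubic
)

best <- fits[[1]]
for (candidate in fits[-1]) {
  if (AIC(best) - AIC(candidate) > 2) best <- candidate else break
}
formula(best)  # the most complex model that beat its simpler predecessor by > 2 AIC units
```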
To make this concrete, this is how the model builds up to the quadratic used in the old code for trap 57003.
This is what the old code did.
Now, for the new code. Actually, the new code does the same exact thing in terms of the AIC differencing. This is a fundamental statistics thing, and is not unique to splines or GAMs or this project. What is different is all the Issues that have been implemented since the last round of code. But the important one is Issue #76: Modification of Knots / Degrees of Freedom Used in Spline Methodology.
Before, we let pure statistics dictate when a more complex model could be introduced in lieu of a simpler one. Recall that in Issue #74, I describe a short 11-day trapping instance in which the spline utilized was overly complex; in that case, the trend was estimated to go up, and then down, and then up, and then down, and so on. Statistically, this was the best model. Biologically, I don't think you could sell a practical reason as to why catch would ebb and flow like that over such a short period (except for random variation).
Additionally, keep in mind that, via Issue #71, these short trapping periods became more important, due to the incorporation of Connie's gap-in-fishing logic. Because gaps in fishing slice and dice catch periods, we suddenly saw several short trapping periods of only a few days, each of which started to receive its own catch model. The implementation of Issue #76 was in direct response to behavior that originated with Issue #74, which manifested due to Issue #71.
Practically, Issue #76 slows down the process by which a simpler spline model can graduate to a more complex one. The rule we came up with is that each new, more complex model must have at least an additional 15 data points to support it. This means that if a spline is fit to a trap that contains only 14 data points, it will always be an intercept-only model. Similarly, a trap with only 29 data points can never be a quadratic, which would require at least 30. Note that for many rivers, this means that for our spline models to conclude the existence of a parabola, we need at least 30 data points, i.e., 30 days, since many rivers collect trap data daily. For those that collect it twice a day, a parabola could be concluded over about 15 days.
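Here is a sketch of that cap, as I understand it, in R; the function name is hypothetical, and the production code applies the rule inside its spline-fitting routine rather than through a standalone helper.

```r
# Hypothetical helper illustrating the Issue #76 cap on model complexity.
max_model_order <- function(n_points, points_per_step = 15) {
  # 0 = intercept-only, 1 = linear, 2 = quadratic, 3 = cubic, 4+ = cubic splines w/ knots
  floor(n_points / points_per_step)
}

max_model_order(14)  # 0 -- intercept-only, no matter how good a trend looks
max_model_order(22)  # 1 -- trap 57003: a quadratic is never even considered
max_model_order(30)  # 2 -- the smallest count that can support a parabola
```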
And so, here it is. The number of data points for trap 57003 being fed to the spline is only 22. So, by design, a quadratic will never be fit here. The AICs above, from the old code, are almost exactly the same as they are for the new code. (I assume the small differences are due to the other changes, i.e., the other Issues.) This means that we're ignoring (in this case!) a strong statistical result in deference to what we like to think is practicality, as far as the biology is concerned. Given the large reduction in AIC between the linear and quadratic models (which the new code ignores), I can believe that settling for the statistically inferior linear model leads to relatively large swings in imputation, due to an overall poorer fit, when compared to before (which may or may not have been best).
I think we would all agree that if slow trending in catch occurs over time, we would expect to see more increases followed by decreases (or decreases followed by increases) as the amount of time fished grows. We roughly incorporate that by allowing more complex models only with increasing numbers of data points (which implies more time fished). But there is no reason we have to stick to 15 data points as the model-jump criterion. There is also no reason to keep it as a linear progression (although that is easier to program).
I'm confident this is the fundamental reason as to why the imputation numbers have changed. Note that in some cases, e.g., rows 84 and 87, the imputed numbers go down. We would expect that sometimes the imputed numbers would go up, and other times they would go down. Further, in general, given that 15-data-point threshold, we would expect that the spline models utilized in the new code are generally simpler than those used in the old code, where there was no such restriction. This is indeed the case. The following table itemizes the type of models used in the old and the new code. Note the two extra lines for trap 57005, due to the gap in fishing.
| Trap | Non-zero Data Points | Old Model | New Model |
|----------|----------------------|--------------------------|--------------------------|
| 57001 | 69 | Cubic Spline w/ 1 knot | Cubic Spline w/ 1 knot |
| 57002 | 41 | Cubic Spline w/ 3 knots | Quadratic |
| 57003 | 22 | Quadratic | Linear |
| 57004 | 30 | Cubic | Quadratic |
| 57005 | 11 | Cubic Spline w/ 1 knot | xxx |
| 57005 | 8 | xxx | Intercept-Only |
| 57005.01 | 3 | xxx | Intercept-Only |
I stress that the number 15 was chosen arbitrarily, as it's a "nice" number. If we decide we want to change it, I think it makes sense to determine the minimum (or maximum) number of data points (though days may be better?) needed to model a catch spike. At its simplest, a catch spike translates to a parabola, or quadratic model, and hence would give us a means by which to tune the other models' thresholds.
As noted, there is a gap in fishing for trap 57005. This changes passage a decent amount in this case, as suddenly, there are several days for this trap that are never fed to the program, and thus have no passage estimated. This is by design.
We have made many changes and updates; this version is now obsolete.
OVERVIEW: The task list for the current set of enhancements and bugs is growing short. It's time to start thinking about the to-do list for the next release.
RESOLUTION: Set up a list of things to do, so as to maintain timelines and expectations.
DETAILS: In order to ensure a smooth transition for the next release, the following needs to occur.
:white_check_mark: Complete the Big Looper on all reports. This includes reports with dependencies on updated queries and helper functions. COMPLETE -- 3/24/2016

a. :white_check_mark: Estimates production by life stage and run. COMPLETE -- 3/24/2016
b. :white_check_mark: Estimates production for ALL runs. COMPLETE -- 3/24/2016
c. :white_check_mark: View all catch records. COMPLETE -- 3/28/2016
d. :white_check_mark: Export non-Chinook catch records. COMPLETE -- 3/24/2016
e. :white_check_mark: Sum Chinook by date. COMPLETE -- 3/29/2016
f. :white_check_mark: Summarize releases. COMPLETE -- 3/24/2016
g. :white_check_mark: Plot fork length through season. COMPLETE -- 3/25/2016
h. :white_check_mark: Plot histograms of fork length (`lifeStage = FALSE`). COMPLETE -- 3/25/2016
i. :white_check_mark: Plot histograms of fork length (`lifeStage = TRUE`). COMPLETE -- 3/25/2016
j. :white_check_mark: Plot the weekly effort over time. COMPLETE -- 3/24/2016

:fish: Note that in running the Big Looper, both the "Plot fork length through season" and "Plot histograms of fork length" reports erred. This is due to data frame `catch.df` now having a column entitled `oldTrapPositionID`, due to our work with gaps in fishing, i.e., Issue #71. I have updated the code for each, in the appropriate spot, to incorporate this update since the last time these were checked via the Big Looper, i.e., the work detailed in Issues #30 and #31.

:fish: I had never run the "Sum Chinook by date" report in the Big Looper before, as we have never made any updates / tweaks to it since I've been here. When I ran it, however, it crashed continuously. I eventually figured out that once R is done running Connie's queries, it wasn't closing the connection to Access via a `close` statement. Adding this led to the reports completing as expected, over all iterations of the Big Looper. This minor update has been recorded via Issue #85.
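For reference, here is a minimal sketch of the kind of fix recorded in Issue #85, assuming an RODBC-style connection to the Access database; the path and query names below are placeholders, not the project's actual ones.

```r
# Hedged sketch of the Issue #85 fix: close the Access connection after the queries run.
library(RODBC)

ch <- odbcConnectAccess2007("C:/path/to/CAMP.mdb")              # placeholder path
chinook <- sqlQuery(ch, "SELECT * FROM SomePlatformQueryTable") # placeholder query
close(ch)  # the missing step: release the connection so repeated Big Looper runs succeed
```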
:white_check_mark: Checking results. COMPLETE -- 3/29/2016

a. :white_check_mark: Check passage estimates resulting from reports "Estimates production by life stage and run" and "Estimates production for ALL runs" for realistic numbers. COMPLETE -- 3/18/2016
b. :o: Compare estimates with external estimates. ONGOING -- 3/30/2016
   1. :o: American -- Having this check complete is not necessary for this beta release.
c. :white_check_mark: Upload reports to the `FTP` site, into folder `Big Looper Beta Release 4.5.1 Output` in folder `For Doug`. This is ideally done either early in the morning or on the weekend, when the `FTP` site generally lacks use. COMPLETE -- 3/30/2016
d. :white_check_mark: Upload all passage results for people to see / analyze. All passage estimates are contained in this Excel spreadsheet: allEstsCompare.xlsx. COMPLETE -- 3/28/2016
:white_check_mark: Clean the `R-Interface` folder of any old programs, files, etc. no longer needed. COMPLETE -- 3/17/2016

:white_check_mark: Clean the `Outputs` folder of any old testing output. COMPLETE -- 3/28/2016

:white_check_mark: Version Control in the Platform. See Issue #55. COMPLETE -- 3/28/2016

a. :white_check_mark: Place a `zip` file containing the release name in folder `R-Interface`, so the appropriate field ends up populated within the Platform. COMPLETE -- 3/17/2016
b. :white_check_mark: Check that the `zip` file is working correctly. COMPLETE -- 3/28/2016

:white_check_mark: Test, directly via the Platform, that all reports work as intended. All of the following `zip` files should have a `run_R.out` text file, particular to that Platform report run. COMPLETE -- 3/29/2016

a. :white_check_mark: Estimates production by life stage and run. COMPLETE -- 3/29/2016 by lifestage and run.zip
b. :white_check_mark: Estimates production for ALL runs. COMPLETE -- 3/29/2016
c. :white_check_mark: View all catch records. COMPLETE -- 3/29/2016 all catch records.zip
d. :white_check_mark: Export non-Chinook catch records. COMPLETE -- 3/29/2016 non-chinook records.zip
e. :white_check_mark: Sum Chinook by date. COMPLETE -- 3/29/2016 sum chinook.zip
f. :white_check_mark: Summarize releases. COMPLETE -- 3/29/2016
g. :white_check_mark: Plot fork length through season. COMPLETE -- 3/29/2016
h. :white_check_mark: Plot histograms of fork length. COMPLETE -- 3/29/2016
   :white_check_mark: `lifeStage = TRUE`. COMPLETE -- 3/29/2016
   a. :white_check_mark: fall. COMPLETE -- 3/29/2016 plot histograms -- fall -- lifestage=yes.zip
   b. :white_check_mark: late fall. COMPLETE -- 3/29/2016 plot histograms -- late fall -- lifestage=yes.zip
   c. :white_check_mark: winter. COMPLETE -- 3/29/2016 plot histograms -- winter -- lifestage=yes.zip
   d. :white_check_mark: spring. COMPLETE -- 3/29/2016 plot histograms -- spring -- lifestage=yes.zip
   :white_check_mark: `lifeStage = FALSE`. COMPLETE -- 3/29/2016
   a. :white_check_mark: fall. COMPLETE -- 3/29/2016 plot histograms -- fall -- lifestage=no.zip
   b. :white_check_mark: late fall. COMPLETE -- 3/29/2016 plot histograms -- late fall -- lifestage=no.zip
   c. :white_check_mark: winter. COMPLETE -- 3/29/2016 plot histograms -- winter -- lifestage=no.zip
   d. :white_check_mark: spring. COMPLETE -- 3/29/2016 plot histograms -- spring -- lifestage=no.zip
i. :white_check_mark: Plot the weekly effort over time. COMPLETE -- 3/29/2016 weekly effort.zip
j. :white_check_mark: Mokelumne BYPASS. COMPLETE -- 03/29/2016 Moke BYPASS Date 1.zip Moke BYPASS Date 2.zip
:fish: Platform checking discovered that I had failed to turn off helper function `accounting.R`, which I made to help me keep track of fish as I was doing the plus-counts update. The call to this function has been turned off in `lifestage_passage.R`, `run_passage.R`, and `est_catch.R`.

:fish: Platform checking revealed that I forgot to take out a `pred <<- pred` statement in the bootstrapping function. Syntax of this type pushes data frames created within a local function environment to the global environment. But it doesn't work if the first pass through doesn't create this `pred` data frame (due to no missing imputed catch values for that particular first trapping instance). It took a while to figure out, but in the end, I commented out the offending line.
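To show the failure mode (with hypothetical names; the real bootstrapping function is more involved):

```r
# Illustrative only: '<<-' copies a local object to an enclosing/global environment,
# but the right-hand side must already exist somewhere on the search path.
boot_one_trap <- function(has_imputed_catch) {
  if (has_imputed_catch) {
    pred <- data.frame(imputedCatch = rpois(3, 10))  # created only when imputation occurred
  }
  pred <<- pred  # errors with "object 'pred' not found" if nothing was imputed on a
                 # first pass, since no 'pred' exists locally or globally yet
  invisible(NULL)
}

try(boot_one_trap(FALSE))  # first trapping instance with no imputed catch: error
boot_one_trap(TRUE)        # works, and leaves 'pred' behind in the global environment
```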
:fish: I also rechecked the output requested a few months ago regarding the Mokelumne Woodbridge BYPASS plots. I found that in the interim, these reports had become broken, due to our updating of `get_catch_data.R` for either (or both of) the halfCone adjustments and gaps in fishing. In any case, I fixed that program so that these reports work again. I ran the "by lifestage and run" report for an RBDD year to make sure I didn't break things all over again with these small changes; the output appeared as expected.

… `master` branch of GitHub. COMPLETE 3/30/2016
… `master` on GitHub. SUSPENDED 3/24/2016
… `master` branch on GitHub. SUSPENDED 3/24/2016
… `Interim R Releases` on the `FTP` site. COMPLETE 3/30/2016

At this point, the updated R code is ready for release and for the next stage in the checking process.