tmcd82070 / CAMP_RST

R code for CAMP rotary screw trap platform

SOW 2 - Task 2.2: Extremely large upper CI estimates #15

Closed tmcd82070 closed 8 years ago

tmcd82070 commented 10 years ago

Sometimes, upper confidence intervals are essentially infinite. Need to come up with a reasonable number to report for the upper limits in these cases.

jasmyace commented 9 years ago

Trent suggests that the problem is possibly/probably due to the bootstrapping associated with the release data; specifically, super small capture probabilities p lead to correspondingly super high calculations of passage, since p enters the calculation as a denominator, thus blowing up the overall estimate.

Currently, it is believed that a bootstrap matrix, constructed during the bootstrapping process, and so probably in function F.bootstrap.passage, consists of 500 columns, with the number of days to bootstrap equal to the number of rows. (There exists one matrix for both the catch and the release.) In any case, within the matrix exist cells with low values, which are believed to be causing the issue.

One possible solution is to exclude values below some threshold -- this, however, requires defining what one means by 'small.' Another possibility involves transforming via a log and then back-transforming. A third, possibly more involved, solution could incorporate an informed prior to ensure that the resulting estimates of p don't blow up.

In any case, it is currently believed that this issue can be solved inherently within the code by a more judicious consideration of its inputs.
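
A toy illustration of the mechanism Trent describes (the numbers are made up, not CAMP output): passage is essentially catch divided by the estimated capture probability, so a single near-zero bootstrapped p produces an enormous passage value for that iteration, dragging the upper percentile of the bootstrap distribution toward infinity.

```r
# Illustrative only -- not CAMP code. A near-zero bootstrapped capture
# probability p inflates the passage calculation for that iteration.
catch   <- 40
p.draws <- c(0.10, 0.05, 0.001)   # hypothetical bootstrapped capture probabilities
catch / p.draws                   # 400, 800, 40000 -- the last draw "blows up"
```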

jasmyace commented 9 years ago

Address and resolve the issue of out-of-range upper confidence intervals in the CSV files associated with the production reports.

jasmyace commented 9 years ago

Result of Inquiry: Initial investigation of this behavior focuses on daily estimates of the American River, for the Winter run, covering 1/16/2013 - 6/8/2013. For this run (a result of the big looping program run back in May), I observed that the right confidence limit for 4/25/2013 was 4.90 * 10^271. Working backwards from this number, I concluded that even though this behavior is definitely local, i.e., specific to day 4/25, its cause is global, affecting all estimates throughout the requested time period.

Essentially, the inclusion of all days from 1/16-6/8 in the estimation of the model leads to this behavior; in particular, the fact that none of the five traps in this period caught any fish after 3/30 drives it. The model, observing a very strong signal of zero fish for the 62 days between 3/30 and 6/1, ends up including a quartic term in the underlying spline, which is fit on the log scale. This quartic term, whose beta estimate and associated standard deviation are two orders of magnitude larger than the others in the model, overpowers all random estimates arising from the bootstrapping procedure -- this in turn leads to some confidence interval estimates that blow up when the estimates, calculated on the log scale, are exponentiated back to the original scale of observed fish. It should be noted that the model is correct, in that it is estimating, and reporting, the best spline based on the conditions we provided.
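
The following sketch (purely illustrative numbers, not the actual American River fit) shows how a single high-order coefficient with a huge standard deviation dominates once log-scale draws are exponentiated and summed into passage:

```r
# Illustrative only: a naive parametric bootstrap in which the last (quartic)
# coefficient has a standard error orders of magnitude larger than the rest.
set.seed(1)
beta.hat <- c(3.0, -0.5, 0.2, -0.1, 4.7)
beta.se  <- c(0.2,  0.1, 0.1,  0.1, 250)        # the quartic term's SE dwarfs the others
X <- cbind(1, poly(1:144, 4))                   # 144 days, quartic basis on the log scale
boot.passage <- replicate(500, {
  b <- rnorm(length(beta.hat), mean = beta.hat, sd = beta.se)
  sum(exp(X %*% b))                             # back-transform, sum daily passage
})
quantile(boot.passage, c(0.025, 0.975))         # the upper limit is astronomically large
```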

Proposed Solution: The fact that the biologists recorded so many days of zero fish is useful information; however, we suspect that truncating the data provided to the model, by deleting superfluous zeros on both ends of the observed data, should lead to more parsimonious models. Since the temporal ramp-up of fish into traps at the beginning of a season may not necessarily equal the corresponding ramp-down of fish at its end, we propose truncating any string of zeros at the start of the observation period, if present, by x days, and doing the same at the end by y days. In this way, head- and tail-end zeros should not contribute overwhelmingly to the model fit.

Dependencies: It has been observed -- especially via Task 2.4 -- that sometimes the model estimates non-zero fish when several of both the preceding and subsequent days have zero observed fish. The implementation of this proposed solution should mitigate this concern for all days in which the zero truncation has been applied. For example, in the run described above, trap 57004, on day 5/27, results in a final passage estimate of 18 fish. This results from an imputed value of 0.2 (ish -- this has been rounded, and feeds into Task 2.5 I believe) being divided by the estimated efficiency on this day of 0.0113. Given that the imputed fish for these days will be manually set to zero, resulting passage for these days will automatically be zero.
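For reference, that expansion is just the imputed catch divided by the estimated efficiency; with the rounded values quoted above:

```r
0.2 / 0.0113   # ~17.7; roughly the reported passage of 18 fish (the 0.2 is rounded)
```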

Possible Issues:

  1. Each trap could have different head- and tail-zero behavior. So, truncation will probably have to be developed for each trap separately. Given that models currently estimate on a per-trap basis, this shouldn't be a large issue.
  2. In the example cited here, efficiency trials occurred in the tail-zero region for which a truncation procedure is proposed. This shouldn't cause any issues, as the model results will have blanked zeros inserted for when the model would have provided zeros anyway, i.e., nothing should really change. In theory.
  3. Confidence intervals for zero-truncated days will necessarily be non-existent, or (0,0), as the models will not estimate anything on these days. This also means that the models will be given less data, leading to (slightly) wider confidence intervals than previously experienced.
  4. While the implementation of a truncation procedure should definitely reduce the manifestation of this issue, it cannot guarantee its total obliteration. We will need to investigate the rate at which the problem is reduced. For example, given the big looping program, it should be possible to identify, via the CSV output, the number of times exploded confidence intervals occur, before and after the suggested implementation. I think we will only need to examine the daily runs, since weekly, monthly, and annual runs are simply rollups of estimates obtained on individual days.
  5. It is unknown how total passage estimates will change as a result of this implementation. In theory, large rivers with large runs should only see small changes, as the exclusion of truncated zeros should only change passage on a small scale at the tail end of runs -- when passage is already relatively small. For example, the estimate included here for day 5/27 and trap 57004 calculated a one-day passage of 18 fish. This is small compared to the millions of fish commonly estimated on big rivers and runs. For smaller runs such as the Winter run included here, however, the suggested change may result in a noticeable difference.
  6. While this implementation should solve this particular reason as to why the confidence intervals sometimes explode, it doesn't rule out the possibility that other reasons can lead to this behavior. The checking/quantification procedure outlined in Issue 4 should help to gauge this.

Attachments:
  ModelInvestigations.xlsx
  modelestimatesr
  day_american river_american river at watt avenue_2013-01-16_2013-06-08winter_catch
  day_american river_american river at watt avenue_2013-01-16_2013-06-08winter_eff
  day_american river_american river at watt avenue_2013-01-16_2013-06-08winter_passage

jasmyace commented 8 years ago

GENERAL UPDATE I have successfully implemented a proposed solution, with promising results. Basically, I truncated leading and trailing zeros from sequences of observed fish, on a per-trap basis. In the (small) trial run with which I’ve been working, the five instances of exploding confidence intervals, as reported via the passage table output, have been resolved. As alluded to in a previous update regarding this issue, the exploding confidence intervals are intimately tied to the beta-coefficient weights resulting from fitting polynomial splines on the log scale, where such fits include strings of zeros.

SPECIFICS I first ran weekly passage estimates for the American, from 1/16/2013 through 6/8/2013. From this run, I found that five weeks, in the Winter run, had exploding confidence intervals. These were Julian weeks 14, 15, 17, 19, and 20. I then amended the code to throw out leading and trailing strings of zeros. This means that, for each trap, starting on 1/16, I searched all daily-ish fishing results, proceeding forward in time, on a per-day (actually, a per-day, per-trapping-instance) basis, until I found the first instance of caught fish, at which point I stopped. Given that first found fish, I then deleted all of the zero and Not Fishing instances temporally before this trapping instance. I then did the same for the tail end of the fishing period, starting on 6/8 and going back in time, per daily fishing instance, throwing out any fishing results with zero fish, along with Not Fishing, and stopping on the first day (or fishing period), when going backwards in time, on which at least one fish was caught. In both directions, just one unassigned caught fish was enough to stop. In this way, I trimmed the trapping sequence so as to ensure that the “first” and “last” trapping instances caught at least one fish. Strings of zeros located between this new “first” and “last” fishing day were not amended in any way. Zero strings of this nature are more likely to occur (for this river) for non-Fall fish.
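
For concreteness, here is a minimal sketch of that per-trap trimming. The data frame and column names (catch.df, trapPositionID, batchDate, totalCatch) are illustrative, not the platform's actual objects; totalCatch here would include unassigned fish, and Not Fishing rows carry NA.

```r
# Keep each trap's records from its first through its last non-zero catch;
# leading/trailing zeros and Not Fishing rows outside that window are dropped,
# while interior zero strings are left alone.
trim.zero.tails <- function(d) {
  caught <- which(!is.na(d$totalCatch) & d$totalCatch > 0)
  if (length(caught) == 0) return(d[0, ])   # trap never caught a fish
  d[min(caught):max(caught), ]
}
catch.df <- catch.df[order(catch.df$trapPositionID, catch.df$batchDate), ]
catch.trimmed <- do.call(rbind,
                         lapply(split(catch.df, catch.df$trapPositionID),
                                trim.zero.tails))
```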

THINGS TO NOTE

  1. Base Tables: Truncating the leading and trailing zeros leads to Base Tables that only report fish data starting on the first, and ending on the last, non-zero week. Due to 2013 being one of those years for which the American gear traps were switched out, this impacts the number of days reported per trap in this table; it necessarily also impacts the number of Julian weeks reported in the Summary tables. It’s important to note that the weeks that have been removed would have simply reported zero for passage, so no data/fish have been lost. In fact, all fish accounting works as before.
  2. Spline Behavior: The figure below demonstrates the effect of removing leading and trailing zeros, in an effort to remedy exploding confidence intervals. In this case, the graph on the left depicts the situation as it frequently happens. A pulse of fish is first observed, leading to high catch volume. Quickly however, that pulse tends to peter out, leading to a quick decline in observed fish, which is then followed by a string of zeros, and possibly missing data. Here, due to the long string of observed zeros, the red spline inflects itself in order to respect the string of zeros. Note that for the two imputed days after the pulse of fish, fish are estimated.
    On the right, the zero days after the pulse have been removed, in line with the algorithm described above. For these days, the imputation no longer imputes, because these zero days are no longer considered as fishing periods. Said another way, the imputation for this string of zero days is zero. This behavior leads to a change in the total number of estimated fish.

image

EXAMPLE RESULTS I have uploaded the American run described above, both before and after the zeros adjustment. Observe the following characteristics between both sets of data. Note further that I’ve amended the “Original” data sequence by preceding that output with a capital “O.” This allows you to open up both the original and updated output, which otherwise would not be possible, due to Excel not liking both files having the exact same name.

  1. As a first check, I opened up all 8 summary files, for the 4 runs, over both the original, and the updated code runs. I then arranged these in Excel, as seen below.

image

I observed the following:

a. Changing the leading and trailing zeros should only really change, possibly, the estimated number of fish at the start and the end of a run with ample amounts of data. So, I looked at the Fall run. Before the update, there were 5,544,853 fish; after the update, there were 5,544,853 fish. Looking more closely over each Julian week, I find that all weeks agree. The perfect concordance here suggests that there were no leading or trailing zeros, for any of the traps, for this run.

b. In comparing the Late Fall runs, you can see how dropping zeros affects estimates. Before, we estimated a passage of 1,714 fish; now, we estimate 1,083 fish. The decrease is more or less due to the situation described above, where imputation in a leading or trailing zero-run region is no longer allowed. This is perhaps most readily apparent in the catch pngs for this run. In the image below, the updated catch estimates are on the right, while the older are on the left. Note that for trap South Channel 5, only 1 fish was ever observed -- this is apparent from the purple dot corresponding to 4/13. This one fish, on this one day, leads to a passage expansion estimate of 1,206 fish for this trap. Previously, we imputed for Not Fishing periods during this trap's leading zero run (i.e., before 4/13), as seen by the horizontal line of imputations straddling 4/1 in the graph on the left. These imputations led to a total passage estimate of 3,505 fish for this trap. Now, since we threw out the leading-zero run, the imputation over those several days is no longer allowed, leading to the lower passage estimate of 1,083.

image

OTHER

  1. Implementing this procedure would somewhat help resolve Task 2.4, as it deals with the zero-fish periods before and after the period of caught fish.
  2. This procedure makes no guarantee of what happens when a string of zeros occurs inside a fishing period. However, if a string of zeros does occur for a rather lengthy period of time inside a fishing period, I wonder if it would be worthwhile to consider a procedure where we just perform two separate passage estimates, over two separate runs of the Platform?
  3. Other csvs included in the Updated folder of the zip file I will send out via regular e-mail include csvs with a suffix involving a trapPositionID. These are selectively output, and compare the beta coefficients that occur from fitting a spline, based on keeping so many of the preceding, and so many of the trailing, zeros. I can explain these more at a later time.
jasmyace commented 8 years ago

This update involved changes to the following programs:

  1. est_catch.r
     a. lines 40-45: house the results of the zero buffering
     b. lines 64-109: perform the buffering analysis
     c. lines 142-148: pull out the allDates dataframe to identify valid non-zero date ranges per trap
     d. line 338: add allDates to the final list, so it's output and available for use by other functions
  2. est_passage.r (see the sketch below)
     a. line 210: generalize the calculation of passage by forcing zeros in the case of NA
     b. lines 105-113: merge in the redefined timeframes of non-zero catch from allDates
  3. est.efficiency.r
     a. line 33: add beginning and end dates to the dataframe of interest

Turning off these updates involves commenting out these sections of code, and perhaps turning back on their previous iterations. The places where this occurs are obvious. Other options include not looping over all zero combinations in the buffer analysis of program est_catch, as well as turning off the output of CSVs from said analyses, in the same general spot in the code.
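
To make item 2 above concrete, here is a hedged sketch, not the actual est_passage.r code, of the two ideas: merge each trap's valid non-zero date range from allDates, then force passage to zero outside that range rather than letting NAs propagate through catch / efficiency. The column names begin.date and end.date are assumptions.

```r
# Merge the per-trap valid date range, then zero out passage for dates that
# fall outside it (or that would otherwise be NA).
passage.df <- merge(passage.df, allDates, by = "trapPositionID", all.x = TRUE)
outside <- passage.df$batchDate < passage.df$begin.date |
           passage.df$batchDate > passage.df$end.date
passage.df$passage <- ifelse(outside | is.na(passage.df$passage),
                             0, passage.df$passage)
```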

ConnieShannon commented 8 years ago

Under the section THINGS TO NOTE above, you indicate that the base tables, and most likely the headers on all products such as the CSVs, will not include the actual trapping period.

If this solution is adopted, it might be a good idea to indicate in the products the true trapping period and the dates fish were captured during that period. For example: "Trapping at the American River began [add first sample date here] and ended [last sample date here], and fish were caught from [first day Chinook were caught] through [last date Chinook were caught]." This will answer the question: is the production estimate lower because the period of trapping was shorter?

Also, the Gateway trap on the Feather River was pulled for nearly two months during the middle of the season in 2015, and the analysis did fail. After Doug and I discussed it, we recommended that the program develop two production estimates: one for the first part of trapping and one for the second. We felt this was appropriate, and so did the biologist. This is what you suggested above, and it was a solution in this case.

tmcd82070 commented 8 years ago

Action: Jason to test whether truncating leading and trailing zeros solves the infinite confidence limits at all sites.

Trent likes Connie's suggestion of reporting the beginning and ending of the season (assuming this solution works at all sites and is implemented).

To Discuss (perhaps this has been discussed): Should the routines fail when a trap is pulled for 2 months in the middle of a season, or should they produce a result? Ideally, we would "fail nicely" by issuing a warning or something when there is a gap > X days in the middle of a season.
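
A hedged sketch of what "failing nicely" could look like; the function name and the 60-day threshold are illustrative, not existing code.

```r
# Warn, rather than error, when a trap's fishing record has a long mid-season gap.
check.fishing.gap <- function(dates, max.gap.days = 60) {
  d <- sort(unique(as.Date(dates)))
  if (length(d) < 2) return(invisible(0))
  gaps <- as.numeric(diff(d))                 # days between consecutive fishing dates
  if (max(gaps) > max.gap.days) {
    warning(sprintf("Gap of %g days in the fishing record; consider two separate passage estimates.",
                    max(gaps)))
  }
  invisible(max(gaps))
}
```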

jasmyace commented 8 years ago

As part of the next stage of work on this Task, I was asked to create a complementary plot of the actual spline used within the passage estimation procedure. Keep in mind that each trap is tied to its own unique spline. I think these plots will be useful for a variety of applications, so I spent some time to make them pretty, and relatively clear.

Here is an example of what I've come up with, for the RBDD, 10/1/2012 - 9/30/2013, trap Gate 3, Fall run.

week_sacramento river_rbdd rst_2012-10-01_2013-09-30_gate 3_spline

The blue line is the actual spline. Keep in mind that all the actual statistics take place on the log scale. This means observed counts of fish are log-transformed during the analysis, but, at the very end, the results are put back on the natural scale. Practically, this means that the blue curve here is a function of the form y = e^(a polynomial). In this case, the polynomial is of the sixth degree. The exponentiation ensures that all estimated (imputed) fish counts are greater than zero. The use of exponentiation is also suggested by statistical theory.
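
Illustrative only (this captures the general form of the curve, not the platform's actual fitting and model-selection routine): a sixth-degree polynomial in the day index fit on the log scale via a Poisson GLM, then back-transformed so every predicted count is positive. 'daily.catch' is an assumed vector of one trap's total fish per day.

```r
day  <- seq_along(daily.catch)
fit  <- glm(daily.catch ~ poly(day, 6), family = poisson)
blue <- predict(fit, type = "response")   # y = e^(sixth-degree polynomial), the blue line
```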

The y-axis represents the true count of fish found in a trap. In other words, the y-axis here corresponds, on a per-day basis, to the sum of the assignedCatch and unassignedCatch, as reported in the csv baseTable. For the four traps operating during this trapping year on the RBDD, my fish count corresponds to what is eventually reported via the baseTable. The plot here is only for Gate 3 however. Keep in mind that this particular plot doesn't care about assignedCatch vs. unassignedCatch, so that particular differentiation doesn't manifest here.

Note that in addition to the blue exponentiated spline, there are many circles/squares. All days have at least one circle or square. This is because we either catch at least one fish (or record a zero), when trap data are collapsed to a day, or we impute some number (maybe zero) on days when the trap wasn't operating. You know this because the csv baseTable never skips a day.

In trapping, sometimes the trap is set to be half-cone, and sometimes it's full-cone. These are differentiated within the spline plot via color; half-cone days are red and full-cone days are black. Days involving multiple trapping records are colored red if at least one of those trapping instances was half-cone.

Days on which no imputation was required, i.e., the trap was running the full 24-hour period, are circles. If, for some reason, catch needed to be imputed for a portion of a day, that day, in addition to the observed catch seen via the circle, also has a square. These are easily identifiable by the vertical line connecting the two.

On January 9th, you can see that a relatively small number of caught fish (via the pink circle at y=384) has a vertical line connecting it to a red square at y=2152. This suggests, for this day, that the trap operated for only a short time, leading to a large imputation. In this short time, however, the trap caught more fish than it otherwise should have, on average. This is why the imputed value sits so far above the exponentiated spline.

On January 8th, you can see a red dot, with no vertical line. This means that the trap was fully functional that day. So, no imputation occurred. On this day, the trap caught y=361 fish. This y value is less than the spline, which means it was a slow fish day. But the trap was fishing the whole day, so it is what it is. Note that the red dot indicates that this was a half-cone operation.

Sometimes, an entire day receives an imputed value. These are identifiable by white squares outlined in blue. Usually, these fall squarely (heh) on the spline, but there are a few days where they don't. I haven't investigated why this sometimes happens, but feel comfortable leaving it alone, since the numbers being fed into the plot agree with the imputed values being reported on the baseTable for those days. I can figure the reason out later.

These plots will help us to visualize how the spline changes once we delete (in this case) all of those black-circle leading zeros, in an effort to correct the exploding confidence interval problem. This will change how the spline chooses to curve upward starting at the beginning of December. Analytically, this is the same as investigating how the inflection point temporally changes as a function of removing leading (and trailing) zeros.

These plots will also help to show us how the spline changes when all those pink dots of caught half-cone fish are multiplied by two. Practically, in this case, it means the spline will spike quite a bit more in the middle of the plot, as all of those pink circles double in height. I had always assumed that half-cone operations were a once-in-a-while thing; however, this plot suggests that when the fish pulse begins, the trap is basically left half-covered, so as to lessen the burden of counting all those fish. This will change the spline not only in the red portions, but in the neighborhoods where the black transitions to red, and vice versa. So, the next step will be to actually run the numbers, and see how they look and compare.
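
A sketch of that "times two" adjustment (the halfCone and totalCatch column names are assumptions): the observed catch on half-cone days is doubled before the spline is refit.

```r
adj.catch <- ifelse(baseTable$halfCone == "Half-cone",
                    2 * baseTable$totalCatch, baseTable$totalCatch)
```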

I also find it interesting that partial-day imputation, in this case, seems to cluster in the middle of January, but that's not really pertinent to anything we're dealing with now.

dougthreloff commented 8 years ago

Jason:

Thanks for the extensive write up on task 2.2.

I think you said you would run the Big Loop, which would develop production estimates for several runs and watersheds? If that is the case, and your review of the summary Excel spreadsheet shows that all the out-of-range confidence intervals no longer exist, then I will assume you have addressed task 2.2. If your review doesn’t show that all the out-of-range confidence intervals have gone away, I will assume you still have more work to do.

I remember a few cases for salmon like winter-run having an imputed value >0 even when no fish of that run had previously been caught in a season, and no actual fish for that run were caught for several days or even weeks after the imputed >0 value. I will be curious to see if the R code revisions for this task result in some imputed values for salmon runs with very low catches of fish going to 0 when there are not adjoining actual catches.

As we discussed on the phone ~ 2 weeks ago, I will need to let the biologists know your R code revision could result in changes to their previous production estimates.

Doug

jasmyace commented 8 years ago

I have completed my initial investigation of how the proposed solution of truncating zeros before the start and after the end of true catches affects confidence interval estimates. As a start, I ran the Big Looper code, for the two time frames (loops) on the American that comprise its share of what is checked via the Big Looper. For each of those two loops, I ran both the run.passage code (ALL runs) and the lifestage.passage code (by lifeStage and run). Finally, for the run.passage code, I ran the code for each of the day, week, month, and year timeframes. While the only differences between the four temporal timeframes happen at the end of the code, when passage estimates by day are rolled up into the requested timeframe, each ends up with bootstrapped confidence limits. So, since it's the confidence intervals that are the focus of this task, I cast as wide a net as possible.

Given the above, bootstrapped confidence limits arise from three sources:

  1. The final report from a lifeStage passage request, entitled xxxx_lifestage_passage_table.csv;

image

  2. The final run- and timeframe-specific reports from an ALL runs passage request, entitled xxxx_[run]_passage_table.csv;

image

  3. The final report from an ALL runs run, entitled run_passage_table.csv.

image

Given this setup, I then ran all of the code just described two times: the first, without the zero deletions, and the second, with the deletions. It should be noted that both of these incorporate the "times two" adjustment for half-cone, although for several rivers, that adjustment doesn't come into play.

For both the before and after, I compared the right-hand confidence limits to the estimates they were bounding. Given those comparisons, I then took a look at how they changed, from the before, to the after. In each case where I identified a bound that was too high, the zero-deletion fixed the problem. Note that I am glossing over some details here.

Next, given that the zero deletions ultimately change the underlying spline, I compared the passage estimates between the before and after. In all cases, the passage estimates remained more or less the same. The only cases in which the passage estimate changed by a lot were those in which it was originally small. For example, in one case (Winter run 6/8/2013), the passage estimate was 1 before the zero deletions. (Keep in mind there aren't many Winter fish on the American -- running an ALL runs by day only found one day on which Winter fish were caught.) After the zero deletions, the passage estimate jumped to 12. While a 12-fold increase, the relative change in raw fish is small.

I'm obscuring a lot of details here...let me know if you want to see more. Currently, I'm running the same as described above for the American, on the RBDD, for each of day, week, month, and year, for before and after, for both the ALL runs, and the lifeStage reports. The time frame for each is set for the calendar year from December 1 - November 30.

When the code run is complete, I'll do the same type of before and after comparison as described above for each of the bootstrapped confidence limits. Given that running the passage code this much results in a lot of output, we should have a large sample of confidence intervals to compare the before with the after, and then conclude definitively how well this proposed solution worked out.
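
As a hedged sketch of that kind of before/after screening (the column names passage and upperCL, and the cutoff ratio, are illustrative rather than the actual criteria used):

```r
# Flag rows whose upper confidence limit is implausibly far above the estimate.
flag.exploded <- function(passage.csv, ratio.cut = 100) {
  x <- read.csv(passage.csv)
  bad <- !is.na(x$passage) & x$passage > 0 & (x$upperCL / x$passage) > ratio.cut
  x[bad, ]
}
```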

jasmyace commented 8 years ago

In inspecting the 37,636 distinct confidence intervals resulting from all of the non-zero passage estimates obtained over the 10,758 pieces of output from the Super Big Looper, only 4 still have "out-of-line" confidence intervals, using the criteria I previously used in coming up with a solution.

The 4 outliers are due to the weird RBDD estimates that I noted in a different Issue and via a conference call. So, assuming the gap in fishing remedies the passage estimates, these last 4 remaining intervals should come in line.

So, I will now close the Issue.

tmcd82070 commented 8 years ago

Awesome. Fantastic. Nice work. I agree, close this.
