tmcd82070 / CAMP_RST

R code for CAMP rotary screw trap platform

Gaps in Fishing and Their Relationship to Passage Estimates #71

Closed jasmyace closed 8 years ago

jasmyace commented 8 years ago

OVERVIEW: Sometimes, when calculating passage estimates for a given time frame and river, the passage estimate turns out to be absurdly large. This is readily attributable to large gaps in fishing.

RESOLUTION: Identify "large" gaps in fishing as those in excess of 14 days. Break up splines to respect these large gaps, renaming trapPositionIDs, if necessary, in order to differentiate broken-up temporal sequences of catch originating from the same trap.

DETAILS: As an example, consider the results of estimating annual fall passage on the RBDD, utilizing years that start on 12/1 and end on 11/30. In this example spanning 14 distinct years (and thus 14 distinct estimates), several of the yearly estimates appear markedly out of line.

image

In this spreadsheet, each row represents a particular report from the Platform. Here, each row of data originates from the ALL Runs report, with estimates generated via the year option. Additionally, columns min.date and max.date itemize the passage year of interest, for that row. The run column demonstrates that these data originate from the Fall run. The other columns before column K don't matter, or are self-explanatory. Other passage results from other runs, e.g., Spring, do not appear here.

Columns L and M itemize the lower and upper confidence bounds associated with the passage estimate in column K. The other columns after column M are artifacts from the confidence-interval comparison process, and in this case, can be ignored. Note that in the passage year spanning 12/1/2004 - 11/30/2005, we are estimating 418 trillion fish!!! Also observe that the confidence intervals here appear to not be "fixed," as explained via a different Issue. Given the clearly erroneous passage estimate, the corresponding confidence interval should not be expected to adhere to normal expectations.

In investigating the behavior underlying the generation of these estimates, it appears that gaps in fishing are the culprit. For example, in year 2004 (so the estimate spanning 12/1/2003 - 11/30/2004), Gate 7 experienced a gap in fishing of approximately 3 months. (Note here, via the spline plot, that this gate experienced two gaps in fishing [as seen by the two long temporal sequences of blue open squares]; however, only the first one seems to be affecting the catch estimates negatively.)

run 39--year_sacramento river_rbdd rst_2003-12-01_2004-11-30_gate 7_fall_all lifestages_spline

To "turn off" the spline over "large" gaps in fishing, it has been decided that we will identify a "large" gap by any period of "Not fishing" of duration greater than 14 days. To emphasize that 14 days is somewhat arbitrarily chosen, we will allow this to be a global variable within the R code, so that it can be changed at a moment's notice. We will also investigate the distribution of gap lengths over several rivers, to see if the data suggest a different, more data-practical cutpoint.
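As a rough illustration of the cutpoint idea (in Python rather than the Platform's R code), the check below flags a "Not fishing" period as "large" when it exceeds a module-level threshold. The names `GAP_THRESHOLD_DAYS` and `is_large_gap` are hypothetical and do not come from the CAMP code.

```python
from datetime import datetime, timedelta

# Hypothetical global; 14 days is the starting value proposed in this issue.
GAP_THRESHOLD_DAYS = 14


def is_large_gap(start, end, threshold_days=None):
    """Return True when a "Not fishing" period is longer than the threshold.

    The global is read at call time, so changing GAP_THRESHOLD_DAYS
    takes effect without touching any caller.
    """
    if threshold_days is None:
        threshold_days = GAP_THRESHOLD_DAYS
    return (end - start) > timedelta(days=threshold_days)


# A roughly three-month gap is flagged; a ten-day gap is not.
three_months = is_large_gap(datetime(2004, 1, 1), datetime(2004, 4, 1))
ten_days = is_large_gap(datetime(2004, 1, 1), datetime(2004, 1, 11))
```

Keeping the threshold in one place means a different, more data-driven cutpoint can be swapped in once the gap-length distributions have been examined.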

jasmyace commented 8 years ago

I have worked to create some plots / estimates of distributions regarding gaps in fishing. As usual, there are many details to consider.

To create a data sample for each river, I ran Connie's catch-query sequence with dates of "1980-01-01" through "2016-03-02". This means I disabled the criterion in the R code that forces a Platform user to restrict their investigations to be less than or equal to 365 days. I chose 1980 to be sure I got all possible valid fishing ever. Because I get all valid fishing, I can be assured of also getting all valid "Not fishing," since inserting these into the catch-trap sequences is the whole point of Connie's queries.

So, I pulled these query results from each of these river, siteID, species, and start- and end-date combinations:

American River 57000 161980 '1980-01-01' '2016-03-02'
Feather River 3000 161980 '1980-01-01' '2016-03-02'
Feather River 52000 161980 '1980-01-01' '2016-03-02'
Feather River 5000 161980 '1980-01-01' '2016-03-02'
Feather River 4000 161980 '1980-01-01' '2016-03-02'
Feather River 2000 161980 '1980-01-01' '2016-03-02'
Feather River 6000 161980 '1980-01-01' '2016-03-02'
Sacramento River 42000 161980 '1980-01-01' '2016-03-02'
Stanislaus River 1000 161980 '1980-01-01' '2016-03-02'
Mokelumne River 34000 161980 '1980-01-01' '2016-03-02'
Knight's Landing 63000 161980 '1980-01-01' '2016-03-02'

These river and siteID combinations originate from the Big Looper. I specifically avoided using the start and end dates tied to queries in the Big Looper since a biologist could easily choose dates other than the ones we use there.

Note that given a trap, Connie's query doesn't put in a leading "Not fishing," nor a trailing "Not fishing." This means that if a trap started fishing for the first-time ever on "2010-01-01," no "Not fishing" record is inserted for this trap to cover the period from 1980 through 2010, when this trap was (also) arguably not fishing. "Not fishing" instances are only inserted on the "inside" of a fishing sequence; i.e., the first and last record of a catch query associated with a trap must constitute valid fishing.

Connie has designed her queries to throw out any "Not fishing" of duration less than 30 minutes. This is probably in order to exclude those small fractional downtimes when the biologists are swapping out the traps, etc. I suspect there are a fair number of these, and so I support their exclusion. This means that gaps in fishing, when defined in minutes, are always at least 30. It turns out they can sometimes be much greater than 30 minutes.

The largest gap in fishing covers 2,289,976 minutes, or 1,590 days, and occurs on the RBDD, Gate 8. There is nothing abnormal with this value. It simply suggests that the biologists at the RBDD rarely position their traps at Gate 8, either because it's difficult to get to, or they decided to only sample there when the budget was flush, or whatever. So, somewhere in the temporal fishing sequence for Gate 8, there was a gap of a little more than 4 years between one fishing instance at Gate 8 and the next. Practically, a biologist would never have to worry about a gap in fishing this big, because we force them to select less than or equal to 365 days. So, I threw out all SampleDays (equal to Connie's SampleMinutes / 60 / 24) that were greater than 365. The resulting set of data covered the entire universe of "Not fishing" of duration less than or equal to 365 days, and included up to 38 traps.
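The conversion and the 365-day filter can be sketched as follows (Python, for illustration only). The 2,289,976-minute value is the Gate 8 gap quoted above; the other gap values and all names are made up for the example.

```python
MINUTES_PER_DAY = 60 * 24
MAX_GAP_DAYS = 365  # mirrors the 365-day limit placed on Platform users


def sample_days(sample_minutes):
    """Convert a gap length in minutes to days (SampleMinutes / 60 / 24)."""
    return sample_minutes / MINUTES_PER_DAY


# Illustrative gap lengths in minutes; only the last one comes from the thread.
gaps_minutes = [45, 2880, 525600, 2289976]

# Keep only gaps of 365 days or less.
kept = [m for m in gaps_minutes if sample_days(m) <= MAX_GAP_DAYS]

# Whole days in the largest gap.
largest_gap_whole_days = int(sample_days(2289976))
```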

Based on the resulting data, I made a series of plots for each trap. To start with a "nice" example, consider North Channel 8.1 on the American.

North Channel 8.1 on the American

image

Here, the first "plot" on the left identifies the river, the site, and the trap name.

The second plot is a histogram of all the periods of "Not fishing." The maximum value possible here, in minutes, is 365 times 24 times 60 = 525,600, since I restricted the data to be of duration less than or equal to 365 days. Many histograms for different traps show a pattern similar to this -- lots of data values of not-so-many minutes, and a few token "Not fishing" periods of a LOT of minutes.

The third plot is a plot of the so-called empirical distribution function ("EDF"), and shows the same as the histogram, except along the y-axis, it shows the proportion of data points "explained" thus far, as a function of (here) days. For example, about 95% of the "Not fishing" records represent not fishing periods of less than about 10 days. About two data points represent non-fishing periods in excess of 200 days. This being a "nice" river, those two data points are the two loooong periods of non-fishing in the off season currently recorded for this trap. Also because this is "nice," these are basically the only two abnormal "Not fishing" gaps. These are expected. Finally, note in the "EDF" plot here the long horizontal red line of nothing. This means there were no recorded gaps in fishing of anything between 10-ish days, at the most, and 225-ish days. That makes this river "nice."
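For readers unfamiliar with the EDF, here is a minimal Python sketch of how its points can be computed from a list of gap lengths. The data are invented for illustration and are not taken from any trap; the function names are hypothetical.

```python
def edf_points(values):
    """Return (x, proportion) pairs: the i-th smallest value paired with i/n."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


def proportion_at_or_below(values, x):
    """Proportion of data points less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)


# Invented gap lengths in days: mostly short gaps plus one off-season gap.
gap_days = [1, 1, 2, 2, 1, 3, 2, 1, 5, 230]

points = edf_points(gap_days)
share_under_10_days = proportion_at_or_below(gap_days, 10)
```

A long flat stretch in the plotted points, like the horizontal red line described above, corresponds to a range of x-values that contains no data.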

The fourth and fifth plots are the same as the second and third, except I have restricted the data to focus on the first 95 percent of the data points. I chose "95" arbitrarily. It may not be the best value. Its purpose was to throw out the "by-design" gaps in fishing, as exemplified by the two dots described in the previous paragraph. In this case, once the large "Not fishing"s were thrown out for this trap, the "EDF -- 0 to 95th Percentile" spread out in a nice way. You can see how, for this trap, many of the "Not fishing" periods cluster around "1 day" and "2 days." Given what I've seen for this river before, I suspect that the "2 day" values tie to weekends.
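One simple way to restrict the data to the lowest 95 percent of points is sketched below in Python. How the cutoff was actually computed for the plots is not stated in this thread, so the rank-based rule here is an assumption, as are the function name and data.

```python
def lowest_percent(values, percent=95):
    """Return the smallest `percent` percent of values, by rank.

    Integer arithmetic is used so the number kept is exact.
    """
    ordered = sorted(values)
    keep = (percent * len(ordered)) // 100
    return ordered[:keep]


# Invented gap lengths in days: 19 short gaps and one off-season gap.
gap_days = [1] * 10 + [2] * 7 + [3, 5, 230]

trimmed = lowest_percent(gap_days)
```

With 20 data points, the rule keeps the 19 smallest and drops the single 230-day gap, which is the kind of "by-design" gap the restriction is meant to remove.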

Caswell South Trap, Stanislaus River

The trap on the Stan is similar to the American. It differs from the American trap just described, however, in that there are a lot more large "Not fishing" points, as exemplified by its "EDF" plot. You can see many more blue dots along the top right of the "EDF" plot.

image

Without actually looking at the data, I suspect the many dots here are due to more than the two or three years of data collected for the American trap. This river's EDF plot also shows the "nice" long red horizontal line. I suspect the gap in fishing is not an issue for the Stanislaus, although the second horizontal red line, between 225-ish and 250-ish days, may be worthy of investigation. I suspect that if we were to run the American-trap plot ten years from now, it would look something like this EDF here.

Gate 7, Sacramento River (RBDD)

Now take a look at the series of plots for Gate 7 on the RBDD. Here, the EDF plot demonstrates a whole distribution of "Not fishing" values. It seems this is the pattern of "Not fishing" which may help to determine the large gaps in fishing. Also observe that many of the blue dots along the left of the EDF spill over at least 25, on the x-axis, and maybe even to 40. This pattern is at odds with the behavior displayed for the American, where the left-sequence of blue dots was very well contained to the left. A wider EDF here, so as to see the nuance of the distribution for days between zero and 50, may be better in this case. This will serve as a good start for now.

image

Finally, here is the one png file with all 38 traps. Be sure to not print it. It's not of standard paper size, and will print out over MANY different pages.

theGaps_DO_NOT_PRINT.zip

ConnieShannon commented 8 years ago

What do you think about adding an IF ELSE statement prior to development of the daily average summary? (I'm talking about when the R app divides the total production from all traps by the number of traps.) The IF statement can check to see if more than one trap is fished and, if so, the production is the average of only traps that were fishing and does not include days the catch was imputed because the trap wasn't fishing.

Just a thought.

jasmyace commented 8 years ago

For the RBDD, it's often the case that many traps are operating on any one day. With an increasing number of traps, the probability that a day in question requires imputation on one of those (possibly many) traps increases. This means that for many days, we would be throwing out data points. To be fair, for those days where we throw out a trap estimate for that day, we would also be reducing the denominator by that same number. So, there would be some "self-correcting" in this process.

My concern would center around day(s) where the only estimate of fish we have is due to imputation, regardless of the number of traps operating. (I think this is why you say check if more than one trap is fishing.) I feel like this could happen frequently, say, on the American, over a weekend, where I have previously seen gaps in fishing of two days. What if the two traps over the weekend aren't fishing? (I suspect they try to make sure this doesn't happen, and that one is always going.) What if one is left to fish, but then we have to throw out that one-trap sample? (I'm certain this happens sometimes.) In this case, we would have no estimate for those days, and thus would bias our passage estimate for the entire river downwards, since the passage for those days would get a zero.
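To make the trade-off concrete, here is a hedged Python sketch of the proposed averaging rule. It is not the Platform's R code. It assumes each trap-day estimate carries a flag saying whether it was fully imputed, and it averages the fished traps whenever at least one trap fished; the function name and numbers are hypothetical.

```python
def daily_average(trap_estimates):
    """Average one day's passage over traps that actually fished.

    trap_estimates: list of (passage, was_imputed) pairs, one per trap.
    Returns None when every trap's estimate was imputed, i.e. the day
    would have no estimate at all under this rule.
    """
    observed = [passage for passage, was_imputed in trap_estimates if not was_imputed]
    if not observed:
        return None
    # The denominator shrinks along with the numerator.
    return sum(observed) / len(observed)


# Two fished traps and one imputed trap: the imputed value is ignored.
weekday = daily_average([(100.0, False), (300.0, False), (5000.0, True)])

# A weekend day on which no trap fished: nothing is left to average.
weekend = daily_average([(5000.0, True), (4000.0, True)])
```

If a `None` day were then counted as zero passage, the river-wide total would be biased downward, which is the concern raised above.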

Another concern centers around the fact that we have both full-day imputation and semi-imputation. When I say "full-day" imputation, I mean we imputed for the full 24 hours. But what about the "semi-day" imputation, where we imputed for less than 24 hours? In this case, we have some data, because a trap was fishing for a portion of the day. Do we throw these out, even though they constitute good data with (maybe bad) imputation? We could never chuck all the imputation without also throwing out some valid fishing data.

Finally, it's worth pointing out that it's a minority of imputed values that mess everything up. The problem is that when they go bad, they do so in a VERY LARGE way.

jasmyace commented 8 years ago

Connie suggested identifying the imputed values as a means to differentiate long gaps in fishing, separate from "regular" imputation, which usually covers less than a day's worth of non-fishing. I misunderstood her suggestion as grouping all imputation together, regardless of whether it was a "gap" or not.

In the end, however, it was decided that long gaps in fishing would be identified by updating the catch query sequence. Additionally, long gaps were defined as those longer than 7 days. Finally, this value of 7 days would be set as a global variable within the R code -- this means that 7 could be replaced by any other value at a future time with ease.

In order to identify trapPositions whose fishing sequence contains a gap in fishing of more than 7 days, the trapPositionID of that trap was amended with a decimal suffix. Thus, if a trapPositionID experienced two separate greater-than-7-days gaps in fishing, the fishing periods following the first and second gaps would each receive their own suffix. For example, if a trapPositionID were 12345 before consideration of two greater-than-7-days gaps in fishing, consideration of the two gaps would lead to three separate trapPositionIDs of the form 12345, 12345.01, and 12345.02. These three trapPositionIDs would then be fed to the R sequence of programs.
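The renaming scheme can be sketched as follows, in Python for illustration. The actual change was made in the catch query sequence, so this is only an approximation of the idea. The record layout, the function name, and the choice to drop the long gap record itself are all assumptions.

```python
def split_trap_ids(trap_id, records, threshold_days=7):
    """Relabel a trap's chronological records around long gaps in fishing.

    records: list of (status, duration_days), status being "Fishing" or
    "Not fishing". Each "Not fishing" record longer than threshold_days
    starts a new segment with suffix .01, .02, and so on.
    Assumption: the long gap record itself is dropped from the output.
    """
    labeled = []
    segment = 0
    for status, duration_days in records:
        if status == "Not fishing" and duration_days > threshold_days:
            segment += 1
            continue
        suffix = "" if segment == 0 else f".{segment:02d}"
        labeled.append((f"{trap_id}{suffix}", status, duration_days))
    return labeled


# Invented sequence with two long gaps (90 and 10 days) and one short gap.
records = [
    ("Fishing", 1.0),
    ("Not fishing", 0.5),
    ("Fishing", 1.0),
    ("Not fishing", 90.0),
    ("Fishing", 1.0),
    ("Not fishing", 10.0),
    ("Fishing", 2.0),
]

ids = [row[0] for row in split_trap_ids("12345", records)]
```

The short half-day gap stays inside the first segment, so it would still be handled by ordinary imputation rather than by breaking the spline.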

In order to ensure that efficiency trials, whose trapPositionIDs are not amended in any way, match correctly with the catch spline results, all trapPositionIDs amended for catch splining purposes are transformed back to their original trapPositionID. In this way, all catch data, regardless of trapPositionID, find their appropriate match, if available, in the efficiency data.
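A minimal Python sketch of the back-transformation, assuming the amended IDs are handled as strings with a decimal suffix. The function name and the efficiency value are hypothetical.

```python
def original_trap_id(split_id):
    """Strip any decimal suffix, so "12345.02" maps back to "12345"."""
    return str(split_id).split(".", 1)[0]


# Hypothetical efficiency lookup keyed by the unamended trapPositionID.
efficiency_by_trap = {"12345": 0.02}

# Every split segment of a trap finds the same efficiency record.
matched = {
    split_id: efficiency_by_trap.get(original_trap_id(split_id))
    for split_id in ["12345", "12345.01", "12345.02"]
}
```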

Running the Big Looper on the RBDD, after accounting for gaps in fishing, provides results that appear to be reasonable. Here, column bEst is the estimate for passage, while bLCL and bUCL describe the 95% confidence interval. Keep in mind that the passage estimate for 2005 includes an efficiency of zero for one of the operating trapPositions, which blows up the passage estimate. Remember that we decided to leave this alone for now.

image

Finally, these passage results do not consider two other issues, described elsewhere.

  1. Updated spline knotting methodologies, due to shortening of fishing periods, following consideration of gaps.
  2. The need to multiply recaptures in the estimation of efficiency by two, if those recaptures originated during halfCone operations.

Separate Big Looper runs quantified the effect of those updates, and are described in their respective issues. Otherwise, this change appears to be working as intended.