tmcd82070 / CAMP_RST

R code for CAMP rotary screw trap platform

"Corrected" Bootstrapped Confidence Intervals Reveal Upper End Points that are Less Than Estimates #70

Closed jasmyace closed 8 years ago

jasmyace commented 8 years ago

In investigating the behavior of corrected bootstrap confidence intervals, it has been discovered that rarely, the right end-point of a confidence interval is less than its associated passage estimate. This behavior is clearly not desirable.

Focus at the moment centers on the Sacramento River, 12-01-2001 - 11-30-2002. During this timeframe, we have noticed that sometimes, the ALL Runs report, when run on "day," exhibits this behavior on the first and last day in the overall time series. Sometimes, a different day here or there exhibits this behavior as well. However, it is peculiar that it manifests most often at the start and end of a temporally-based estimation sequence. In theory, other types of ALL Runs reports, e.g., "week," etc. could also demonstrate this behavior.

In the screenshot below, the passage estimate for 5/24/2002, originating from the Fall-run passage.csv, is 121,049, but its confidence interval is (26,146, 79,119): the right end-point falls below the estimate. This day in May was the first day in the requested time frame on which Fall-run fish appeared, i.e., it's the first day of this temporal estimation sequence. Note that the 100 in the column names communicates that this is a bootstrap obtained from 100 iterations (the usual number), while the b communicates that this is from before the code update containing the zero-deletion methodology. Other columns of this much larger spreadsheet (not visible) have prefixes of a, for after the update.
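To make this failure mode easy to spot in future output, a simple sanity check could scan a passage table for intervals whose end-points fail to bracket the point estimate. The CAMP platform itself is R; this is a Python sketch, and the column names are illustrative assumptions, not the actual passage.csv headers.

```python
def flag_bad_intervals(rows):
    """Return rows where lower <= estimate <= upper fails."""
    return [r for r in rows
            if not (r["lower"] <= r["estimate"] <= r["upper"])]

# Hypothetical rows using the two dates discussed in this issue.
rows = [
    {"date": "2002-05-24", "estimate": 121049, "lower": 26146, "upper": 79119},
    {"date": "2002-07-04", "estimate": 83452,  "lower": 51520, "upper": 147207},
]
bad = flag_bad_intervals(rows)   # flags only the 5/24/2002 row
```

A check like this, run over every report, would surface the problem even when it only manifests on one or two days in a long series.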

image

Because this behavior appears in these b columns, we can conclude that it has been around for some time, i.e., since before the implementation of the "deleting zeros" methodology. It probably has not been noticed before both because it is rare and because people have probably become nonchalant about the confidence intervals, due to the exploding-interval problem.

Unfortunately, the exploding-confidence-interval fix failed to correct this new issue -- it still manifests when bootstraps are run with 100, 200, 500, and 5,000 iterations. (Screenshots not shown -- all intervals look similar to the one shown above.) So, simply increasing the number of iterations does not fix the problem. This suggests a deeper issue.

Next steps will include running the ALL Runs report to isolate the actual sampling (for the standard 100 iterations), and investigating the resulting histogram of 100 values for this Fall run and day in May. We will also run this analysis (perhaps with just 100 iterations) on a different year of data for this river, just to make sure the problem is not tied to a few trapping instances with screwy data in this one year. I doubt this issue is confined to just this one set of parameters, but it doesn't hurt to do an easy check.

jasmyace commented 8 years ago

I appear to have resolved this issue. It was, in fact, two separate issues: one attributable to Trent from before I got here, and another to an "update" I made when I first got here.

Recall that investigations here focus on the Sacramento River Fall run, 12-01-2001 - 11-30-2002, trap 42075, which is Gate 7 E. This trap caught a non-zero number of fish starting on May 24, 2002, and ending on September 11, 2002. Investigations focused on what is happening with respect to the bootstrap on May 24th.

Keep in mind that, for a given day, the bootstrap creates 100 iterations / re-estimates of the original point estimate of fish for that day. It does this by considering both the catch and the efficiency. (I'm skipping over a lot of details here.) An ugly, but functional, plot of the result can be seen for this trap's data from July 4, 2002.
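The resampling idea for a single trap-day can be sketched as follows. This is a simplified Python illustration of the concept, not the actual CAMP_RST algorithm (which is R, and also resamples the catch); the efficiency trials here are an assumed mark-recapture-style list of 0/1 outcomes.

```python
import random

def bootstrap_passage(catch, efficiency_trials, iters=100, seed=7):
    """Re-estimate passage for one trap-day `iters` times by resampling
    the efficiency-trial outcomes with replacement, then dividing the
    catch by the resampled efficiency."""
    random.seed(seed)
    n = len(efficiency_trials)
    estimates = []
    for _ in range(iters):
        # resample trial outcomes (1 = recaptured, 0 = not) with replacement
        eff = sum(random.choice(efficiency_trials) for _ in range(n)) / n
        if eff > 0:                       # skip degenerate resamples
            estimates.append(catch / eff)
    return estimates

# e.g., 5 recaptures out of 1,000 released gives an efficiency near 0.005
boots = bootstrap_passage(1731.1, [1] * 5 + [0] * 995)
```

A histogram of the returned estimates is exactly the kind of plot shown for July 4, 2002: a distribution of re-estimated passages scattered around the original point estimate.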

image

Here the distribution of the 100 bootstrapped estimates of passage for this day can be seen. For example, around 5 of the 100 samples produced a passage of around 50,000 (the first bar from the left), while 25 or so produced a passage of around 65,000 (the second bar from the left). You can sum the integer counts of all the white bars to see that they total 100. The original estimate of 83,452 is highlighted via the vertical green line (and itemized explicitly in the title). Additionally, the left and right end-points of (51,520, 147,207) are discernible via the two blue vertical lines (and also itemized in the title). While the distribution here is a little right-skewed, that's okay, since the bootstrap method we utilize (bias-corrected percentile) corrects for this. The numbers 0.33 and 0.62 in the title are unimportant here. This plot for this day looks as it should.
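For reference, the bias-corrected percentile method mentioned above works roughly like this. This is a generic Python sketch of the textbook method, not the CAMP_RST implementation, which may differ in detail.

```python
from statistics import NormalDist

def bc_percentile_interval(boot, theta_hat, alpha=0.05):
    """Bias-corrected percentile interval. The correction z0 shifts
    which percentiles of the bootstrap distribution become the
    end-points, which is how a skewed distribution can still yield a
    valid interval."""
    nd = NormalDist()
    boot = sorted(boot)
    # proportion of bootstrap estimates falling below the point estimate
    p = sum(b < theta_hat for b in boot) / len(boot)
    p = min(max(p, 1e-6), 1 - 1e-6)          # guard the degenerate tails
    z0 = nd.inv_cdf(p)
    lo = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    hi = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))
    return boot[int(lo * (len(boot) - 1))], boot[int(hi * (len(boot) - 1))]
```

Note the limitation relevant to this issue: if the point estimate sits entirely above the bootstrap distribution, p is pinned near 1, z0 is pushed to its positive extreme, and both end-points land at the top of the bootstrap distribution -- still below the estimate. The correction cannot rescue an interval whose underlying distribution is centered in the wrong place.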

Here is the plot for May 24, 2002 -- the problem day. Note how the green passage-estimate line falls outside the blue confidence lines. This is the problem. Note also that this graph didn't print the integer counts along the left -- again, it is an ugly, but functional, graph. The histogram depicts 100 bootstrap samplings of passage for this day, whose passage estimate is 121,049.

image

Looking at this second plot, however, the distribution appears more or less the same as the first one from July 4th. In fact, it doesn't really seem like the bootstrap distribution is wrong at all -- it seems that the green line is in the wrong place. This is where I started to investigate.

Jason's Problem

I took a look at the baseTable for the Fall run output with these data, restricted to May 24, 2002. Just to be clear, the final estimates of passage and confidence intervals on which we're focusing appear in the associated Fall_passage_table.csv.

image

Keep in mind then that for this day, passage is averaged over all functioning traps. On this day, there were three. You can see that (0 + 0 + 363,146) / 3 = 121,049, after rounding.

image

However, note that the efficiency is NA for those days where passage is estimated as zero. This means the data aren't there, i.e., an efficiency curve wasn't estimated for those traps. (This appears to happen occasionally, when the catch for a trap is very low.) Given the lack of a denominator, it seems that the passage here really shouldn't be estimated.

Originally (before I got here), passage estimates for instances like these would, like the efficiency, report an NA; the use of an NA tells R to ignore the value. So, I have put these zeros, which I had coded in, back to NA. In retrospect, this is correct.

I changed the NA to report a zero either because

  1. I thought the zero looked better (:see_no_evil:), or
  2. I needed passage to not be NA for whatever update I had to insert later downstream in the code.

I don't like ugly things, and I feel rather sure my change was only cosmetic, so I feel comfortable simply putting it back. In this case, this means the estimated passage for this day becomes 363,146, since only that one trap was working, which, of course, is correct. :+1:
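The difference the NA-vs-zero coding makes to the daily average can be seen directly. A Python sketch of the distinction (None standing in for R's NA with na.rm=TRUE; not the CAMP_RST code itself):

```python
def mean_passage(per_trap):
    """Average passage over a day's traps, skipping None values
    (the analogue of NA being ignored by R)."""
    valid = [p for p in per_trap if p is not None]
    return sum(valid) / len(valid)

# With the missing values coded as zeros, the non-functioning traps
# dilute the average, as in the baseTable above:
with_zeros = mean_passage([0, 0, 363146])      # rounds to 121,049
# With the NAs restored, only the one working trap contributes:
with_nas = mean_passage([None, None, 363146])  # 363,146
```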

But, the original bootstrap histogram suggested that the green-line passage estimate should decrease instead of increase. :flushed:

So, I kept looking.

Trent's Problem

Trent's bootstrap function creates two giant matrices -- one holds bootstrapped catch estimates, while the other holds bootstrapped efficiency estimates. Each has rows equal to the number of days each trap was running, summed over all traps, while the number of columns equals the number of repetitive samples. So, both matrices currently have 100 columns, since we sample 100 times. The number of rows varies based on the run. Here, however, since a large number of traps was running, each over some variable number of days, there are more than 1,000 rows. Once both of these matrices are created, they are elementwise divided. This means that if we take the 453rd row of the 78th column in each, and divide the total estimated catch in that row and column by the corresponding efficiency, we obtain an estimate of passage for the 78th bootstrapped sample, tied to the 453rd row of data (which corresponds to some particular day on some trap).
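The two-matrix layout can be sketched as follows. This is a toy-sized Python illustration (the real matrices are more than 1,000 x 100, and live in R); the sampling distributions below are assumptions for illustration, not the scheme CAMP_RST actually uses.

```python
import random

random.seed(0)
n_rows, n_iters = 6, 100   # rows = trap-days, columns = bootstrap iterations

# Bootstrapped catch, scattered around the example trap-day's 1,731.1 fish.
catch_boot = [[random.gauss(1731.1, 60) for _ in range(n_iters)]
              for _ in range(n_rows)]
# Bootstrapped efficiency, scattered around 0.0048 (floored to stay positive).
eff_boot = [[max(random.gauss(0.0048, 0.0005), 1e-4) for _ in range(n_iters)]
            for _ in range(n_rows)]

# Elementwise division: entry [i][j] is the passage estimate for
# trap-day i under bootstrap sample j.
passage_boot = [[c / e for c, e in zip(c_row, e_row)]
                for c_row, e_row in zip(catch_boot, eff_boot)]
```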

Generally, there are three situations that occur, given a day and trap:

  1. The trap was operating continuously during the day, and so there is no imputation;
  2. The trap was not operating at all during the day, and so there is full imputation;
  3. The trap was operating off and on during the day, and so there is some imputation.

Scenarios 1. and 2. bootstrap correctly. Scenario 3. does not. Returning to trap Gate 7 E on May 24, 2002, note that 141 + 1,322 = 1,463 legitimately captured fish, while 268.1 were imputed, for a total of 1,731.1 fish. Observe as well that the estimated efficiency here is 0.0048.

image

Based on the description of how bootstrapping works above, the catch numerator should bootstrap around 1,731.1, while the efficiency should bootstrap around 0.0048 -- in this way, both would create distributions around these values. The resulting passage, after division (for each of the 100 samples), would then center its distribution around 1,731.1 / 0.0048 = 360,646 or so, which is exactly where it should be. But we saw above that the distribution on this day centers incorrectly around a passage of 50,000 or so.

Note that if, instead of 1,731.1, I use 268.1 -- the imputed value -- in the numerator, the passage estimate becomes 268.1 / 0.0048 = 55,854. This is, in fact, what the current bootstrap process is doing on those days for which there is both valid catch and imputation -- it's bootstrapping only on the imputed value of 268.1 (by itself, this is correct), but then failing to add in the 1,463 legitimately caught fish. This is why in some cases the bootstrap confidence interval looks wrong. It's not so much that the right endpoint is less than the passage -- it's really that the entire interval sits below where it should. The bootstrapping process itself is technically correct -- it's just centered on the wrong part of the x-axis. Luckily, once discovered and diagnosed, this problem is easily fixed. :smiley:
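The arithmetic of the bug, using the May 24, 2002 numbers from above, is simple enough to write down directly (a Python illustration of the diagnosis, not the R fix itself):

```python
# Values from the May 24, 2002 Gate 7 E example above.
valid_catch = 1463.0     # 141 + 1,322 legitimately captured fish
imputed = 268.1          # imputed catch
efficiency = 0.0048      # estimated trap efficiency

# Buggy numerator: only the imputed portion, which is where the
# mis-centered histogram for this day comes from (~55,854).
buggy_passage = imputed / efficiency
# Corrected numerator: imputed plus the legitimately caught fish,
# recentering the distribution where it belongs (~360,646).
fixed_passage = (valid_catch + imputed) / efficiency
```

The gap between these two centers is exactly the gap between the bootstrap histogram and the green passage-estimate line in the May 24th plot.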

image