tmcd82070 / CAMP_RST

R code for CAMP rotary screw trap platform

Mokelumne River passage reports results in varying estimates of fish #48

Closed jasmyace closed 8 years ago

jasmyace commented 8 years ago

Running the same report with the same date range and time step results in a variety of estimates. This relates to the conversation we had last week where Trent used the analogy of a water storage tank and hose. Connie ran 3 ALLRUN production reports for the Clear Creek Lower RST 1.7 site and got 2 different point estimates. We were not sure if this situation was an artifact of something on her computer, so I ran 3 ALLRUN production reports on my computer using Connie’s criteria. My CAMP.mdb and my platform are stored on my computer, not a server. I also got 2 different point estimates; those numbers were the same 2 numbers Connie got, but the frequency of the numbers was reversed (see the attached spreadsheet). Based on Trent’s prior statements, I have come to expect that there should only be 1 static point estimate when the same criteria are used, so having 2 estimates coming from the same criteria seems hard to explain. It is odd that the same numbers are appearing in my and Connie’s reports, which might suggest that a random selection or dropping of data is not occurring, i.e., the R code is processing the data and, instead of producing 1 point estimate, is producing 2 slightly different numbers given some subtle variation in how it is processing the data. This issue may be new, and may not have occurred in prior versions of the R code, but I am not sure of that.

I note that the difference in the point estimates Connie and I are getting for Clear Creek is relatively small (1,269,477 vs. 1,269,501). There was a similar issue with the Stanislaus River – Caswell State Park data; unfortunately, the point estimate spread is bigger for that data in 2005 (239,060 vs. 238,872), and there are more than 2 production estimates being developed by the R code. The screenshot below provides the point estimates that the Caswell biologist developed with the R code.

[image: screenshot of the Caswell point estimates]

Please try to determine what is causing this issue. We will then need to determine if it is an issue that needs to be fixed, or if it is a low priority and should not be fixed.

tmcd82070 commented 8 years ago

Odd. Very odd.

I know that one source of variation across runs is the Bias-Corrected Bootstraps we compute; however, I cannot remember whether I correct the point estimate, or just the CI.

I give the following a 15% chance of being the source of the problem. Nonetheless, you (Jason) could comment out the 8 - 10 lines of code where we compute the bias-corrected bootstrap and see whether that provides the same point estimate.

My next guess is that we have somewhere pushed numbers out of range. We are dealing with some tiny numbers sometimes, and perhaps a tiny rounding error or out-of-range number is the issue. Once numbers get below 1e-7 or so, they essentially contain random digits. It is unfortunate that we must run 32-bit R because we are using ODBC with Access. With MS-SQL, we could use 64-bit Windows, and this would give us many more numbers and potentially fix this (if indeed this is the problem).

First things first: Jason - You must determine whether the primary data sets pulled from Access are exactly the same or not. If not, the problem is ODBC and Access. If so, the problem is somewhere in our R code.

jasmyace commented 8 years ago

It took four runs, but I was able to obtain both estimates for the Spring run, i.e., 1,269,477 for three runs and 1,269,501 for the fourth. I think it's worthwhile to note that the one estimate occurs (for me) about 75% of the time. I think this proportion may be tied to the issue. I do not think it has anything to do with how R communicates with Access, or database size issues, etc.

For the four runs, I output Connie's final Temp table itemizing the measured/assigned counts of fish. When looking at the numbers obtained in the first set versus the second, I found that the data appeared the same, although the sort order was different. I don't think the sort order is causing the discrepancy; I do, however, take the resulting differences in sort order as evidence of something variable happening before they are output to the database. Keep in mind that this Temp table itemizes out counts of fish prior to plus-counting.
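The "same data, different sort order" check described above can be sketched as follows. This is an illustration in Python rather than the platform's R, and the rows and column layout are made up; they only stand in for the Temp table output of two runs.

```python
from collections import Counter

def same_rows_ignoring_order(rows_a, rows_b):
    """Return True when both tables hold the same rows the same number of times."""
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))

# Hypothetical rows: (trap_visit_id, run, life_stage, count). Not real CAMP data.
run_1 = [(101, "Spring", "Fry", 5), (101, "Fall", "Fry", 12), (102, "Fall", "Parr", 3)]
run_2 = [(102, "Fall", "Parr", 3), (101, "Spring", "Fry", 5), (101, "Fall", "Fry", 12)]

# The row order differs, but the multiset of rows is identical.
print(same_rows_ignoring_order(run_1, run_2))
```

Comparing multisets of rows rather than sorted lists also catches a row that is duplicated in one run and not the other.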

Next, for both runs, I compared the resulting baseTable of daily fish output, with its accounting of fish. On 2/22/2014, the 1,269,477-run has 5 assigned fish, and one unassigned, while the 1,269,501-run has 6 assigned fish, and one unassigned. This one extra fish, when coupled with the 0.0501 efficiency on this day, leads to an extra 20 fish. The remaining four fish (when comparing 1,269,501 to 1,269,477) are due to two imputations (so two days) that occur near 2/22/2014. So, the bump of that one extra assigned fish on 2/22 (in addition to the +20 on 2/22) adjusts the imputation just slightly (< 1 fish) for two neighboring days, leading to 4 extra total fish. So, that is how the numbers are different. This does not explain the why.
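The arithmetic above can be checked quickly. A minimal sketch in Python (not the platform's R), assuming the daily expansion is simply catch divided by trap efficiency, which is consistent with the 20-fish figure but is my assumption about the calculation:

```python
efficiency = 0.0501        # trap efficiency on 2/22/2014, from the comparison above
extra_assigned_fish = 1    # the one extra assigned fish in the 1,269,501 run

# Assumed expansion: catch / efficiency.
expanded_extra = extra_assigned_fish / efficiency

total_gap = 1_269_501 - 1_269_477
# Whatever the 2/22 expansion does not explain is attributed to the two imputed days.
imputation_gap = total_gap - round(expanded_extra)

print(round(expanded_extra, 2), total_gap, imputation_gap)
```

Under that assumption, the one extra fish accounts for about 20 of the 24-fish gap, leaving 4 for the two neighboring imputed days.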

I think the fact that one mystery assigned fish appears is tied to the plus-count routine...somehow. I note that, in the R messages output during a run, "Number of total fish before expanding and assigning jointLevs = 861957," and yet "Number of fish after expanding and assigning jointLevs (should match count before expansion)= 861960." I don't know what happens when these two values are (unexpectedly?) not equal.

Finally, it's also worth pointing out that there is one small place where randomization occurs in the plus-count routine -- rounding. The plus-count routine contains this telling comment: "Randomly allocate the rounding error to classes." I take this to mean that little fractional bits of fish get randomly assigned to different classes (not sure if this is lifeStage or run or both), and so, by design, small changes in output could occur...maybe? In fact, this is why I think that if I were to run the data several different times, the two (three, depending on class?) different numbers obtained would settle to some percentage of runs, which may possibly tie to the percentage of times the random allocation of extra fractional fish gets assigned to this class, or that one (or maybe another one).
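To illustrate the hypothesis, here is a minimal sketch in Python, not the actual plus-count routine. The function name, the class proportions, and the uniform choice of class are all assumptions; the real R code may weight or order the classes differently.

```python
import math
import random
from collections import Counter

def allocate_randomly(total, proportions, rng):
    """Floor each class's share, then hand each leftover whole fish to a random class."""
    counts = [math.floor(total * p) for p in proportions]
    leftover = total - sum(counts)
    for _ in range(leftover):
        counts[rng.randrange(len(counts))] += 1  # assumed: every class equally likely
    return tuple(counts)

# Hypothetical example: 7 plus-count fish split over three classes.
proportions = (0.5, 0.3, 0.2)
rng = random.Random(48)  # seeded only so the sketch is repeatable
outcomes = Counter(allocate_randomly(7, proportions, rng) for _ in range(1000))

for counts, n_runs in sorted(outcomes.items()):
    print(counts, n_runs)
```

The total is always 7, but the single leftover fish lands in a different class from run to run, so only a few distinct results appear and each settles to a stable share of runs. That is the same pattern as seeing one estimate about 75% of the time.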

So, as requested, I stopped here with a working hypothesis. I could continue to dig deeper, but suspect this information will be useful to Trent to perhaps guide a fix/next steps.

jasmyace commented 8 years ago

DataRunInvestigations.xlsx

tmcd82070 commented 8 years ago

Nothing special happens when the count into PlusCounting does not equal the count out of PlusCounting. Output of these counts is just a check, and it seems to have served its purpose in this case.

Jason: Dig into PlusCounting and find out why the count in and count out are different. It could easily be the random allocation of rounding error to different categories. We will need to explain this to Doug and Connie in order to formulate a solution. Perhaps we take out the random assignment of left-over fish and assign to a fixed category.
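A minimal sketch of the fixed-category idea, in Python rather than the platform's R. The function name, the `fixed_class` parameter, and the example proportions are hypothetical, and the sketch assumes the proportions sum to 1.

```python
import math

def allocate_fixed(total, proportions, fixed_class=0):
    """Floor each class's share, then give every leftover fish to one fixed class."""
    counts = [math.floor(total * p) for p in proportions]
    counts[fixed_class] += total - sum(counts)
    # Count in must equal count out, mirroring the check PlusCounting prints.
    assert sum(counts) == total
    return counts

# Hypothetical example: 7 plus-count fish split over three classes.
print(allocate_fixed(7, (0.5, 0.3, 0.2)))
```

Because nothing is drawn at random, repeated runs with the same inputs return the same counts. The cost is a small, consistent bias toward whichever class is chosen as the fixed one.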

On Sat, Oct 3, 2015 at 2:58 PM, Jason notifications@github.com wrote:

DataRunInvestigations.xlsx https://github.com/tmcd82070/CAMP_RST/files/7227/DataRunInvestigations.xlsx

Trent McDonald, PhD Senior Statistician

ConnieShannon commented 8 years ago

When you said, "For the four runs, I output Connie's final Temp table itemizing the measured/assigned counts of fish. When looking at the numbers obtained in the first set versus the second, I found that the data appeared the same, although the sort order was different," was this in regard to the output to the final temp table? If so, we could try working with the sort order in the SQL so that the data are always output in the same order. Let me know if you want me to modify the SQL.

jasmyace commented 8 years ago

Q: Was this in regard to the output to the final temp table? A: Yes, table TempSumUnmarkedByTrap_Run_Final.

I don't think the sort order matters at all in terms of data processing, and it doesn't matter to me. I use this table a lot in checking things, and the fact that I never noticed the sort order varying between runs implies (to me) that it doesn't require an update/modification.

ConnieShannon commented 8 years ago

I must have misunderstood. I thought the sort order was causing the problem of variation in the final prod. est. If it isn't broken we shouldn't fix it...right?

jasmyace commented 8 years ago

From Doug:

"Issue 2: Variable point estimates coming from the salmon production reports

Threloff has told the field biologists they should expect that the point estimates coming from their salmon production reports are static numbers, e.g., if they run a production report using the same criteria, they should get the same point estimate. In fact, the point estimates may not be static, and they can vary by a relatively small number of salmon depending on how the R code processes plus count salmon. In some cases, the R analysis may encounter a partial salmon that is interpolated as the plus counts are allocated, and that partial salmon could result in variable point estimates as daily catch is expanded by the trap efficiency. Trent believes this underlying issue has always been present in the R code, while Connie believes the issue may have arisen more recently. Doug believes there may not have been a serious, prior effort to rerun the same report several times using the same criteria to see if the point estimates were static or slightly variable.

We discussed 3 potential solutions:

Option A) Educate the biologists and inform them their point estimates are likely to always be the same if the same report criteria are used, but on occasion they may not be the same. With this option, Doug and Connie should periodically touch base with the biologists to assess how variable the point estimates are, if such variation occurs. Trent believes the variable estimates should vary by a relatively small amount, e.g., less than 100 salmon, which, in the case of a 2,000,000 salmon production estimate, is relatively small and inconsequential. Doug and Connie note that a fixed number, e.g., 100 salmon, may not be a good benchmark for assessing variability, and a percentage may be a more informative metric in watersheds where the production estimates are relatively small, e.g., Caswell State Park.

The advantages of this solution are: (1) no remedy is required, and (2) it will not require the expenditure of funds that have been obligated to the West subcontract.

The disadvantages of this approach are: (1) biologists may lose some confidence in the CAMP RST platform production estimates because they have previously been told the production estimates should be static, but they are now told they can vary in some circumstances.

Option B) Have the biologists assign a salmon run and life stage to each record where there is a plus count.

The advantages of this solution are: (1) it would likely result in no cases where there are variable production estimates, and (2) it will not require the expenditure of funds that have been obligated to the West subcontract.

The disadvantages of this approach are: (1) it would defeat one of the fundamental advantages of the R code, i.e., in an automated fashion the R code assigns a salmon run and life stage to the plus count salmon based on the proportions of measured and attributed salmon.

Option C) Monitor and track the observations of the field biologists, and if they note variances in the point estimates that exceed 100 salmon or 0.5% among the production estimates, revise the R code so it traps and resolves the issue of having a partial salmon that causes a variation in the point estimate.

Resolution: For now, we will adopt Option C, i.e., monitor and track the observations of the field biologists, and if they note variances in the point estimates that exceed 100 salmon or 0.5% among the production estimates, revise the R code on an as-needed basis."
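The Option C trigger could be checked with something like the following. This is a minimal sketch in Python, not part of the platform. The function name is hypothetical, and measuring the percentage against the smallest estimate is an assumption, since the notes do not say which denominator to use.

```python
def needs_code_revision(estimates, max_fish=100, max_pct=0.5):
    """Flag a set of repeated-run point estimates that breaches either threshold."""
    spread = max(estimates) - min(estimates)
    pct_spread = 100.0 * spread / min(estimates)  # assumed denominator: smallest estimate
    return spread > max_fish or pct_spread > max_pct

# Point estimates quoted earlier in this issue.
clear_creek = [1_269_477, 1_269_501]
caswell_2005 = [239_060, 238_872]

print(needs_code_revision(clear_creek))
print(needs_code_revision(caswell_2005))
```

With the numbers reported in this issue, the Clear Creek spread of 24 fish stays under both thresholds, while the Caswell 2005 spread of 188 fish exceeds the 100-salmon threshold even though it is well under 0.5%.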