tmcd82070 / CAMP_RST

R code for CAMP rotary screw trap platform

Preemptively Identifying Common Errors in the Estimation of Passage #94

Closed jasmyace closed 7 years ago

jasmyace commented 8 years ago

OVERVIEW: Data are sometimes not sufficiently prepared for analysis when estimating passage. Although the R code contains several handlers for these contingencies, all communicative error messages are currently buried within the text R.out file and require investigation. This makes it difficult for users not trained in the procedure to identify next steps and rectify problems; i.e., dealing with errors is not user friendly.

RESOLUTION: Check for common errors in the Visual Basic tied to the Platform before calling the R code for passage.

DETAILS: Working with CAMP biologists, identify a list of common problems that interfere with the estimation of passage, and handle them gracefully when they arise.

Generally in error checking, problems are "errors" or "warnings." Errors are fundamental flaws which cause the code to break, and with good reason. Warnings are (maybe) problems that the user should know about so that he or she can decide what to do about them (which may be nothing). I try to differentiate between errors and warnings below.

  1. Lack of requisite data -- Error

    To me, these are the most common: before I perform a passage estimate, I'm usually not aware that a river has no efficiency trials, or that catch data either do not exist at all or are technically unavailable, due to the selective use of the includeCatchID codes. Perhaps these errors are less prevalent for biologists running data on their own rivers (who, I imagine, are generally more aware), but for me, these are the big ones that would benefit from this upgrade.

    1. No efficiency tests found.
    2. No trap visits found.
    3. No catch records found.

    Note that "catch records" can be defined in two separate ways: No catch at all in the specified date range, and no catch records that have the correct includeCatchID. Checking for both of these could be useful, depending on the investigations of the biologist.

    It would also be useful to think about the inclusion criteria for a catch record -- we have seen that sometimes, catch records span multiple days. I think the general rule is to include it if that catch's EndTime falls within the specified date range, but it might be worthwhile to think about what to check in these cases where catch doesn't start and end within one day.

    Finally, note that pre-emptively checking these, prior to Connie's queries, is tricky. The data we care about are always a subset of what a simple date-based data pull will find. Said another way, to do this right, you have to actually run Connie's queries to apply the conditions we need so that the data are appropriate for use in analysis. There could be plenty of efficiency records for a date range, but none after Connie's queries sweep through and apply our rules. I'm not seeing how we can check these correctly and consistently every time without first calling Connie's queries. But then the R code would re-run all these queries in the process of estimating passage. This isn't great. I would be curious to hear what Connie thinks.
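
For illustration, here is a minimal Python sketch of the kind of pre-check described above. The production version would live in the Visual Basic against the Access database; the table names (TrapVisit, ReleaseTrial, CatchRaw) and column names here are hypothetical stand-ins, not the real CAMP schema.

```python
import sqlite3

def check_requisite_data(conn, start_dt, end_dt):
    """Return error strings for missing requisite data in [start_dt, end_dt].

    Table and column names are hypothetical stand-ins for the real schema.
    """
    problems = []
    cur = conn.cursor()

    def count(sql, *params):
        cur.execute(sql, params)
        return cur.fetchone()[0]

    if count("SELECT COUNT(*) FROM TrapVisit WHERE visitTime BETWEEN ? AND ?",
             start_dt, end_dt) == 0:
        problems.append("ERROR: no trap visits found in date range.")
    if count("SELECT COUNT(*) FROM ReleaseTrial WHERE releaseTime BETWEEN ? AND ?",
             start_dt, end_dt) == 0:
        problems.append("ERROR: no efficiency tests found in date range.")
    # Distinguish 'no catch at all' from 'no catch with the correct includeCatchID'.
    n_all = count("SELECT COUNT(*) FROM CatchRaw WHERE visitTime BETWEEN ? AND ?",
                  start_dt, end_dt)
    n_inc = count("SELECT COUNT(*) FROM CatchRaw WHERE visitTime BETWEEN ? AND ? "
                  "AND includeCatchID = 1", start_dt, end_dt)
    if n_all == 0:
        problems.append("ERROR: no catch records found in date range.")
    elif n_inc == 0:
        problems.append("ERROR: catch records exist, but none with includeCatchID = 1.")
    return problems
```

As noted, a check like this against the raw tables would miss problems that only appear after Connie's queries apply their conditions.
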

  2. Checking for appropriate data in key data fields -- Warning

    We have found that sometimes QA/QC has failed to identify issues in the data. In these cases, data lack key features. For example, negative SampleMinutes values lead to problems, as do missing StartTime and EndTime values. Given a date range, and the general rarity of these types of issues, it may be useful to identify problems. However, I'm not certain we would want to stop the code -- just make the user aware of data non-compliance. Without running Connie's queries first, a check of this type will never check the correct data; if anything, it would check a larger set of data than that which would ultimately be used. Simply stopping everything would be overkill. Additionally, sometimes the code runs regardless, and the results it provides may still be useful.
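
The kind of soft check described here might look like the following Python sketch. It only collects warnings and never stops anything; the field names (SampleMinutes, StartTime, EndTime) follow the issue text, but the record format is an assumption.

```python
def field_warnings(records):
    """Flag suspect values in key fields without stopping the run.

    `records` is an iterable of dicts with hypothetical keys
    'SampleMinutes', 'StartTime', and 'EndTime'.
    """
    warnings = []
    for i, rec in enumerate(records):
        # Negative sampling durations are known to cause downstream problems.
        if rec.get("SampleMinutes") is not None and rec["SampleMinutes"] < 0:
            warnings.append(f"record {i}: negative SampleMinutes ({rec['SampleMinutes']})")
        # Blank start/end times (seen on the Feather) also cause trouble.
        for field in ("StartTime", "EndTime"):
            if not rec.get(field):
                warnings.append(f"record {i}: missing {field}")
    return warnings
```
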

  3. Problems with the temp tables -- Error

    This "error" is tricky. Connie's queries always work -- unless the data being fed to them are screwy. This could occur when, say, no efficiency tests are found. In this case, certain tables are not created, and so Connie's queries break. (In reality, in this case with efficiency trials, Connie's queries don't actually break, but you get the idea.) The better solution is to identify all the problems that can go wrong with building the tables. But that could be a tall order. We have seen that sometimes the data are wonky, e.g., negative SampleMinutes, blanks in StartTime and EndTime (usually on the Feather), but of course many other things in many other places can go wrong as well.

    As a minimal solution that emphasizes the temp tables, I would suggest code that checks for the existence of all the temp tables prior to implementation. This list of tables is constant, given an R package campR version. However, when we upgrade things, we may end up expanding this list of tables. So, if we decide to check for temp-table existence, we need to make sure we can get back into this section of Visual Basic code for easy updates.

    If you decide to go this way, Connie REALLY needs to check to make sure this list includes all of the tables:

    NonTrapSample1, TempChinookSampling_a, TempChinookSampling_b, TempChinookSampling_c, TempChinookSampling_d1, TempChinookSampling_d2, TempChinookSampling_d3, TempChinookSampling_e, TempChinookSampling_f, TempChinookSampling_g, TempChinookSampling_h, TempChinookSampling_i_final, TempEffortSummary_a, TempEffortSummary_b, TempNonSamplingSummary, TempRelRecap_1, TempRelRecal_3, TempRelRecap_final, TempReportCriteria_Release, TempReportCriteria_Trapvisit, tempSamplingCHN, TempSamplingSummary, TempSamplingSummary_Final, TempSumUnmarkedByTrap_Run_a, TempSumUnmarkedByTrap_Run_b, TempSumUnmarkedByTrap_Run_Final, TempSumUnmarkedByTrap_Run_Final2, TempSumUnmarkedByTrap_Run_X_Final, TrapSample1
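
A minimal existence check could look like this Python sketch. It is shown against SQLite's catalog purely for illustration; an Access implementation would query its own system objects instead, and the table list is abbreviated here (the full set is given above and must be kept in sync with the campR version).

```python
import sqlite3

# Abbreviated list for illustration; the full set of temp tables is given
# in the issue text and must track the installed campR version.
EXPECTED_TEMP_TABLES = [
    "NonTrapSample1",
    "TempRelRecap_1",
    "TempRelRecap_final",
    "TempSamplingSummary",
]

def missing_temp_tables(conn):
    """Return the expected temp tables that are absent from the database.

    Uses sqlite_master for illustration only; Access would need its own
    catalog lookup.
    """
    cur = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    present = {row[0] for row in cur}
    return [t for t in EXPECTED_TEMP_TABLES if t not in present]
```
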

  4. Date range stuff -- Error or Warning, depending
    1. Too long -- note that our date-range check stops if the difference between the start date and end date is greater than 366 days. This allows for true annual estimates to include leap days in leap years. Very explicitly: date ranges can be 366 days.
    2. Too short -- we do not have any checks that look for date ranges that are too short, although Connie may have something in her queries. I believe that if you put a date range from 1/1/2015 – 1/1/2015, the code will generate estimates for that one day (provided there are catch data with the appropriate includeCatchID and efficiency-trial data). If the requisite data do not exist, then this defaults to number 1. above.
    3. End date earlier than start date -- to ensure that the time frame is less than or equal to 366 days, we calculate dt.len = end.dt - start.dt, where dt.len is the length of the time period in days. If the end date is earlier than the start date, I'm not readily sure what happens. We currently only check for positive date differences greater than 366 days. I suspect that when dt.len is negative (so that the end date is earlier than the start date), the code continues and hits Connie's queries (which occur immediately after the dt.len calculation), which then return no efficiency data (because the dates are screwy). So, in this case, the code stops and defaults to number 1 above, having resulted in a lack of efficiency data.
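
The three date-range rules above can be sketched as follows (a Python illustration; the severity assigned to a same-day range is an assumption, since the text says a one-day estimate is legitimate when data exist):

```python
from datetime import date

MAX_RANGE_DAYS = 366  # allows a true annual estimate to span a leap day

def check_date_range(start_dt: date, end_dt: date):
    """Return (level, message) tuples for the date-range rules above."""
    problems = []
    dt_len = (end_dt - start_dt).days  # mirrors dt.len = end.dt - start.dt
    if dt_len < 0:
        problems.append(("ERROR", "end date is earlier than start date"))
    elif dt_len > MAX_RANGE_DAYS:
        problems.append(("ERROR", f"date range exceeds {MAX_RANGE_DAYS} days"))
    elif dt_len == 0:
        # A one-day range is allowed, but worth flagging to the user.
        problems.append(("WARNING", "start and end date are the same day"))
    return problems
```
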
  5. Selecting the wrong subsite in the drop-down -- Warning

    Some of the subsite drop-down options in the Platform contain locations that I've never used. Are these appropriate for estimating passage? Should we check for appropriateness? Or could the drop-down list be whittled? Keep in mind I only ever use the Big Looper subsites, and so several of the drop-down options that I personally never use may actually be appropriate options.

  6. File name has unacceptable characters -- Error

    In my coding, I never have this problem, because my filenames are generated programmatically in the Big Looper, or because I name files in the actual Platform interface very simply. So, I cannot really rank this. What do people enter? Question marks and the like? I'm not sure what the naming rules need to be -- do they only need to respect R rules, or Visual Basic rules as well?

    In any case, restricting filenames to letters, underscores, and numbers usually does the trick. Not starting the filename with a number should probably be specified as well. Finally, we may also need to check for filename length. Windows breaks if the specified filepath + filename is greater than 255 characters. So, if the Output folder where Platform-enabled output goes is buried deep within a file structure, the filename itself can only handle so many characters, i.e., 255 minus the length of the filepath. I find this is often a tricky problem to diagnose. So, maybe we restrict the filename length? Krista would have thoughts on all of this.
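
A sketch of the filename rules proposed above (Python for illustration; the 255-character figure is the limit cited in this issue, and the exact rule set is still to be confirmed with Krista):

```python
import re

MAX_WINDOWS_PATH = 255  # conservative limit cited in the issue

# Letters, digits, underscores only; must not start with a digit.
FILENAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def filename_problems(output_dir, filename):
    """Check a proposed output filename against the rules sketched above."""
    problems = []
    if not FILENAME_RE.match(filename):
        problems.append("filename may only contain letters, digits, and "
                        "underscores, and must not start with a digit")
    # +1 accounts for the path separator between folder and filename.
    if len(output_dir) + 1 + len(filename) > MAX_WINDOWS_PATH:
        problems.append("full path exceeds the Windows path-length limit")
    return problems
```
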

Krista mentions some other things she has seen. I'm not sure I readily understand / agree with them.

  1. Final run assignment was not run

    (Krista) I’m not sure this actually causes the analysis to abort, but I have run analysis with the vast majority of fish not having a final run, and it took me awhile to figure out why it was giving me such low production estimates.

    (Jason) We could look at the distribution of FinalRun and lifeStage (to see if there are blanks, if a lot of the specified catch has the wrong includeCatchID code, if FinalRun contains data other than Fall, Late Fall, Winter, Spring, etc.), but I feel like for this to occur, Connie's queries must first be run, which defeats the purpose of a quickie beginning check.

    I'm not readily sure how a lot of fish fall out of analysis. I want to say that if a majority of fish do not have a final run, they should be Unassigned, and so plus-counting should come into play. I'm not sure I understand how Krista loses fish, although I am biased to running production on rivers with good data (via the Big Looper). I would be curious to know how she loses all her fish in cases like this.
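
The distribution check on FinalRun floated above might be sketched as follows (Python for illustration; the run names follow the Fall, Late Fall, Winter, and Spring categories mentioned, and the treatment of Unassigned is an assumption):

```python
from collections import Counter

# Hypothetical set of acceptable FinalRun values, per the discussion above.
VALID_RUNS = {"Fall", "Late Fall", "Winter", "Spring", "Unassigned"}

def final_run_summary(final_runs):
    """Tabulate FinalRun values and flag blanks or unexpected codes."""
    counts = Counter((r if r else "<blank>") for r in final_runs)
    unexpected = {r: n for r, n in counts.items()
                  if r not in VALID_RUNS and r != "<blank>"}
    return counts, unexpected
```
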

  2. Long time periods without catch or before catch is expected (this does not cause the R code to abort, just causes it to stall out)

    (Jason) I like to think these will be minimized by our including the 7-day gap-in-fishing criteria. Krista would not have this update yet. We could check to see if, given a date range, catch exists in the time period immediately before and after the specified range. This way, we could help biologists capture extra fish data about which they may not be aware.

    We could also notify the user if the possibility of a 7-day gap exists. This could be a bit tricky if a user chooses to manipulate the value of 7 days in the R code, since the value of 7 in the Visual Basic would be fixed. Note, however, that changing away from the value of 7 days could only be performed by the most advanced of users. As such, this is probably not a practical concern.
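
The gap check could be sketched as follows (Python for illustration; the 7-day threshold mirrors the default described above, which an advanced user could change in the R code):

```python
from datetime import date, timedelta

GAP_LIMIT = timedelta(days=7)  # matches the default gap-in-fishing criterion

def fishing_gaps(visit_dates, limit=GAP_LIMIT):
    """Return (gap_start, gap_end) pairs where consecutive trap visits
    are more than `limit` apart."""
    dates = sorted(visit_dates)
    return [(a, b) for a, b in zip(dates, dates[1:]) if b - a > limit]
```
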

  3. Critical errors were not corrected

    (Jason) I think this refers to problems in the R code not getting rectified quickly, as far as the biologists are concerned, since there is a delay between problem identification and problem resolution, i.e., when the next CAMP version is pushed to biologists. Hopefully these will become fewer and farther between as we keep cleaning up the code. Of course, these types of errors are usually very river-specific, and so can't be checked prior to actually running the code.

ConnieShannon commented 8 years ago

Thank you very much Jason. This is a very good start. I think the last few errors are related to the data itself and not the R app. For example, there is a query in the QC application that identifies potential "critical errors". I won't address those here.

What do you think about checking for the following?

- Date range problems
- No valid efficiency tests found.
- No valid trap visits found.
- No valid catch records found.

In regard to the date fields not being null: I think this happens when the first trap visits in the series, based on user-provided dates, aren't trap set records. I may also be able to check for this.

Regarding selecting the wrong sub-site in the drop-down -- Warning: Jason, you are right; there are all sorts of sites listed in the drop-down that are not used for data analysis. The same problem exists for Taxon, Date range, and Run. That is, unless we intend to provide a production analysis for goldfish, for example.

My suggestion for this issue is to change the source of the drop-down. This is a criteria form and therefore should display options found in the data; it should not be based on the lookup tables. In other words, Karen and Mike used the site table as the source for the drop-down. It makes more sense to populate the option list from a list of sites based on the trap visit table. This way, the only options presented are those for which data are available. This is simple to do in Access, and I can build the SQL for it if that helps.

Under issue 3, Problems with the temp tables -- Error: I'd rather wait and see how many errors we get after we make the above changes. As you point out, checking all the temp tables would have to be done after data go out to R, and this is what we are trying to avoid.

Connie

jasmyace commented 8 years ago

Maybe for these:

- No valid efficiency tests found.
- No valid trap visits found.
- No valid catch records found.

using the query criteria doesn't matter.

For example, we could apply our checks on the larger data tables, prior to merging and so on, by at least restricting by date...and any other "easy" criteria. While this would not check the correct query results resulting from merges, it would at least prevent having to run all the queries to get those data, which is what we're trying to avoid.

Checking the larger data tables will identify at least some issues (but not all, I feel). For example, if there is no catch in a date range on a larger data table, there will still be no catch in the same date range on a subset of that data, after applying merges and so on. While doing the checks on the larger unrestricted tables is not elegant, it may capture enough problems to be worthwhile. Said another way, the return from checking more (not-quite-right) data may be worth the effort, when compared to running both sets of queries twice.

(An hour later...)

As I think about it more, why couldn't we have a mix-and-match approach? At the end of a passage report, we know the Platform can read a statement in the R.out file, and report the number of files generated, and communicate that to the user. Why couldn't we, if an error occurs, trigger something in the Platform, so as to report the error in the (usually) last line of the R.out file? I feel like Visual Basic must have some type of error handling function. If so, we could then code nicely worded pop-ups to the user that a table wasn't found, or no releases were found, and so on when an error occurs. Of course, we would also exit everything nicely. In this way, we could check the data on the correctly parsed data, in the most practical spot in the code. This would ensure that checks occur where appropriate, and so Connie's queries need only run once. Of course, we would also do the "easy-checks" before the R code starts, directly in Visual Basic.

So, we would stagger the checks to maximize the already written sequence of code. One set of "easy checks" written in Visual Basic would look for date-range issues, and the like, before the R starts. A second error-handler (encased in the Visual Basic) would parse out the communicative errors written to the R.out file, with a "last-resort" message interpreting true errors we have yet to see, and for which we generally cannot hard-code a pop-up action. However, when these occur, they can be added to the list so that the error handler gets smarter as time goes on.
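
The staggered error-handler idea might look like this in sketch form (Python for illustration; the Visual Basic would do the equivalent, and the error patterns and friendly messages here are hypothetical examples of the hard-coded pop-up mapping):

```python
# Hypothetical map of known communicative errors written to R.out,
# keyed by a substring to match, valued by a user-facing message.
KNOWN_ERRORS = {
    "No efficiency trials": "No efficiency trials were found for the "
                            "selected dates. Passage cannot be estimated.",
    "Table not found": "A required temp table was missing. Please re-run "
                       "the data preparation step.",
}

def interpret_rout(lines):
    """Return a friendly message for the first known error in R.out,
    a last-resort message for an unrecognized error, or None."""
    for line in lines:
        if line.startswith("Error"):
            for pattern, message in KNOWN_ERRORS.items():
                if pattern in line:
                    return message
            # Last-resort message for errors we have yet to see; these
            # get added to KNOWN_ERRORS so the handler gets smarter.
            return ("An unrecognized error occurred; see R.out for "
                    "details: " + line.strip())
    return None  # no error found
```
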

ConnieShannon commented 8 years ago

Jason: I'm not sure I understand what you mean by "using the query criteria doesn't matter." We must be thinking of two different things. We might need a phone call. I can use very simple queries to do the things I describe. It would not require running all of the SQL twice.

Also, this morning Doug and I were talking, and there is a good solution to the missing temp table problem. In our preliminary checking code, we add a section that checks for missing temp tables; if one is missing, it is created. No need to halt the process and/or wait for user feedback. It would work like this:

1. Run code to check if temp table 1 exists.
2. If it does, move to the next check.
3. If it doesn't, run SQL to build it.

Since we are running this VB from inside the user interface, we can use a regular make-table query. The same process would be used to check for all temp tables.
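
Connie's check-then-create approach might be sketched as follows (Python/SQLite for illustration; the real Platform would run Access make-table queries, and the build SQL shown here is hypothetical):

```python
import sqlite3

# Hypothetical build statements, abbreviated to two tables; the real
# Platform would use its own make-table queries for the full list.
TEMP_TABLE_SQL = {
    "TempRelRecap_1": "CREATE TABLE TempRelRecap_1 (id INTEGER)",
    "TempSamplingSummary": "CREATE TABLE TempSamplingSummary (id INTEGER)",
}

def ensure_temp_tables(conn):
    """Create any missing temp table instead of halting the process."""
    created = []
    for name, build_sql in TEMP_TABLE_SQL.items():
        cur = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (name,))
        if cur.fetchone() is None:
            conn.execute(build_sql)
            created.append(name)
    return created
```

A second call returns an empty list, since every table now exists; no user feedback is needed either way.
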

ConnieShannon commented 8 years ago

Krista wrote: I imagine Jason will be able to put together a more comprehensive list, but here are some of the ones I’ve run into, or heard of other people running into.

  1. No efficiency tests found
  2. No trap visits found
  3. Database doesn’t have temp tables (example: TempRelRecap_1)
  4. Date range too long or too short
  5. Date range with end date earlier than start date
  6. No trap visit set to “include in analysis=no”
  7. Critical errors were not corrected
  8. Final run assignment was not run (I’m not sure this actually causes the analysis to abort, but I have run analysis with the vast majority of fish not having a final run, and it took me awhile to figure out why it was giving me such low production estimates)
  9. File name has unacceptable characters
  10. Long time periods without catch or before catch is expected (this does not cause the R code to abort, just causes it to stall out)

Thanks, Krista

ConnieShannon commented 8 years ago

Connie responds to Krista

  1. No efficiency tests found -- Yes, user criteria form needed.
  2. No trap visits found -- Yes, user criteria form needed.
  3. Database doesn’t have temp tables (example: TempRelRecap_1) -- Yes (see my comments above).
  4. Date range too long or too short -- Yes; checking user-entered criteria should be the first step.
  5. Date range with end date earlier than start date -- Yes; checking user-entered criteria should be the first step.
  6. No trap visit set to “include in analysis=no” -- No; this is covered by check #2.
  7. Critical errors were not corrected -- Not sure.
  8. Final run assignment was not run (I’m not sure this actually causes the analysis to abort, but I have run analysis with the vast majority of fish not having a final run, and it took me awhile to figure out why it was giving me such low production estimates) -- This may be possible and should return only a prompt that informs the user and asks if they want to continue anyway or stop, because I think Jason said the program will not fault when this occurs.
  9. File name has unacceptable characters -- Please explain.
  10. Long time periods without catch or before catch is expected (this does not cause the R code to abort, just causes it to stall out) -- Not sure.

jasmyace commented 8 years ago

On July 11, 2016, Connie sent out a summary of the current set of checks to be programmed in Visual Basic, based on a call conducted on July 8, 2016. These are contained in the attached Word document.

Improving the CAMP User Interface Analysis Criteria form.docx

jasmyace commented 8 years ago

Some comments on the Word document:

  1. The date range in number 2 of the Word document should be "> 366 days." We allow for the inclusion of leap days in leap years.
  2. Also in number 2 of the Word document, we don't have logic that redefines the start and end dates provided by a user to the Platform. So currently, if a user selects a 388-day date range, but only a 90-day range of trapping occurred, the R code fails.
  3. I point out that we need both of numbers 5 and 6 tied to catch records. In the estimation of passage, we care only about includeCatchID = 1 records. But, the version of R code that Doug is currently testing allows, for example, the inclusion of includeCatchID = 2 records in the plot forklength over time report.
jasmyace commented 7 years ago

The work to update the Platform to preemptively identify possible trouble spots is complete. This work was completed by Pacific States.