Add 'raw.count' and 'inflated.count' columns to CSV files

tmcd82070 commented 9 years ago

This is a good idea because it allows biologists to validate their data.

jasmyace commented 9 years ago

Doug: "After thinking about things more and talking to Connie, we would like to see 3 catch columns in the daily CSV files. Having the 3 types of catch in the daily CSV files will make it easier to validate the outputs from the R code.

I suggested yesterday on the phone that you have an “Imputed catch” label, but that is not really accurate.

The 3 columns we would like to see in the daily CSV files are: “Unmarked catch”, “Imputed catch”, and “Total catch”.

The columns reflect the following:

Unmarked catch: the raw, unadjusted catch that only includes salmon that were not marked in some manner. Connie’s SQL produces these data, and they are already in the R code somewhere. This column is not currently in the daily CSV files, but should be added.

Imputed catch: the catch that is imputed by the R code and adjusts for days/times not fished. Trent has already developed the code that creates the imputed catch. This column is not currently in the daily CSV files, but should be added.

Total catch: the value that gets expanded by the estimated daily trap efficiency. Total catch is the sum of Unmarked catch and Imputed catch. I strongly suspect, but don’t know for sure, that this column is currently labeled “Catch” in the daily CSV files. The “Catch” label should be changed to “Total catch”.

After you have the 3 catch columns in the daily CSVs, I will randomly select 25-50 raw data sheets to make sure the Unmarked catch column is presenting the correct numbers. The Total catch column is easy to check by adding the Unmarked catch to the Imputed catch. I can’t validate the Imputed catch because those #s are coming from a GAM, and the best I can do is a visual check to see if the numbers seem to be about right given the unmarked catch on adjoining days when catch was not imputed."

jasmyace commented 9 years ago

This has been completed.

Previously, column 'Catch' held the sum of the observed unmarked catch and the imputed (estimated) catch. Note that a given day's value may be entirely imputed, entirely observed, or a mix of the two; a mix arises when, for example, a day should have two or more catch observations and one of them was missed.

The updated file, for now, keeps the original 'Catch' column. It also includes a few new columns. The first, rawCatch, contains the observed unmarked catch, summed over all fish observed in that particular day's catch. Because these are counts, this variable is always an integer. The values come from the Access query-result table TempSumUnmarkedByTrap_Run_Final, as the sum of variable Unmarked after grouping by trapVisitID and SampleDate. Note that within each group, the rows being summed are, in general, the counts associated with different fish lengths.
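As a rough illustration of that grouping step, here is a minimal base R sketch. The data frame below is a made-up stand-in for rows of TempSumUnmarkedByTrap_Run_Final, and the forkLength column name is an assumption; only Unmarked, trapVisitID, and SampleDate come from the description above.

```r
# Hypothetical stand-in for rows of TempSumUnmarkedByTrap_Run_Final.
# Each row is the count of unmarked fish at one fish length.
catch <- data.frame(
  trapVisitID = c(101, 101, 101, 102, 102),
  SampleDate  = as.Date(c("2014-03-01", "2014-03-01", "2014-03-01",
                          "2014-03-02", "2014-03-02")),
  forkLength  = c(35, 36, 40, 35, 38),   # assumed column name
  Unmarked    = c(4L, 2L, 1L, 3L, 5L)
)

# Sum Unmarked within each trapVisitID / SampleDate group.
rawCatch <- aggregate(Unmarked ~ trapVisitID + SampleDate,
                      data = catch, FUN = sum)
names(rawCatch)[names(rawCatch) == "Unmarked"] <- "rawCatch"

print(rawCatch)
```

Because the inputs are integer counts, the summed rawCatch column stays an integer.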

A second column is ImputedCatch. An easy way to obtain ImputedCatch would be to take the previous Catch and subtract the now-available rawCatch, but I think this is a little lazy, and potentially sloppy, as it assumes the resulting difference actually is the desired imputed catch. Instead, I obtained the imputed counts by modifying function F.plot.catch.model to output a dataframe of imputed counts from within the repeat loop that begins on line 151. That loop runs over the unique set of catches and evaluates whether the sample minutes is too great, i.e., whether enough time has passed that an observed sample count is missing and imputation is required. The resulting dataframe is entitled jason.new (new because pre-existing dataframe new houses the predicted values [along with other output], and jason just because). It contains variables batchDate, trapVisitID, trapPositionID, and n.tot (the predicted/to-be-imputed values), and is stored as a new member of the summary list created at the end of the function, around line 332. It is thus available for later use in output list df.and.fit of parent function F.est.catch, around its line 47.
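The general pattern (collect the imputed rows while looping, then return them as an extra list member) can be sketched as follows. This is not the code in F.plot.catch.model: the threshold max.ok.minutes, the stub predict_catch_stub, the sampleMinutes column, and the list members df and fit are all assumptions made for illustration. Only the jason.new name and its four variables come from the description above.

```r
# Hypothetical trap visits; sampleMinutes is the time covered by each visit.
visits <- data.frame(
  batchDate      = as.Date("2014-03-01") + 0:3,
  trapVisitID    = c(101, 102, 103, 104),
  trapPositionID = c(1, 1, 1, 1),
  sampleMinutes  = c(1440, 1440, 2880, 1440)
)
max.ok.minutes <- 1800   # assumed cutoff for "too much time has passed"

# Stand-in for a prediction from the fitted catch model (a GAM in the
# real code); here it returns a constant.
predict_catch_stub <- function(batchDate) 5.3

jason.new <- NULL
i <- 1
repeat {
  if (i > nrow(visits)) break
  if (visits$sampleMinutes[i] > max.ok.minutes) {
    # An observed count is missing here, so record an imputed value.
    jason.new <- rbind(jason.new, data.frame(
      batchDate      = visits$batchDate[i],
      trapVisitID    = visits$trapVisitID[i],
      trapPositionID = visits$trapPositionID[i],
      n.tot          = predict_catch_stub(visits$batchDate[i])
    ))
  }
  i <- i + 1
}

# Return the imputed rows alongside the other (hypothetical) outputs.
ans <- list(df = visits, fit = NULL, jason.new = jason.new)
print(ans$jason.new)
```

Keeping the imputed values in their own dataframe means ImputedCatch is reported directly from the imputation step rather than inferred by subtraction.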

A third column is totalCatch, the sum of new variables rawCatch and ImputedCatch. In theory, this value, when rounded to the nearest tenth, should match the previous column Catch.
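A small sketch of that arithmetic, using invented numbers: the three rows stand for a fully observed day, a mixed day, and a fully imputed day. The tolerance in the comparison is my own assumption, added to guard against floating-point noise.

```r
# Hypothetical daily rows; the values are made up for illustration.
daily <- data.frame(
  batchDate    = as.Date("2014-03-01") + 0:2,
  Catch        = c(7.0, 12.4, 5.3),   # old column, reported to the tenth
  rawCatch     = c(7L, 8L, 0L),       # observed unmarked counts
  ImputedCatch = c(0, 4.38, 5.31)     # model-based imputed catch
)

# New column: observed plus imputed.
daily$totalCatch <- daily$rawCatch + daily$ImputedCatch

# Compare against the old column after rounding to the nearest tenth.
matches <- abs(round(daily$totalCatch, 1) - daily$Catch) < 1e-8
print(cbind(daily, matches))
```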

A fourth column is correct. This column displays either TRUE or FALSE; a TRUE occurs when the difference between totalCatch (the new column) and Catch (the old column) is zero, meaning that everything for that line is correct. A FALSE indicates otherwise. This is coded in function F.est.passage immediately prior to the creation of the daily CSV file baseTable.csv, around line 220. This column will not be present in the final release, but is included for now to more easily demonstrate the logic used. Finally, a check has been included within function F.est.passage, around line 181, that the number of TRUE values in column correct equals the number of lines in the resulting dataframe/CSV output. A count of TRUE that is less than the number of lines indicates at least one line item for which the calculation of totalCatch (and so either [or both] of rawCatch and ImputedCatch) is wrong; when this occurs, the code is set up to stop, with a message for the user to investigate the cause.
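The flag-and-stop logic might look something like the sketch below. The helper name check_total_catch and its tol argument are hypothetical, and this is not the code in F.est.passage. It also assumes the comparison is made after rounding totalCatch to the nearest tenth, with a small tolerance rather than a test for exact zero.

```r
# Hypothetical helper: add the 'correct' flag and halt if any row fails.
check_total_catch <- function(df, tol = 1e-8) {
  df$correct <- abs(round(df$totalCatch, 1) - df$Catch) < tol
  n.bad <- nrow(df) - sum(df$correct)
  if (n.bad > 0) {
    stop(n.bad, " row(s) where totalCatch does not match Catch; ",
         "investigate rawCatch and ImputedCatch.")
  }
  df
}

# Made-up rows in which every total agrees with the old Catch column.
base.table <- data.frame(
  Catch      = c(7.0, 12.4, 5.3),
  totalCatch = c(7, 12.38, 5.31)
)
checked <- check_total_catch(base.table)
print(checked)
```

Calling the helper on a table with a mismatched row raises an error instead of returning, so a bad baseTable.csv would not be written silently.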

jasmyace commented 8 years ago

Completed as part of May 15, 2015 update.

tmcd82070 / CAMP_RST

Add 'raw.count' and 'inflated.count' columns to CSV files #19