SOW 2 - Task 2.8: Modify Plot Fork Length Report

jasmyace commented 9 years ago

For the "Plot fork length through season" report, revise the R code so: (1) fork length data from traps where the "Include data in analyses?" = "No" and "Yes" is included in the CSV and the PNG, and (2) the X-axis label is changed from a format of Feb14 to 01Feb14.

jasmyace commented 8 years ago

I modified a line of code in order to make the x-axis reflect the dates as requested.

I can't figure out what item (1) refers to above. Is it referencing the includeCatchID field?

Once I complete (1), I'll throw this program to the big looper program, in order to make sure its update(s) perform as expected.

This enhancement (of item (2)) involved manipulation of line 103 in program size_by_date.r. I've included an example, where it's clear a preceding date is included with the month and year along the x-axis.

week_american_testing_2013-10-01_2014-09-29_size_by_date

dougthreloff commented 8 years ago

Jason:

Here are my notes that led to Task 2.8: The R code is pulling the correct data from the CAMP.mdb and producing accurate CSVs and PNGs BUT fork length data from traps where the “Include data in analyses?” = “No” is not being included in the CSV or PNG.

I therefore suggested that the R code producing the Plot Fork Length Report outputs be modified to include fork length data from traps where the “Include data in analyses?” = “No” and “Yes”. I did that because the fork length data is unlikely to be affected by whether or not a trap is functioning properly on a day, and all of the fork length data should therefore be included in the Plot Fork Length Report CSVs and PNGs.

Doug

jasmyace commented 8 years ago

Okay, I believe I see now.

CONNIE, I need some help.

I believe what I call includeCatchID is termed "Include Catch" in your query result table TempSamplingSummary_Final.

Ignoring the date ranges in the example plot I put before in this Issue, I investigated passage for a query on the American from 1/9/2015 through 5/29/2015. During this time period, I see that Trap Visit ID 738 on 2/21/2015 was thrown out via inspection of TempSamplingSummary_Final. I see there were 1,019 fish during this trapping instance. Subsequently, I can also see in query table result TempSumUnmarkedByTrap_Run_Final that the period of time associated with Trap Visit ID 738 ends up getting labeled as TrapStatus Not fishing. I believe that currently, these 1,019 fish are not included in the plot, but Doug wants them added.

Where can I pull the summary forklength information for an instance like this so it can be included in a plot like that depicted above (in the online version of this Issue)? Do we have query results that itemize forkLength for samples that are thrown out? Is it ever the case that the biologists, knowing they're going to throw out data, don't bother to do the measuring, because it may not matter anyway?

ConnieShannon commented 8 years ago

All of the data I have summarized for you does NOT include data selected to be excluded for analysis.

If some of the data should be included for a specific purpose then I can either provide a new set of sql. How soon do you need this?

Question: Does this mean that the number of measured fish used for analysis and reported in the passage_table.csv? If so, we will need to provide information so that the client understands why the numbers of measured fish in the passage_table.csv are different from those in the Plot Fork Length Report.

jasmyace commented 8 years ago

This is what I inferred.

I think we should do NOTHING that modifies the current queries for catch. They work as expected, and generalizing them to add in more data, etc. just seems silly, as we don't really know how extra data might subtly change what the queries currently return. Any updates for this Task (and Task 2.9) should not modify passage at all.

I'm hoping that, alternatively, we can modify the current SQL to summarize fish over bad catches. If possible, we could then have a TempSumUnmarkedByTrap_Run_Final as we always have had for good catches, along with a TempSumUnmarkedByTrapBad_Run_Final (or whatever), in the same columnar structure, for bad catches. I could then stack (UNION in SQL-speak) the two results, summarize over days and forkLengths, etc., and call it a day.

I think it would be better to keep the good and bad catches results as separate queries, as those results could then be used by others to compare characteristics of good versus bad catches, etc.

In the next couple of weeks would be good, although the timing is awkward now, with the holidays.

dougthreloff commented 8 years ago

Here are Doug’s notes when he was summarizing his thoughts in regards to Task 2.8:

“The R code is pulling the correct data from the CAMP.mdb and producing accurate CSVs and PNGs BUT

1) fork length data from traps where the Include data in analyses? = No is not being included in the CSV or PNG.

2) the R code is including salmon that were not randomly sampled in the CSV and PNG. This could create a misleading result because unusually large, rare salmon that were initially nonrandomly sampled and classified as spring-run salmon, but were later reclassified as fall-run salmon, are being included in the R outputs. The R PNG therefore must be interpreted as providing a plot of the size range of the salmon run, and not a weighted average of the frequency of the size of the salmon caught. Long-term, it may be more accurate to revise the R code so it is only using data from randomly selected salmon to develop the CSV and PNGs. It may also be wise to have a discussion of whether fork length data from trap where trapping problems occurred should be included in the R outputs.”

I don’ think we should change Connie’s SQL or the R code in a way that affects the production estimates, i.e., only the fish in the Include data in analyses? = Yes should be used to calculate production. It seems to me that the issue of plotting a fork length is different than calculating production, which is why I was advocating that for the fork length plots, salmon where the Include data in analyses? = Yes and No be used. I suspect that the fork lengths of salmon aren’t affected by whether or not a trap is functioning properly, which is why I advocated that Include data in analyses? = Yes and No be used.

I called Jason on December 23 2015 to clarify what he was advocating in his paragraph that begins “I'm hoping that, alternatively, we can modify the current SQL to summarize fish over bad catches.” And I agree with what he is suggesting to do, provided that Connie can give him the SQL and outputs he is asking for. To streamline the development of that SQL, I think it would be best if Jason and Connie get on the phone to discuss that SQL so there is clarity and efficiency on what each of them will do and can expect. There should then be a post from Connie to GitHub explain what she gives Jason.

In my post above, I mention issues with plotting fork lengths for random vs non-randomly sampled salmon. I can see advantages in either case. For now, I think we can continue to plot fork lengths for random and non-randomly sampled salmon. I will need, however, to explain to biologists why they may see a few outliers with unusually large fish; these salmon were likely to be non-randomly salmon that had their final runs changed (e.g., At capture run = Spring that were changed to Final run = Fall).

Thanks.

Doug

ConnieShannon commented 8 years ago

Your plan looks good. I will develop a separate output for catch that was excluded from analysis. Are you going to need only the measured fish? Will you be processing the unmeasured catch and assigning run and life stage?

ConnieShannon commented 8 years ago

A drawback to this plan will require that additional temp tables be placed into each working database or the R code will fail during analysis. This is because we have decided not to fix the issue of missing temp tables but chose instead to add them each time. If my count is accurate there are six databases plus the new American River historical databases that will need to have tables imported.

ConnieShannon commented 8 years ago

Monkey wrench... At times data are excluded from analysis because there is a problem that can't be fixed. For example during data migration or QC a trap visit record with catch may be identified as having the wrong trap position, year, or gear code. The program knows enough about the data to exclude the record from analysis but they don't delete the record because they may figure out how to correct it in the future.

If we want to use the FL data from only some of the excluded samples, then the rules for using the Include Catch field must be set and records that have been excluded because of data errors should be removed.

I don't see this as being too much of a problem for the current and future data. It is something that will be more of an issue in the older migrated data unless we just search out and remove these types of records prior to using the new R analysis.

ConnieShannon commented 8 years ago

The sql and documentation to summarize excluded catch for the length frequency tables is attached. I've recycled two workup temp tables used earlier to develop TempSumUnmarkedByTrap_Run_final. As a result the new sql series requires only one new table named TempSumUnmarkedByTrap_Run_X_final be added to the CAMP.mdb.

The BuildReportCriteria and BuildReportCriteria_Releases as well as the sampling summary series do not need to be run first so long as this new series is run following the series to create TempSumUnmarkedByTrap_Run_final (documented in SQL for developing unmarked Chinook by run and life stage_19May2014.txt). Data in the final tables is not modified and can be re-used for other purposes.

Since all of the fields are the same as in table TempSumUnmarkedByTrap_Run_final, that table can be copied and renamed before running the code to ensure that the code will find an existing table TempSumUnmarkedByTrap_Run_X_final.

The output includes all unmarked juvenile Chinook excluded from analysis including unmeasured fish that haven't been assigned a run or a life stage (plus counts). If any adjustments need to be made I will be here Monday.

I understand that you must modify my sql before using it in the R application. Please let me know if I can help with this process. I have a database that I'm putting the sql into so that you can see it working inside access. I think the access query wizard will help you visualize the sql. I should be finished with the mdb Monday and will post it for you as a zipped file here.

SQL for developing unmarked Chinook by run and life stage EXCLUDED FISH_23Dec2015.txt

jasmyace commented 8 years ago

I was able to successfully R-modify Connie's query, and incorporate it into this analysis. The query for non-valid catch undergoes the same sequence as applied to valid catch. This means that the plus-count algorithm divvies up Unassigned fish as it always has.

I did make one fundamental change to Connie's query: in query 3, I changed INTO TempSumUnmarkedByTrap_Run_Final to INTO TempSumUnmarkedByTrap_Run_X_Final.

Here is the updated graphical png of the example presented earlier in this Issue. You can see that the two empty data periods around 10 Jan and 03 Feb have been filled in with data. Note that also, one parr was found late in the season.

new_lar_2016-01-05_16-25-23_size_by_date

Finally, I did make one change to one of the two csvs output via this report -- I added a column, entitled fishing.status. This reports which of the fishing instances were from valid fishing, and those that were from non-valid fishing, i.e., the ones for which Connie's new query added data. This can be deleted / amended if desired.

To check things out, I ran this report over the big-looper; the entirety of its completed output can be found in the (new?) For Doug folder on the hidden FTP site. Let me know if you have any other questions, or if something needs to be modified.

Also let me know if it seems like some files don't make it over -- the little window is telling me it will take 80 minutes to transfer the files. This is a lot less output than I have moved before, so maybe the connection is poor.

ConnieShannon commented 8 years ago

Jason, you were right the sql should be "INTO TempSumUnmarkedByTrap_Run_X_Final". The corrected sql document is attached. SQL for developing unmarked Chinook by run and life stage EXCLUDED FISH_23Dec2015.txt

dougthreloff commented 8 years ago

Jason:

I have looked at the posts regarding Task 2.8. If I understand what you have done, it appears you have largely addressed this task. This will be a task where I will need to get the revised R code to make some final validation checks, however. So I may revisit this issue again if I find a bug, but I don’t expect to find any.

You have a post where you state “Finally, I did make one change to one of the two csvs output via this report -- I added a column, entitled fishing.status. This reports which of the fishing instances were from valid fishing, and those that were from non-valid fishing, i.e., the ones for which Connie's new query added data.” I think you have a good idea, but I would ask that you change the two category names you created to reflect one of the names in the Plot Weekly Effort Over Time report [these were: (a) Trap(s) were not fished, (b) Fishing was successful (i.e., traps were deployed and the “Include Data in Analyses?” field =Yes), and (c) Fishing was unsuccessful (i.e., traps were deployed and the “Include Data in Analyses?” field =No).] Implementing that suggestion will make the terminology more consistent in the platform. In this case, the new CSV field you created would have an entry of “Fishing was unsuccessful” in cases where fork lengths had “Include Data in Analyses?” field =No.

Other than the change in the paragraph above, I think you are done with this task.

Thanks.

Doug

jasmyace commented 8 years ago

[content moved to Task 2.9]

jasmyace commented 8 years ago

I have tested this code, via the Big Looper and the new catch function. Note this means that this report utilizes the correction for half-cone fish. Let me know if you want something other than this.

These reports have been run for all the Big Looper test criteria, for each run, and uploaded to the For Doug folder on the ftp site.

tmcd82070 / CAMP_RST

SOW 2 - Task 2.8: Modify Plot Fork Length Report #30