revise commercial comp data

iantaylor-NOAA commented 3 years ago

I just noticed that the length comp data from PacFIN look horribly spiky (as in the figure below), suggesting an issue with the expansions. https://github.com/iantaylor-NOAA/Lingcod_2021/blob/main/data-raw/lingcod_PacFIN_BDS.R

Thankfully @chantelwetzel-noaa is taking a look at what might have gone wrong.

chantelwetzel-noaa commented 3 years ago

Ultimately, the issue with the spiky comps appears to be related to a bimodality in the final sample size calculations within a year: the quantiles of the final sample sizes are generally low values, but within each year there are also final sample sizes above the 75% quantile:

boxplot_final_expansion_lengths

In comparison, looking at Dover sole, a well-sampled species, the range of final sample sizes per year looks like this:

dover_length_final_expansion

The range of expansion weights for lingcod did not appear to be greater than that for Dover sole (95% quantile of 1st or 2nd stage expansion), but lingcod did appear to have more values at the extremes. As an example, in the 2nd stage expansion there were 33,572 expansions equal to 1.0 and nearly 6,000 at values > 30. Calculating the resulting final sample size (exp_1 * exp_2), there are a total of 6,782 records with a final sample size of 1.0 out of 93,678 records. In comparison, Dover sole had only 104 records with a final sample size of 1.0 out of 227,326. This divergence is likely due to the higher probability of a low number of lingcod being observed in a trip, but also to missing values needed in the 1st and 2nd stage expansion process. A large number of Washington records (5,025) had NAs in both the EXP_WT and the RWT_LBS columns. In this case, the 1st stage expansion is set to 1.0:

PacFIN_exp1

There are likely more factors causing select length bins by year to be expanded heavily relative to other lengths in the same year, but this is a starting point for deeper investigations to be done later.
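To make the arithmetic above concrete, here is a minimal Python sketch. It is not the PacFIN.Utilities code; the function name and the handling of missing values are assumptions for illustration only. It computes a final sample size as exp_1 * exp_2, falls back to 1.0 when the 1st stage expansion is missing, and works out the share of records with a final sample size of 1.0 from the counts quoted above.

```python
import math


def final_sample_size(exp_1, exp_2):
    """Final sample size as the product of the two expansion stages.

    Assumption for illustration: a missing (None or NaN) 1st stage
    expansion is replaced by 1.0, mirroring the fallback described above.
    """
    if exp_1 is None or math.isnan(exp_1):
        exp_1 = 1.0
    return exp_1 * exp_2


# Share of records with a final sample size of 1.0, using the counts above.
lingcod_share = 6782 / 93678
dover_share = 104 / 227326

print(f"lingcod:    {lingcod_share:.2%}")
print(f"Dover sole: {dover_share:.3%}")

# A record with a missing 1st stage and a 2nd stage of 1.0 ends up at 1.0.
print(final_sample_size(None, 1.0))
```

By these counts, roughly 7% of lingcod records have a final sample size of 1.0, compared with well under 0.1% for Dover sole.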

iantaylor-NOAA commented 3 years ago

Sorry for the very slow turn-around on this pressing issue.

Results of some investigations

It turns out that the very spiky patterns which caught my eye were not due to the expansions at all, but rather to the choice we made to include composition vectors of unsexed fish as well as sexed fish, rather than excluding the unsexed fish or applying a sex ratio to include them with the sexed fish.

The following plots show the length comps for the commercial fixed-gear fishery in the north model

all comp vectors included (expanded at left, unexpanded at right)

input N (number of trips) > 10 for all comp vectors (expanded at left, unexpanded at right)

Comparison above and below (with and without filtering for N > 10) leads me to conclude that the vast majority of the jaggedness is due to the vectors of unsexed fish with small sample sizes. That is, if you only sample 2 unsexed fish in a year, you get 2 spikes, regardless of the expansion factor. With those spikes present, the ylim goes to the default max of 0.4 and the differences in the distributions of sexed fish are harder to see.
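The point about two fish giving two spikes can be checked with a toy calculation. The sketch below is in Python, with a hypothetical helper and made-up bin indices and expansion weights. It normalizes a composition vector built from two unsexed fish, once unexpanded and once with very different expansion weights. Either way only two bins are non-zero, and the largest proportion is at least 0.5, well above the 0.4 default upper limit mentioned above.

```python
def comp_proportions(fish_bins, weights, n_bins):
    """Sum the (optionally expanded) weight of each fish into its
    length bin, then normalize the bins so they sum to 1."""
    totals = [0.0] * n_bins
    for bin_index, weight in zip(fish_bins, weights):
        totals[bin_index] += weight
    total = sum(totals)
    return [t / total for t in totals]


# Two unsexed fish in hypothetical length bins 3 and 7 (out of 10 bins).
fish_bins = [3, 7]

# Unexpanded: each fish counts once.
unexpanded = comp_proportions(fish_bins, [1.0, 1.0], n_bins=10)

# Expanded: made-up expansion weights of 1 and 30.
expanded = comp_proportions(fish_bins, [1.0, 30.0], n_bins=10)

for label, comp in [("unexpanded", unexpanded), ("expanded", expanded)]:
    nonzero = sum(1 for p in comp if p > 0)
    print(label, "non-zero bins:", nonzero, "max proportion:", round(max(comp), 3))
```

The expansion changes the relative height of the two spikes but cannot add bins, so the vector stays spiky in both cases.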

However, comparison left to right (expanded vs. unexpanded) in the filtered data (lower pair above) leads me to conclude that the expansion is indeed adding more subtle jaggedness (e.g. the males in 1996 or the spike of large females in 1998). This makes me think that we're better off with unexpanded data.

As for filtering the small sample size vectors, here's one more view showing all length comps in the bottom trawl fishery in the north model as bubbles.

unexpanded comps: all comp vectors at left, filtered for N > 10 on the right

This leads me to conclude that the data from the 1960s, mostly unsexed fish, is still adequate to show a valuable signal about a large cohort moving through the population in spite of the N < 10. Also, a bunch of small unsexed fish in 2009 got filtered out from the right-hand plot, resulting in a loss of useful information about the 2008 cohort. So that cutoff seems too strict.
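For anyone wanting to see exactly what a cutoff like this removes, here is a small Python sketch. The records, field names, and N values are invented for illustration and are not taken from the lingcod data. It applies the N > 10 filter and lists the vectors that get dropped, so they can be checked by eye for signal such as the 1960s and 2009 unsexed comps discussed above.

```python
def split_by_input_n(vectors, min_n=10):
    """Split comp vectors into kept (input N > min_n) and dropped."""
    kept = [v for v in vectors if v["input_n"] > min_n]
    dropped = [v for v in vectors if v["input_n"] <= min_n]
    return kept, dropped


# Made-up comp vectors: one record per year and sex category.
vectors = [
    {"year": 1965, "sex": "unsexed", "input_n": 6},
    {"year": 1998, "sex": "sexed", "input_n": 42},
    {"year": 2009, "sex": "unsexed", "input_n": 9},
    {"year": 2009, "sex": "sexed", "input_n": 10},
]

kept, dropped = split_by_input_n(vectors)

print("kept:   ", [(v["year"], v["sex"]) for v in kept])
print("dropped:", [(v["year"], v["sex"]) for v in dropped])
```

Note that the strict inequality also drops a vector with N exactly equal to 10, which is one more way a hard cutoff can discard usable samples.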

Conclusions

I've been staring at all this for too long, but here's what I propose:

1. Use unexpanded comps.
2. Include all vectors, including those with small sample sizes, as any cutoff may remove useful samples.
   - Finding a cutoff that removes noise while keeping the signal seems challenging. In hindsight, I might support using a sex ratio to assign the unsexed fish to female and male vectors in years where unsexed fish are a small fraction of the total, but it's too late for that now.
   - Luckily the small sample size vectors seem to have very little impact on the Francis weighting, judging from the almost equal suggested weights for models with and without filtering (and with and without expansions). The small sample size vectors may impact the D-M weighting, leading to the estimates close to 100% weight for many fleets, but if so, sensitivity to data filtering seems like a bad property and we're better off with Francis anyway.
3. Tinker with the visualization to make it easier to see what's going on without being distracted by the small samples of unsexed fish.
   - This can include using r4ss::SS_plots(..., comp.yupper = 0.25) for the length comp plots so that the flatter distributions associated with higher sample sizes are easier to see.
   - It could also include providing new controls for the transparency of the three distributions so that the unsexed fish could fade away a little bit.
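As a rough illustration of the sex-ratio idea mentioned in the list above, the Python sketch below shows one way unsexed counts per length bin could be allocated to the female and male vectors. This is not what PacFIN.Utilities does. The function name, the fixed 50:50 ratio, and the counts are all assumptions for illustration; in practice the ratio might vary by length or year.

```python
def split_unsexed(female, male, unsexed, frac_female=0.5):
    """Allocate unsexed counts in each length bin to females and males.

    frac_female is an assumed, constant fraction of unsexed fish treated
    as female; the remainder is treated as male.
    """
    if not 0.0 <= frac_female <= 1.0:
        raise ValueError("frac_female must be between 0 and 1")
    new_female = [f + u * frac_female for f, u in zip(female, unsexed)]
    new_male = [m + u * (1.0 - frac_female) for m, u in zip(male, unsexed)]
    return new_female, new_male


# Made-up counts per length bin for one year.
female = [4, 6, 0]
male = [2, 8, 0]
unsexed = [0, 2, 2]

new_female, new_male = split_unsexed(female, male, unsexed)
print("female:", new_female)
print("male:  ", new_male)

# The total number of fish is unchanged by the reallocation.
print(sum(new_female) + sum(new_male) == sum(female) + sum(male) + sum(unsexed))
```

With this approach the unsexed vector disappears and its fish show up as small additions to the sexed vectors, rather than as separate spikes.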

I could be swayed to go in a different direction on any of these points, so please comment with any guidance.

Models

I'll upload model files in a minute so that others can look in more depth if they wish. The model IDs for the 4 cases discussed above are

2021.[area].004.003_new_data_fix1 = expanded with all data
2021.[area].004.004_new_data_fix2 = unexpanded with all data
2021.[area].004.005_new_data_fix1 = expanded with filtered data
2021.[area].004.006_new_data_fix2 = unexpanded with filtered data

If we go with my proposal above, we would use the data file from model 004.004_new_data_fix2 = unexpanded with all data. These choices probably matter more for the north than the south because the south model is dominated by rec fleets, not commercial (although the south has lots of small sample size comps).

I'll get back to selectivity blocks (#58) and other control file changes (#59) in the morning, which combined with whatever we choose for the commercial comp data will hopefully (finally) get us a decent starting point for further model exploration.

brianlangseth-NOAA commented 3 years ago

@iantaylor-NOAA If it really only matters visually and not from a modeling standpoint (i.e. weighting isn't affected, and assuming selectivity isn't affected either), why don't we just keep it as is? We could present the composition data without unsexed comps as a visual cue for just males and females if someone asks. That would seem to be the path of least resistance.

iantaylor-NOAA commented 3 years ago

Good point @brianlangseth-NOAA. I suspect that I was biased toward getting SOMETHING out of the work invested in this exploration, but indeed it's unlikely to be a big deal and now easy to do a sensitivity to the alternative(s) (tagging #43 so this shows up there).

@kellijohnson-NOAA, what do you think?

kellijohnson-NOAA commented 3 years ago

I like the idea of keeping the information in an unexpanded comp. At least that way we go in with something where they can see the fits to the data and can tell us if they want something else done, rather than hiding stuff.

brianlangseth-NOAA commented 3 years ago

@kellijohnson-NOAA I was not suggesting we hide things. We should include the unsexed comps in the figure if we use them. My comment was that if someone questions the spikiness during the review, we have an answer for them and can then show a figure with just the sexed comps to better convey what they show.

kellijohnson-NOAA commented 3 years ago

Poor choice of words on my part, sorry if it came across that way.

iantaylor-NOAA commented 3 years ago

Putting aside the question of how to present the information about the unsexed comps, @kellijohnson-NOAA are you saying you prefer unexpanded (thus avoiding the problems with final sample sizes described by @chantelwetzel-noaa above)?

Both data files have been generated, so it's no more work one way or the other. We just need to pick something to go forward with for now, with the option to switch back later and/or run a sensitivity to the other option.

kellijohnson-NOAA commented 3 years ago

I prefer the unexpanded, unfiltered option. I don't think that PacFIN.Utilities was designed to handle M, F, and unsexed, because previously all of the unsexed fish were thrown out if they were not partitioned to a sex using a ratio.

melissahaltuch-NOAA commented 3 years ago

Another option that we've used in the past is to just use the early comps, where the data are mostly unsexed, as an unsexed comp. Then use the male/female comps where the data are sexed and there are few unsexed fish, which can be partitioned using a sex ratio.


iantaylor-NOAA commented 3 years ago

Thanks everyone for the feedback.

The separate treatment of early and late as suggested by @melissahaltuch-NOAA is reasonable but would be more work at this stage than just moving forward with the unexpanded, unfiltered data. For now I think we can move forward with the unexpanded, unfiltered data as represented in models 004.004 and 004.007 (with WA rec CPUE update).

We can revisit this as a sensitivity if we wish (#43).
