pfmc-assessments / PacFIN.Utilities

R code to manipulate data from the PacFIN database for assessments
http://pfmc-assessments.github.io/PacFIN.Utilities

Develop smarter way to check Oregon number of sex samples by sample number #43

Open chantelwetzel-noaa opened 3 years ago

chantelwetzel-noaa commented 3 years ago

Currently, the EF1_Denominator function, called from getExpansion_1, performs an internal check on Oregon data to determine whether the number of samples by sex matches the numbers recorded in FEMALE_NUM, MALE_NUM, and UNK_NUM. However, if you do external QA/QC and remove ages or lengths that do not seem plausible, the EF1_Denominator check of these columns will fail. We either need to make the Oregon check in EF1_Denominator smarter, create a function to recalculate and overwrite MALE_NUM, FEMALE_NUM, and UNK_NUM based on the removed samples (I am not sure how this would then impact EXP_WT values), or recommend that no records be removed from the data set.
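The second option could look something like the base-R sketch below: recount the rows that remain after QA/QC for each SAMPLE_NO and overwrite the recorded sex counts. This is only a rough illustration; the helper name recalc_sex_num and the sex codes ("F", "M", "U") are assumptions from this thread, not the package's actual implementation.

```r
# Hypothetical helper: recalculate per-sample sex counts from the rows that
# remain after external QA/QC, then overwrite the recorded *_NUM columns so
# they agree with the internal tally in EF1_Denominator.
recalc_sex_num <- function(bds) {
  # force all three sex levels so absent sexes count as zero
  sex <- factor(as.character(bds$SEX), levels = c("F", "M", "U"))
  counts <- table(SAMPLE_NO = bds$SAMPLE_NO, SEX = sex)
  idx <- match(as.character(bds$SAMPLE_NO), rownames(counts))
  bds$FEMALE_NUM <- as.integer(counts[idx, "F"])
  bds$MALE_NUM   <- as.integer(counts[idx, "M"])
  bds$UNK_NUM    <- as.integer(counts[idx, "U"])
  bds
}
```

Running this after removing records would keep the NUMBER_X columns consistent with the data that are actually present, though it does not address how EXP_WT should change.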

kellijohnson-NOAA commented 3 years ago

Is it that EF1_Denominator needs to be changed, or that something else needs to be done when people remove lengths or ages? Removing records affects sample sizes and the weights of those samples, which feed into calculations later on. So, I would say that the function did its job in that it recognized this.

Maybe there needs to be a helper function for users, remove_bds, that removes a row and updates the other portions of the data set accordingly? I am just brainstorming. Would you ever want to remove a length but not the age from that lengthed fish? This is where it gets tricky, because the sample weights are used to expand the comps; if you want the length but you don't want the age, then the sample weights are wrong. This is why I keep separate _l and _a values when weighting.
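Purely for brainstorming, a remove_bds() helper might look roughly like this: drop the flagged rows and decrement the recorded per-sample sex count for each removed fish. Everything here (the function name, the column names, the sex codes) is hypothetical, and this sketch ignores the _l/_a sample-weight issue raised above.

```r
# Hypothetical remove_bds(): drop rows flagged by a logical vector `drop`
# while keeping the recorded *_NUM columns consistent, so the Oregon check
# in EF1_Denominator would still pass.
remove_bds <- function(bds, drop) {
  kept <- bds[!drop, , drop = FALSE]
  for (i in which(drop)) {
    # pick the count column matching the removed fish's sex; default to UNK
    col <- switch(as.character(bds$SEX[i]),
                  F = "FEMALE_NUM", M = "MALE_NUM", "UNK_NUM")
    same <- kept$SAMPLE_NO == bds$SAMPLE_NO[i]
    kept[same, col] <- kept[same, col] - 1
  }
  kept
}
```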

chantelwetzel-noaa commented 3 years ago

If you remove data that are from Oregon, the EF1_Denominator function stops with an error message based on the checks on lines 111-144. The check is correct, because the values in the NUMBER_X columns no longer match the internally calculated sample counts (due to the removed data). There is no easy way around this error, so the user is forced either to update the NUMBER_X values by hand for every SAMPLE_NO that fails the internal check or to retain length or age data that do not look plausible.

When I find a length or age that does not seem plausible, I opt to remove the entire record. Typically the number of records removed is a very small percentage of the total good records. I just did a check: if I instead replace both the length and the age in the "bad" records with NAs, the first-stage expansion now works. If this is a better approach, we may want to clarify the guidance. If there are particular concerns with replacing both the length and age with NAs due to how the expansions are calculated, we can revisit this option.
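The NA approach described above amounts to something like the tiny sketch below; lengthcm and age are placeholder column names, not necessarily the real ones, and flag_bds is a made-up name.

```r
# Sketch of the NA approach: blank out implausible lengths and ages instead
# of dropping rows, so row counts (and therefore the *_NUM columns) stay
# consistent with what EF1_Denominator tallies internally.
flag_bds <- function(bds, bad) {
  bds$lengthcm[bad] <- NA
  bds$age[bad] <- NA
  bds
}
```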

kellijohnson-NOAA commented 3 years ago

Putting in NA would expand the sample using the weight of a fish that has an NA, but that fish would not actually provide any information to the composition. So your final composition would essentially be "upweighted" based on the amount of information you retain for that sample relative to other samples.
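A toy numeric illustration of this upweighting concern, with invented numbers: if a sample's expansion weight ends up spread over only its non-NA fish, blanking two of ten fish inflates each remaining fish's contribution.

```r
# Invented numbers, for illustration only: the sample's expansion weight is
# unchanged, but fewer fish now carry it, so each remaining fish is upweighted.
sample_wt <- 10                                  # expansion weight for the sample
n_fish <- 10                                     # fish originally in the sample
n_na <- 2                                        # lengths/ages replaced with NA
per_fish_before <- sample_wt / n_fish            # 1.00 per fish
per_fish_after  <- sample_wt / (n_fish - n_na)   # 1.25 per fish
```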


chantelwetzel-noaa commented 3 years ago

Hmm. If the number of lengths and ages being replaced with NA is small relative to the total samples by species, the impact of "upweighting" would likely be small (assuming you are not removing a bunch of lengths/ages from a single SAMPLE_NO), but conversely the impact of leaving in these "bad" records would also be minimal. I need to think a bit more about how to treat this type of data. Thanks for all the input @kellijohnson-NOAA