pbs-assess / pacea

An R package to house Pacific Region ecosystem data to help facilitate an ecosystem approach to fisheries.
Other
14 stars 0 forks source link

Add a criteria for how much data there should be when calculating daily average SST from high resolution buoy data. #26

Open andrew-edwards opened 1 year ago

andrew-edwards commented 1 year ago
andrew-edwards commented 10 months ago
andrew-edwards commented 8 months ago

Well, just using days for which we have at least one hourly value buoy sst value every two hours, is way more fiddly than expected.

I'm going with this algorithm (for each buoy), which will be the same for 10-minute resolution data or hourly resolution data:

  1. Make a 'grid' of two-hour intervals from the first to last date of sst data.
  2. Each time of an sst measurement has to be in exactly one interval of the grid.
  3. For each two-hour interval, take the mean of measured sst within that interval. This will work for hourly and 10-minute data. sst shouldn't change dramatically within a two-hour interval, so it should be fine if occasionally, say, there are 10-minute values just in the first 25 minutes of a two-hour interval - no big deal.
  4. Then for each day, take the daily mean of the two-hour sst means.
  5. Keep track of how many two-hour values there are in each day, and discard any daily means below a cut off value as being unreliable. Am thinking if there are less than 10 two-hour values available (out of the possible 12=24/2), then do not provide a mean daily sst for that day. Can maybe test sensitivity to that value of 10.

Any thoughts? @schckngs ? @travistai2 ?

andrew-edwards commented 8 months ago
travistai2 commented 8 months ago

That makes sense. I guess you would essentially be 'binning' the values by 2-hour intervals and finding the 2 hour mean. Then filtering out days with less than 10 values to estimate the daily mean?

andrew-edwards commented 8 months ago

Done all the above (by the commit below), and went with excluding any days with daily fluctuations >5 degC.

andrew-edwards commented 8 months ago

Comparing the buoys vignette with the version before recent commits (saved locally as buoys-2023-08-23-before-two-hour-5degC.html), the above refinements have actually removed some outlier-ish values (or ones that were the max/min for particular days), and maybe more. Also leaves a few more gaps in data for C46303 and C46304, and omits most of 2023 (thus far) for C46145 - could dig into if needed. Maybe relax the two-hour criteria for the latter, as the original 2023 line looked okay, though did fluctuate somewhat. Leave as a possible for the future.

andrew-edwards commented 8 months ago

Collating emails here: Andy: Currently the flags we use are only for the earlier data, not the more recent Ocean Protection Plan data stream that started in 2019. Originally I wasn't using any flags in that data (following Andrea again), but looking at Kellogg et al. made me think that maybe we should, and the flags might help remove the spurious early big fluctuations in temperature that we looked at at IOS.

But, the only flags in the data are these:

> table(opp_data_raw$avg_sea_sfc_temp_pst10mts_qa_summary)

    -10      -1       0      20     100 
  51257    3363     789   37167 1241984

Do you know what these codes mean (-10, -1 etc.; 51257 is the count)?

They are different codes from Kellogg et al. Presumably 100 is good (and is the value for 92.5% of the 10-minute measurement).

The other flags (avg_sea_sfc_temp_pst10mts_data_flag , avg_sea_sfc_temp_pst10mts_1_data_flag, and avg_sea_sfc_temp_pst10mts_1_qa_summary) are all NA's.

Maybe I should try just keeping the 100's for now and see if that helps (excluding the spurious looking data), but something a bit more concrete would help.

andrew-edwards commented 8 months ago

Andrea @schckngs:

Yes, it’s too bad the OPP data stream doesn’t have this! Maybe this is helpful – I plotted 2022 using the “Hakai q’c’d” records

DFO_MEDS_BUOYS_2022

as well as the ECCC flags .

ECCC_MSC_BUOYS_2022

You can see that the value 100 is likely their value for “good”, 20 probably means “suspicious” and “0” is over a threshold.

However, as we can directly compare this to the Hakai flags, what is flagged as 20 is generally “good” in their dataset. Similarly, looking at C46181, it looks like they have capped the temperature values at 20 C for that buoy which gives it a flag of “0”.

So long story short – I completely ignore the ECCC flags for these reasons! Please feel free to double-check my plots, it also seems that ECCC removes some of the very “jumpy” values (e.g. compare C46206). Basically my process is to use the Hakai q/c’d data as much as possible, using their flags, then “fill in” with ECCC where the other dataset does not exist.

andrew-edwards commented 8 months ago

Andrea: The codes info would be on the ECCC site somewhere (if at all) as they get the data directly from there. Just confirming, I looked through my emails from a year ago – I had emailed someone at ECCC about this as well as this weird data spike issue [see plot below, and Andy's plots in data-raw/buoys/buoy-sst.R], see the consistently “higher” points once per day but I never got a response. We could try emailing ECCC again. ECCC_MSC_BUOYS_C46303

Andy: spurious data for C46304 is end of Sept 2019: image Interestingly, the daily average looks okay, because these extremes kind of get averaged out.

So, just keeping the qa_summary = 100 values (i.e. the best quality only) does NOT get rid of the problem with the wild swings in temperature at the start of buoy 46304 (likely the buoy measuring air temperature; the plot in my previous email does not change). So whatever the flags for these data do actually mean, they don't help.

So we have to come up with a criteria – if the daily range of a buoy is > X then exclude that day. So what would X be?

Here's a histogram of daily ranges, which looks as expected (units are degC):

image

So we want to exclude the right-hand tail above X. The counts in each bin are:

midpoints counts 0.5 7375 1.5 2333 2.5 1158 3.5 480 4.5 189 5.5 61 6.5 27 7.5 9 8.5 5 9.5 0 10.5 1

Question: any intuition as to what X should be? Maybe I'll set it slightly arbitrarily to 5 for now, and then can easily change it.

Thanks for all this, bit of a stream of consciousness I know. Also emphasises why no-one is yet using such data in their stock assessments!

andrew-edwards commented 8 months ago

Andrea: I feel like a middle ground on the q/c front might be good – for example the spurious values you point out at the beginning of that buoy record are ~8-10 degrees change over the course of a day. However, it might be better to bug ECCC directly about it, or set the start date for that buoy to be later, so that we don’t miss real events that may manifest large temp fluctuations (like storms or heatwaves). Similarly those weird “bumps” once a day I’ll revive the thread with them about, as that impacts all downstream users of this data.

The code for the plots was off the top of my head from a while ago, but something like this should work:

library(dplyr)
library(ggplot2)
library(lubridate)

sstdata %>%
    ggplot(aes(x = time, y = SSTP, colour = as.factor(flaglayer))) +
    geom_point() +
    facet_wrap(~STN_ID)

Charles: How about inventing a new flag with a name like ‘deployment start-up’ And then just flag the annoying part at the beginning.

Andy: Thanks – I did start trying the manual flagging, but it can't really be automated, and seems to happen more than once (one buoy had a short period with no data, then fluctuations again). So I've stuck with the 5degC cut-off for now.

Interestingly, even for that wild 10degC swing in a day, when the temps are averaged over the day, the average is pretty much the same as the daily average a few days later (for days that did not have wild swings); i.e. the wild swing during the day kind of got averaged out anyway, and it doesn't look like a weird outlier when you look at just the daily averages.

Maybe that was luck in that the average of the daily air temperature (if our guessing is correct and the sensor was on the deck of a ship) was pretty close to the average SST anyway.

andrew-edwards commented 8 months ago

Andrea: Hey folks, here are a couple thoughts / plots to add to the convo.

I feel 5C is too small – however if you’d like to pursue making this flag, maybe increasing the range or making it dependent on day of year / location would work. I understand that the SST dataset in this package should be more accurate than what we have on our monitoring plots as it will be used with further analysis/stock assessment.

Attached are some quick plots of mean daily temp coloured by daily temperature range (Days > 5 C range circled in black in time series), as well as a range-vs-mean coloured by the day of year. We can see (as we’d expect) it’s seasonal and location dependent. For example, there are no records with daily ranges outside of 5 degrees at West Moresby(C46208), whereas many days would be likely erroneously removed each summer for Sentry Shoal (C46131) and Halibut Bank (C46146). Buoy_temperature_ranges

Buoy_temperature_ranges_doy

andrew-edwards commented 8 months ago

Also see #33