Open andrew-edwards opened 1 year ago
Well, just using days for which we have at least one hourly value buoy sst value every two hours, is way more fiddly than expected.
I'm going with this algorithm (for each buoy), which will be the same for 10-minute resolution data or hourly resolution data:
Any thoughts? @schckngs ? @travistai2 ?
That makes sense. I guess you would essentially be 'binning' the values by 2-hour intervals and finding the 2 hour mean. Then filtering out days with less than 10 values to estimate the daily mean?
Done all the above (by the commit below), and went with excluding any days with daily fluctuations >5 degC.
Comparing the buoys vignette with the version before recent commits (saved locally as buoys-2023-08-23-before-two-hour-5degC.html), the above refinements have actually removed some outlier-ish values (or ones that were the max/min for particular days), and maybe more. Also leaves a few more gaps in data for C46303 and C46304, and omits most of 2023 (thus far) for C46145 - could dig into if needed. Maybe relax the two-hour criteria for the latter, as the original 2023 line looked okay, though did fluctuate somewhat. Leave as a possible for the future.
Collating emails here: Andy: Currently the flags we use are only for the earlier data, not the more recent Ocean Protection Plan data stream that started in 2019. Originally I wasn't using any flags in that data (following Andrea again), but looking at Kellogg et al. made me think that maybe we should, and the flags might help remove the spurious early big fluctuations in temperature that we looked at at IOS.
But, the only flags in the data are these:
> table(opp_data_raw$avg_sea_sfc_temp_pst10mts_qa_summary)
-10 -1 0 20 100
51257 3363 789 37167 1241984
Do you know what these codes mean (-10, -1 etc.; 51257 is the count)?
They are different codes from Kellogg et al. Presumably 100 is good (and is the value for 92.5% of the 10-minute measurement).
The other flags (avg_sea_sfc_temp_pst10mts_data_flag
, avg_sea_sfc_temp_pst10mts_1_data_flag
, and avg_sea_sfc_temp_pst10mts_1_qa_summary
) are all NA's.
Maybe I should try just keeping the 100's for now and see if that helps (excluding the spurious looking data), but something a bit more concrete would help.
Andrea @schckngs:
Yes, it’s too bad the OPP data stream doesn’t have this! Maybe this is helpful – I plotted 2022 using the “Hakai q’c’d” records
as well as the ECCC flags .
You can see that the value 100 is likely their value for “good”, 20 probably means “suspicious” and “0” is over a threshold.
However, as we can directly compare this to the Hakai flags, what is flagged as 20 is generally “good” in their dataset. Similarly, looking at C46181, it looks like they have capped the temperature values at 20 C for that buoy which gives it a flag of “0”.
So long story short – I completely ignore the ECCC flags for these reasons! Please feel free to double-check my plots, it also seems that ECCC removes some of the very “jumpy” values (e.g. compare C46206). Basically my process is to use the Hakai q/c’d data as much as possible, using their flags, then “fill in” with ECCC where the other dataset does not exist.
Andrea: The codes info would be on the ECCC site somewhere (if at all) as they get the data directly from there.
Just confirming, I looked through my emails from a year ago – I had emailed someone at ECCC about this as well as this weird data spike issue [see plot below, and Andy's plots in data-raw/buoys/buoy-sst.R
], see the consistently “higher” points once per day but I never got a response. We could try emailing ECCC again.
Andy: spurious data for C46304 is end of Sept 2019: Interestingly, the daily average looks okay, because these extremes kind of get averaged out.
So, just keeping the qa_summary = 100 values (i.e. the best quality only) does NOT get rid of the problem with the wild swings in temperature at the start of buoy 46304 (likely the buoy measuring air temperature; the plot in my previous email does not change). So whatever the flags for these data do actually mean, they don't help.
So we have to come up with a criteria – if the daily range of a buoy is > X then exclude that day. So what would X be?
Here's a histogram of daily ranges, which looks as expected (units are degC):
So we want to exclude the right-hand tail above X. The counts in each bin are:
midpoints counts 0.5 7375 1.5 2333 2.5 1158 3.5 480 4.5 189 5.5 61 6.5 27 7.5 9 8.5 5 9.5 0 10.5 1
Question: any intuition as to what X should be? Maybe I'll set it slightly arbitrarily to 5 for now, and then can easily change it.
Thanks for all this, bit of a stream of consciousness I know. Also emphasises why no-one is yet using such data in their stock assessments!
Andrea: I feel like a middle ground on the q/c front might be good – for example the spurious values you point out at the beginning of that buoy record are ~8-10 degrees change over the course of a day. However, it might be better to bug ECCC directly about it, or set the start date for that buoy to be later, so that we don’t miss real events that may manifest large temp fluctuations (like storms or heatwaves). Similarly those weird “bumps” once a day I’ll revive the thread with them about, as that impacts all downstream users of this data.
The code for the plots was off the top of my head from a while ago, but something like this should work:
library(dplyr)
library(ggplot2)
library(lubridate)
sstdata %>%
ggplot(aes(x = time, y = SSTP, colour = as.factor(flaglayer))) +
geom_point() +
facet_wrap(~STN_ID)
Charles: How about inventing a new flag with a name like ‘deployment start-up’ And then just flag the annoying part at the beginning.
Andy: Thanks – I did start trying the manual flagging, but it can't really be automated, and seems to happen more than once (one buoy had a short period with no data, then fluctuations again). So I've stuck with the 5degC cut-off for now.
Interestingly, even for that wild 10degC swing in a day, when the temps are averaged over the day, the average is pretty much the same as the daily average a few days later (for days that did not have wild swings); i.e. the wild swing during the day kind of got averaged out anyway, and it doesn't look like a weird outlier when you look at just the daily averages.
Maybe that was luck in that the average of the daily air temperature (if our guessing is correct and the sensor was on the deck of a ship) was pretty close to the average SST anyway.
Andrea: Hey folks, here are a couple thoughts / plots to add to the convo.
I feel 5C is too small – however if you’d like to pursue making this flag, maybe increasing the range or making it dependent on day of year / location would work. I understand that the SST dataset in this package should be more accurate than what we have on our monitoring plots as it will be used with further analysis/stock assessment.
Attached are some quick plots of mean daily temp coloured by daily temperature range (Days > 5 C range circled in black in time series), as well as a range-vs-mean coloured by the day of year. We can see (as we’d expect) it’s seasonal and location dependent. For example, there are no records with daily ranges outside of 5 degrees at West Moresby(C46208), whereas many days would be likely erroneously removed each summer for Sentry Shoal (C46131) and Halibut Bank (C46146).
Also see #33
[x] When calculating daily average SST temperatures from higher resolution values in
data-raw/buoys/sst-data.R
, the average will be biased if, say, only values during the night time are collected.[x] Also should not calculate average for a final incomplete day.
[x] When have plotting functions, delve into the values a bit more to check for outliers that should be excluded based on some criteria.