4 values returned for one observation

BlaiseKelly commented 1 year ago

I was downloading some data for 2012 and noticed there were 4 values for every hour. Each value is different.

dat <- get_saq_observations(site = 'gr0027a', start = '2012-07-01', end = '2012-07-15', variable = 'o3')

The same happens for the site below: for no2, two values are returned per hour. I was expecting only one value per hour for both species.

dat_2 <- get_saq_observations(site = 'gb0002r', start = '2012-07-01', end = '2012-07-15', variable = 'no2')

skgrange commented 1 year ago

Hello Blaise, I have had a look and the observations are not true duplicates here. For the ozone example, there can be many summaries per date (including hourly means, daily means, and eight-hour means). The summary type is stored in the summary variable as an integer key. Below is an example of how to decode the summaries, but generally hourly means are desired, so the data can be filtered to rows where summary equals 1. The NO2 example is the same, but only two types of summaries are available for this pollutant. I hope that helps and that it is clear. Enjoy! Stuart.

# Load packages
library(dplyr)
library(saqgetr)

# Get summary keys
data_summary_keys <- get_saq_summaries()

# Get ozone observations
data_ozone <- get_saq_observations(
  site = "gr0027a", 
  variable = "o3",
  start = "2012-07-01", 
  end = "2012-07-15"
)

# Join the decoded versions of the summary integers
data_ozone_join <- data_ozone %>% 
  left_join(data_summary_keys, by = "summary") %>% 
  arrange(date)

# What do we have?
data_ozone_join %>% 
  distinct(variable,
           summary,
           averaging_period)
#> # A tibble: 5 × 3
#>   variable summary averaging_period
#>   <chr>      <int> <chr>           
#> 1 o3            20 day             
#> 2 o3            21 dymax           
#> 3 o3             1 hour            
#> 4 o3           101 8hour           
#> 5 o3           101 hour8

# Usually, hourly observations are desired and they are represented with 1

# Check if we can pivot, a good check for duplicate observations
data_ozone_join %>% 
  filter(summary == 1L) %>% 
  select(date,
         date_end,
         site,
         variable,
         value) %>% 
  tidyr::pivot_wider(names_from = variable)
#> # A tibble: 335 × 4
#>    date                date_end site       o3
#>    <dttm>              <dttm>   <chr>   <dbl>
#>  1 2012-07-01 00:00:00 NA       gr0027a    98
#>  2 2012-07-01 01:00:00 NA       gr0027a    98
#>  3 2012-07-01 02:00:00 NA       gr0027a    99
#>  4 2012-07-01 03:00:00 NA       gr0027a    98
#>  5 2012-07-01 04:00:00 NA       gr0027a    97
#>  6 2012-07-01 05:00:00 NA       gr0027a    97
#>  7 2012-07-01 06:00:00 NA       gr0027a    98
#>  8 2012-07-01 07:00:00 NA       gr0027a   100
#>  9 2012-07-01 08:00:00 NA       gr0027a   101
#> 10 2012-07-01 09:00:00 NA       gr0027a   103
#> # … with 325 more rows

# Filter and use for analysis
data_ozone_hour <- data_ozone %>% 
  filter(summary == 1L)
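
If you want to confirm that the filter leaves one value per hour without relying on the pivot, counting rows per date is another option. The following is only a minimal sketch on a small made-up table: the values and the `data_toy` name are invented for illustration, and the column names are assumed to match those in the output above.

```r
library(dplyr)

# Made-up observations mimicking the columns shown above
# (values are invented for illustration only)
data_toy <- tibble(
  date = as.POSIXct(
    c("2012-07-01 00:00:00", "2012-07-01 00:00:00", "2012-07-01 00:00:00",
      "2012-07-01 01:00:00", "2012-07-01 01:00:00"),
    tz = "UTC"
  ),
  site = "gr0027a",
  variable = "o3",
  summary = c(1L, 20L, 101L, 1L, 101L),
  value = c(98, 95, 97, 98, 96)
)

# Before filtering: several rows per date, one for each summary type
rows_per_date_all <- data_toy %>% 
  count(site, variable, date, name = "n_rows")

# After filtering to hourly means: expect exactly one row per date
rows_per_date_hour <- data_toy %>% 
  filter(summary == 1L) %>% 
  count(site, variable, date, name = "n_rows")

# TRUE when no date has more than one hourly value
all(rows_per_date_hour$n_rows == 1L)
```

The same two `count()` calls should work on `data_ozone` in place of `data_toy`, assuming it carries the same columns.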

BlaiseKelly commented 1 year ago

Very clear - thanks Stuart!

skgrange / saqgetr, issue #8