skgrange / saqgetr

Import Air Quality Monitoring Data in a Fast and Easy Way
GNU General Public License v3.0

Use of get_saq_observations() for parallel data + use of 'valid_only' #5

Closed. SverreSolberg closed this issue 3 years ago.

SverreSolberg commented 3 years ago

Hi Stuart, two questions to the get_saq_observations():

  1. I encountered an issue when extracting PM10 data for a station (gb0036r in 2015) that has both daily and hourly resolved data. When using get_saq_observations() I get a tibble containing 9101 observations, and it turns out that the first 365 (actually 364, one missing) values refer to the daily observations and the rest refer to the hourly ones, all packed together. They can be separated by using saq_clean_observations() with spread = T, but I thought it was a bit strange that they were packed into one variable at first. I think it would be useful to be able to specify the temporal resolution (i.e. the 'period' as given by EEA) inside get_saq_observations(), e.g. a parameter like 'period = hour' inside that function. Just a wish.
  2. The second question regards 'valid_only'. It's not completely clear to me what this does. If I set it to F, do I risk getting actual data values that should not be used, or do I just get NAs for the times with invalid data? This is important to know. It seems to me that the latter is correct, but it would be good to have that confirmed (I can live with the NAs but don't want the erroneous observational values, of course). Thanks, Sverre
skgrange commented 3 years ago

Hello Sverre, You are correct, the get_saq_observations function returns a single table for the given site and year combinations. It would be rather easy to include another argument for filtering to certain aggregation periods, but at this stage this can easily be done after the importing stage. There is also merit in keeping the importing and filtering steps separate, to allow full transparency and control for the data user. This is especially important when there are many periods for the same variable.

The European air quality data source contains a mechanism to code observations with various forms of validity. Validity is a tricky topic and is handled rather differently among the different states which report data to the European Commission. Three integers are considered valid: 1, 2, and 3, but a missing validity value also needs to be treated as valid because some states do not apply the same rigorous standard to this particular piece of reporting. Generally, if you are doing a serious analysis, invalid observations should be removed or set to missing.

When setting the valid_only argument to TRUE, observations with one of the three integers representing a valid observation, or NA for validity, are kept. Therefore, you will generally have fewer observations when using this argument. If you do not want rows to be removed, you could instead apply a conditional operation that sets value to NA whenever validity is not 1, 2, 3, or NA. Unfortunately, because of the variable use of the validity variable across Europe, it is always good to test whether this logic makes sense. Because I know the Harwell site somewhat well, this will not be a problem in your case.

Below is my R working to demonstrate a few of the things I have said above.

# Load packages
library(dplyr)
library(saqgetr)

# What site is of interest? 
get_saq_sites() %>% 
  filter(site == "gb0036r") %>% 
  select(1:11)

# Get summary integers
data_summaries <- get_saq_summaries() %>% 
  filter(averaging_period != "var")

# Get validity integers
data_validity <- get_saq_validity()

# Get observations
data_harwell <- get_saq_observations("gb0036r", start = 2015, end = 2015)

# What do we have here? 
data_harwell %>% 
  distinct(site,
           variable,
           summary) %>% 
  arrange(summary) %>% 
  left_join(data_summaries, by = "summary") %>% 
  print(n = Inf)

# Types of validity? Use a single variable to show the different types
data_harwell %>% 
  filter(variable == "i_hexane_2_methylpentane") %>% 
  distinct(site,
           summary,
           variable,
           validity) %>% 
  left_join(select(data_validity, validity, valid, description), by = "validity") %>% 
  print(n = Inf)

# Filter to only hourly observations
data_harwell_hour <- data_harwell %>% 
  filter(summary == 1L)

# Filter to hourly and valid observations
data_harwell_hour_valid <- data_harwell %>% 
  filter(summary == 1L,
         validity %in% 1L:3L | is.na(validity))
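
As a sketch of the conditional alternative mentioned above, the same validity logic can be used to keep every row but set invalid values to missing (this assumes value is the numeric observation column, as in the tables used here):

# Keep all hourly rows, but set values with an invalid validity code to missing
data_harwell_hour_masked <- data_harwell %>% 
  filter(summary == 1L) %>% 
  mutate(value = if_else(validity %in% 1L:3L | is.na(validity), value, NA_real_))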

Stuart.

SverreSolberg commented 3 years ago

Ok, many thanks. I will use this in my scripts. Sorry for the continuing posts, but now it seems I have a memory issue, and this is probably unrelated to the issue above. I extract data for several thousand sites and run into out-of-memory trouble (on Linux). I have traced it down to the get_saq_observations() function, which I call inside a for loop. Each time it is called, the memory usage increases by on the order of 5 MB, so when extracting data for ~2000 sites I hit the memory limit. Any idea why this function leads to an accumulation of memory? Sverre

skgrange commented 3 years ago

Hello Sverre, No problems. Within your for loop, are you exporting data or accumulating it in a list to bind after the loop? If you are accumulating 2000 sites' worth of observations, I would not be surprised that you are running out of memory on a standard laptop or desktop computer. If you are exporting data at every iteration, you will be better off calling get_saq_observations with purrr::walk and returning NULL. The memory management should then take over when resources become constrained, because unneeded objects generated in the function will be automatically discarded.
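
A minimal sketch of that pattern, assuming each site's observations are written to disk inside the loop (the file naming below is illustrative, not part of the package):

# Load packages
library(purrr)
library(readr)
library(saqgetr)

# Vector of sites to export (illustrative)
sites <- c("gb0036r")

# walk discards return values, so the imported tibbles are not accumulated
walk(sites, function(site) {
  data_site <- get_saq_observations(site, start = 2015, end = 2015)
  write_csv(data_site, paste0("air_quality_data_", site, ".csv"))
})

Stuart.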

SverreSolberg commented 3 years ago

Hi Stuart, no, we are not accumulating a long list. It seems that there is some kind of memory leak when using get_saq_observations. It is no problem for a modest number of calls, but it leads to a crash when there are on the order of thousands of calls. The example code below shows the memory build-up. I am afraid I have no experience with the walk function or the purrr package, and it's not clear to me how this could be used. Calling get_saq_observations with a vector of a thousand sites would anyway not work, I guess.

# Load packages
library(saqgetr)
library(pryr)
library(tidyverse)

# Repeatedly import the same site's observations and print the memory used
site <- "gb0036r"

for (i in 1:100) {
  data_this_site <- get_saq_observations(
    site, start = 2015, end = 2015, variable = "o3", valid_only = TRUE
  )
  print(paste("i = ", i, "memory used = ", mem_used() / 1e6, "Mb"))
}

skgrange commented 3 years ago

Hello Sverre, Ok, it is not clear to me why you are using a loop here. Without pre-allocation, there may be thousands of copies being made which are not being subjected to garbage collection.

I would keep it simple and pass the vector of 2000-ish sites to the get_saq_observations function in one go. Here, I have filtered the processes and sites tables to include hourly ozone processes for the E2a data source and then used get_saq_observations. There are 2228 sites, it ran with no problems, the return was about 15 million observations, and the tibble was ~880 MB in memory. This sort of size should be easy to handle on most systems. If it is not, you could filter the site vector by country, or into some other sort of chunk which is manageable for your system's resources. You can also see how much data can be returned with one function call if you were to request a few more years of observations. Below is my code:

# Thousands of calls test ---------------------------
# Load packages
library(dplyr)
library(saqgetr)

# Get helpers tables
data_sites <- get_saq_sites()
data_processes <- get_saq_processes()

# Filter process table to relevant ozone processes
data_processes_filter <- data_processes %>% 
  filter(variable == "o3",
         data_source == "aqer:e2a",
         period == "hour")

# Filter sites table too
data_sites_filter <- data_sites %>% 
  filter(site %in% unique(data_processes_filter$site))

# Get thousands of sites worth of data in one call
data_ozone <- data_sites_filter %>% 
  pull(site) %>% 
  get_saq_observations(
    variable = "o3", start = 2015, end = 2015, valid_only = TRUE, verbose = TRUE
  )

# Check the size
dim(data_ozone)
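
If a single call does return more data than the available memory can hold, here is a sketch of the chunking idea mentioned above, assuming the sites table contains a country variable to group by:

# Import the observations one country at a time and bind afterwards
data_ozone_chunked <- data_sites_filter %>% 
  split(.$country) %>% 
  lapply(function(df) {
    get_saq_observations(
      df$site, variable = "o3", start = 2015, end = 2015, valid_only = TRUE
    )
  }) %>% 
  bind_rows()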

Stuart.

SverreSolberg commented 3 years ago

Hi Stuart, many thanks for all your help. Based on the "warning" about large amounts of data in the manual, I didn't consider putting all sites into the call to get_saq_observations; that was the reason for the for loop. Now I have tested your script, and on our Linux cluster the first 1000 sites required close to 4 GB of memory, so obviously much more than on your system. But your code was faster than the for loop, it is doable with batches of 1000 sites, and I could probably request somewhat more than 4 GB as well. The memory requirement apparently scales with the number of sites.

skgrange commented 3 years ago

Hello Sverre, Good to hear. The note in the manual was intended to let users know that it can be rather easy to hit memory issues quickly if there is no appreciation of the number of observations which can be accessed. If you are on a well-resourced cluster, it is easy to understand that R is not bothering to clean things up during execution, because there are no concerns about running out of memory. That being the case, just grab all the data you want with one function call! That keeps things super easy and simple for you! Stuart.