skgrange / saqgetr

Import Air Quality Monitoring Data in a Fast and Easy Way
GNU General Public License v3.0

saqgetr and lml discrepancy #12

Closed BlaiseKelly closed 10 months ago

BlaiseKelly commented 11 months ago

I noticed some discrepancy between the data from saqgetr and the Dutch observation network (LML operated by RIVM). The LML data is dated 07/2023 so potentially this is simply a case of Airbase not being updated with the adjusted data from RIVM, but the data from Airbase says it has been validated.

How often is Airbase updated? Are there multiple validation steps?

RIVM info on validation (not much detail) https://www.luchtmeetnet.nl/informatie/overige/validatie-data

Reprex below:

library(saqgetr)
library(dplyr)
library(lubridate)
library(threadr)
library(openair)
library(reshape2)

## import all netherlands sites
saq_sites_nl <- get_saq_sites() %>% 
  filter(grepl("nl0", site))

## get valid observations for 2022
saq_nl <- get_saq_observations(site = saq_sites_nl$site, variable = "pm2.5", valid_only = TRUE, tz = "UTC", start = "2022", end = "2022") %>% 
  select(date, site, saqgetr = value) 

## import csv, doesn't like header so go for row above
lml_dat <- read.table("https://data.rivm.nl/data/luchtmeetnet/Vastgesteld-jaar/2022/2022_PM25.csv", skip = 9, sep = ';')
## use first row
names(lml_dat) <- lml_dat[1,]

## LML is in CET winter time, convert to UTC
lml_nl <- lml_dat[-1,] %>% 
  mutate(date = ymd_hm(` Begindatumtijd`, tz = "UTC")-3600)

lml_nl_down <- lml_nl[,-c(1,2,3,4,5)]  %>% 
  melt('date') %>% 
  mutate(variable = gsub("NL01", "nl00", variable),
         variable = gsub("NL10", "nl00", variable),
         variable = gsub("NL49", "nl00", variable)) %>% 
  transmute(date, site = variable, lml = as.numeric(value))

## left join with saq first as it has fewer dates with data
saq_lml_nl <- left_join(saq_nl, lml_nl_down, by = c('date', 'site'))
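
A quick sanity check on the time-zone step above may be useful before trusting the join. This is a minimal base R sketch on made-up timestamps (the timestamp format is an assumption, not taken from the RIVM file). It compares "parse as UTC and subtract one hour" with parsing directly in a fixed UTC+1 zone.

```r
## toy timestamps in fixed CET winter time (UTC+1); the format is assumed
stamps <- c("2022-01-01 01:00", "2022-07-01 12:00")

## approach used in the reprex: parse as UTC, then subtract one hour
shifted <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M", tz = "UTC") - 3600

## alternative: parse in a fixed-offset zone ("Etc/GMT-1" means UTC+1)
fixed <- as.POSIXct(stamps, format = "%Y-%m-%d %H:%M", tz = "Etc/GMT-1")
attr(fixed, "tzone") <- "UTC"  ## same instants, displayed in UTC

## TRUE if both approaches give the same instants
all(shifted == fixed)
```

Both give the same result because the offset is fixed all year. A zone such as "Europe/Amsterdam" would apply summer time and shift the July value by a further hour, which is not what fixed winter time calls for.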

Summarising the two datasets for each site


## summary stats
statz <- aqStats(saq_lml_nl, c('saqgetr', 'lml'), type = "site")

## calculate daily means and number of days at or above 15 ug/m3
saq_lml_24h_exceed <- saq_lml_nl %>% 
  timeAverage("day", type = "site") %>% 
  group_by(site) %>% 
  summarise(saq_gt_15 = sum(saqgetr >= 15, na.rm = TRUE),
            lml_gt_15 = sum(lml >= 15, na.rm = TRUE)) %>% 
  left_join(saq_sites_nl, by = "site") %>% ## get site info
  select(site, site_type, site_area, saq_gt_15, lml_gt_15) %>%
  arrange(site_type, site_area) ## arrange by site type then site area

Below is an example for the site Vredepeel-Vredeweg (NL00131): the saqgetr values are about 1 ug/m3 higher than the LML values from 01/01/2022 to 24/11/2022 16:00, after which the two are the same.

## Example of one site

## import background site Vredepeel-Vredeweg
saq_nl00131 <- get_saq_observations(site = "nl00131", variable = "pm2.5", valid_only = TRUE, tz = "UTC", start = "2022", end = "2022") %>% 
  select(date, saqgetr = value) 

## import csv, doesn't like header so go for row above
lml_dat <- read.table("https://data.rivm.nl/data/luchtmeetnet/Vastgesteld-jaar/2022/2022_PM25.csv", skip = 9, sep = ';')
## use first row
names(lml_dat) <- lml_dat[1,]

## convert to UTC
lml_nl00131 <- lml_dat[-1,] %>% 
  transmute(date = ymd_hm(` Begindatumtijd`, tz = "UTC")-3600,
            lml = as.numeric(NL10131))

## join them together
nl00131 <- left_join(saq_nl00131, lml_nl00131, by = 'date')

## plot full time series
threadr::time_dygraph(nl00131, c('saqgetr', 'lml'))

## plot summary
openair::timeVariation(nl00131, c('saqgetr', 'lml'))

(Image: example period of difference)
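
The size and end of the offset can also be read off numerically rather than from the plot. The following is a minimal base R sketch on a toy table that stands in for the joined `nl00131` data frame. The values are invented for illustration, and the tolerance of 1e-9 is an assumption to guard against floating-point noise.

```r
## toy hourly series standing in for the joined `nl00131` table
dates <- seq(as.POSIXct("2022-11-24 12:00", tz = "UTC"), by = "hour", length.out = 8)
lml <- c(10, 11, 12, 13, 14, 15, 16, 17)
## saqgetr is 1 ug/m3 higher for the first five hours, then identical
saqgetr <- lml + c(1, 1, 1, 1, 1, 0, 0, 0)
toy <- data.frame(date = dates, saqgetr = saqgetr, lml = lml)

## hourly difference between the two sources
toy$diff <- toy$saqgetr - toy$lml

## how often each difference occurs
table(toy$diff)

## last timestamp at which the two sources still disagree
last_offset <- max(toy$date[abs(toy$diff) > 1e-9])
last_offset
```

Applied to the real joined table, rows where either source is missing would need to be dropped first, for example with `complete.cases()`.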

skgrange commented 11 months ago

Hello Blaise, I do no transformations of the data that are supplied by European data sources. As you say, this has most likely occurred due to a change in the data that have been submitted, made publicly available, and finally made available again via saqgetr.

An update of the near-real-time data (called the E2a data flow on the European side) is scheduled for the weekend. I will make sure that all years that are in this data flow (at least 2022 and 2023) are updated and exported to ensure everything from this data source is up to date. I will let you know early next week when this is completed and you can run your test again. Hopefully, that change would have been propagated through the system by that time. Talk soon! Stuart.

BlaiseKelly commented 11 months ago

Thanks Stuart, I do understand saqgetr doesn't adjust the data in any way but I know you have a good understanding of the data sources. I will indeed try again next week. Much appreciated Blaise

skgrange commented 11 months ago

Hello Blaise, I hope your weekend was a good one. The updating programmes ran yesterday, so all E2a (near-real-time) data have been updated and exported. This will include the Dutch observations for 2022. These observations are publicly accessible now. Could you please run your tests again? If the discrepancy still exists, this will be due to the data submission handled by the European Commission not being available or the submission by the Dutch authorities not being up to date. Let me know the result, I hope it is resolved! Have a great week, Stuart.

BlaiseKelly commented 10 months ago

Hi Stuart, Sorry for the slow response, I was on holiday for a couple of weeks. I tried this again and there is still the same difference. Is it maybe the case that the E2a data doesn't get updated by revisions made at a later date by countries? I will try to contact LML here in the Netherlands and see what they say. Thanks for your help. Blaise

skgrange commented 10 months ago

Hello Blaise, Interesting. Ok, I am not sure how updates are handled on the European Commission's side, so it is certainly worth a follow-up. Moving data around can lead to differences if things are not updated completely which does happen with time series quite frequently. Let me know the situation when things become clear. Have a great week! Stuart.

mooibroekd commented 10 months ago

Hi Stuart and Blaise, We are currently investigating why there is a discrepancy between the airbase data and our monitoring data. The official repository of our network is currently up to date, so that might suggest that we haven't uploaded possible changes to airbase yet. I also see that our EU site codes are not carried over correctly in airbase, we will take a look at that too. As we have different monitoring networks in the Netherlands, there is always a possibility of overlap in the last three numbers of the site code. Hence, the first two numbers after the "NL" denote the different monitoring networks. Dennis
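
The point about overlapping site codes matters for the reprex, which maps every network prefix (NL01, NL10, NL49) to "nl00". A minimal base R sketch of how such collisions could be detected is below. Apart from NL10131, the column names are hypothetical and chosen only to show the effect.

```r
## hypothetical LML column names: network prefix (NL01, NL10, NL49) + 3 digits
lml_codes <- c("NL10131", "NL49014", "NL01014", "NL10538")

## same mapping as in the reprex: every network prefix becomes "nl00"
mapped <- sub("^NL(01|10|49)", "nl00", lml_codes)

## mapped codes that more than one original station ends up sharing
collisions <- unique(mapped[duplicated(mapped)])
collisions

## the original codes behind each collision
lml_codes[mapped %in% collisions]
```

Any station listed by the last line would be joined against the same saqgetr site and should be checked by hand before comparing values.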

mooibroekd commented 10 months ago

Hi Stuart and Blaise, The validation processes for the Dutch monitoring network are (in short) as follows:

In the case of PM2.5 we have made some adjustments to the monthly preliminary data of 2022 during the yearly final validation of our data. These adjustments are the cause of the ~1 ug/m3 differences found. We do not mention these adjustments in our overview of changes made (Dutch only), as this document only deals with changes after the final yearly validation and subsequent public release of data.

Data flow E2a is filled with data prior to the yearly validation and are therefore preliminary. Member States may update the information in data flow E2a, but this is not mandatory (see point 8 in this link.) For the Netherlands data flow E2a is not updated with the final validated data. These data are reported within data flow E1a.

Hence, subsequent updates of E2a data by this package will not include possible changes made during our yearly final validation. The data reported on our own repository does include these possible changes. Please note, at our repository we also make the preliminary monthly data for current year available in a different folder than the final data set.

@skgrange We might not be the only Member State that does not update E2a. Perhaps {saqgetr} should use a tiered approach in the sense that it uses the E2a data flow when E1a is not (yet) available?
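
To make the tiered idea concrete, here is a minimal base R sketch on toy data. It is not how saqgetr or its database works: the `e1a` and `e2a` tables, the `flow` column, and all values are hypothetical. It prefers an E1a value wherever one exists for a site and timestamp and falls back to E2a otherwise.

```r
## toy observations; the tables, the `flow` column and all values are hypothetical
e1a <- data.frame(
  site = "nl00131",
  date = as.POSIXct(c("2021-12-31 22:00", "2021-12-31 23:00"),
                    format = "%Y-%m-%d %H:%M", tz = "UTC"),
  value = c(9, 10)
)
e2a <- data.frame(
  site = "nl00131",
  date = as.POSIXct(c("2021-12-31 23:00", "2022-01-01 00:00"),
                    format = "%Y-%m-%d %H:%M", tz = "UTC"),
  value = c(11, 12)
)

## label each row with the data flow it came from
e1a$flow <- "E1a"
e2a$flow <- "E2a"

## stack with E1a first, so that for a repeated site/date the E1a row is kept
stacked <- rbind(e1a, e2a)
tiered <- stacked[!duplicated(stacked[c("site", "date")]), ]
tiered <- tiered[order(tiered$site, tiered$date), ]
tiered
```

Keeping the `flow` label in the output lets an analyst see which observations are still preliminary. The fallback here is per timestamp; a per-year switch would be a different design choice.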

skgrange commented 10 months ago

Hello All, Ok, that all makes sense. If the near-real-time data flow (E2a) does not require updating after the first transmission, discrepancies will occur once the source data undergo adjustments for various reasons.

The logic for updating my database is that the near-real-time observations from the E2a data flow are updated monthly for all member states and are made available with saqgetr. The deadline for the validated observations (E1a) is the end of September, and after the European Commission have performed their checks (which takes a couple of months), the previous year's E2a observations are purged and the E1a observations are inserted. As of today, no member state's E1a observations have been made available for 2022, and therefore, the E2a observations are all that are available for 2022 at this stage. So everything is consistent from my point of view and the discrepancies can be explained, but it does add a wrinkle to data analysis activities.

This being said, I am thinking that I will discontinue this service. I am getting several messages a month with users questioning/inquiring/complaining about the data. My objective has always been to make European air quality observations accessible but I no longer use this database myself much and my capacity is limited. I will have a chat with some colleagues about a possible migration, but it might be worth evaluating the workload that will be required to query the data portals directly and do your own cleaning. I will not switch things off without appropriate notification, but it might be something to plan for. Thanks for the help and have a great weekend! Stuart.

BlaiseKelly commented 10 months ago

Hi Stuart, thanks for the clarification. As you say there is absolutely no issue with this package so you can close this issue.

The intention was not to complain about the service, which is extremely useful. Maybe there are better ways to understand the data than to raise an issue here. Apologies.

skgrange commented 10 months ago

Hello Blaise, No need to apologise. These things need to be looked into and it is always good to know what is going on, and in this case, we now understand the situation. I am more concerned with comments by others who demand explanations on observations that may or may not be correct that have been delivered from a data submission by a member state. There is always the possibility that there is an issue on my side, so these issues need to be looked into, but my capacity is becoming increasingly limited due to more or less moving on from this work.

No problems, something will get sorted. Have a great week! Stuart.