ropensci / weathercan

R package for downloading weather data from Environment and Climate Change Canada
https://docs.ropensci.org/weathercan
GNU General Public License v3.0

How to select stations that have complete data for a specified period? #99

Closed AmeerDotHydro closed 4 years ago

AmeerDotHydro commented 4 years ago

I am trying to select all stations in Saskatchewan that have data for my chosen period. I don't want stations that have data for only a few years within my defined time period. I am looking for stations that have complete records for 1990-2015 (I don't want stations that start, for example, in 1993 and end in 2010). Any help would be appreciated.

SK_stations <- filter(stations, prov == "SK", interval == "day", start >= 1990, end <= 2015)
SKData <- weather_dl(SK_stations$station_id, start = "1990-01-01", end = "2015-12-31", interval = "day")

boshek commented 4 years ago

👋 @AmeerDotHydro

I think this is something that happens outside of weathercan since tools like dplyr are so good at this task.

A quick, off-the-top-of-my-head, tidyverse-heavy solution is this:

library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)

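# Nest each station's records, then count the distinct calendar years present
SKData %>%
  group_by(station_name, station_id) %>%
  nest() %>%
  mutate(num_years = map_int(data, ~n_distinct(year(.x$date)))) %>%
  # 1990 to 2015 inclusive is 26 calendar years
  filter(num_years == 26) %>%
  unnest(cols = c(data))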

I might be missing something though because all the data have values from 1990 to 2015.

steffilazerte commented 4 years ago

@boshek's solution is a good one because weathercan has no way of knowing if stations have complete data. Stations may list 1990 as the start and 2015 as the end, but if the station stopped working for 2000-2002, you wouldn't know that until you'd downloaded the data.

Then you can use this code from @boshek to figure out if there are any missing years.

However, I would be tempted to take any station that started in or BEFORE 1990 and stopped in or AFTER 2015. If a station started in 1989 and stopped in 2016, it would still have data from 1990 to 2015.

I would also explore possible missing data with the naniar package:

library(weathercan)
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
library(naniar) # explore missing values

# Get all stations in Saskatchewan which were operating between 1990 and 2015
SK_stations <- filter(stations, prov == "SK", interval == "day", 
                      start <= 1990, end >= 2015)

# Get data for these stations between 1990 and 2015
SKData <- weather_dl(SK_stations$station_id, 
                     start = "1990-01-01", end = "2015-12-31", interval = "day")

# Explore missing values
gg_miss_var(SKData, facet = year)

# Filter to only stations with data in every year
SKData %>%
  group_by(station_name, station_id) %>% 
  nest() %>% 
  mutate(num_years = map_int(data, ~n_distinct(year(.x$date)))) %>% 
  filter(num_years == 26) %>% 
  unnest(cols = c(data))

AmeerDotHydro commented 4 years ago

Thanks for the suggestions. I needed this to filter all SK stations that have at least 15-20 years of data that I can use for computing statistics (mean, median, upper and lower quartiles). I would then use the current year's precipitation and make plots for these stations like the attached example (Ex_figure). I can do the plotting, but filtering the stations by a data threshold is what I was looking for.

steffilazerte commented 4 years ago

In that case, something along the lines of the above suggestions should work. Good luck!
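
As a rough illustration of how the suggestions above could be adapted to a "minimum number of years" threshold, here is a minimal sketch. It uses a small made-up data frame in place of the real weather_dl() output so that it runs without downloading anything. The thresholds min_years and min_days_per_year and the choice of total_precip as the variable of interest are assumptions for illustration, not something settled in this thread. Unlike counting distinct years in the date column, it counts only years that have enough non-missing observations.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for the output of weather_dl() (assumption: the real daily data
# has station_name, station_id, date and total_precip among many other columns).
dates <- seq(as.Date("1990-01-01"), as.Date("1992-12-31"), by = "day")
toy <- bind_rows(
  # Station A: observations on every day of all three years
  tibble(station_name = "A", station_id = 1, date = dates,
         total_precip = 0.5),
  # Station B: observations in 1990 only, missing afterwards
  tibble(station_name = "B", station_id = 2, date = dates,
         total_precip = ifelse(year(dates) == 1990, 1, NA_real_))
)

min_years <- 2            # hypothetical; the thread mentions 15-20 years
min_days_per_year <- 330  # hypothetical definition of a "usable" year

# Count, per station, the years with enough non-missing observations
usable <- toy %>%
  group_by(station_name, station_id, yr = year(date)) %>%
  summarize(n_obs = sum(!is.na(total_precip)), .groups = "drop") %>%
  filter(n_obs >= min_days_per_year) %>%
  count(station_name, station_id, name = "n_years") %>%
  filter(n_years >= min_years)

# Keep only the records from stations that meet the threshold
filtered <- semi_join(toy, usable, by = c("station_name", "station_id"))
```

With real data, toy would be replaced by SKData, and the two thresholds would be set to whatever counts as enough data for the statistics being computed.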