ropensci / weathercan

R package for downloading weather data from Environment and Climate Change Canada
https://docs.ropensci.org/weathercan
GNU General Public License v3.0

Add argument to prevent padding data with NAs #90

Open · klwilson23 opened this issue 4 years ago

klwilson23 commented 4 years ago

What I want to do:

I'm trying to download daily weather data for two stations, both listed under the station name Port Hardy A. The two stations' date ranges don't overlap: station 202 runs from 1944 until 2013, while station 51319 picks up in 2013 and continues until today. Basically, I would just like a single time series that accounts for where each station leaves off or picks up.

Issue?

The download creates a single data frame but duplicates the time series, once for each station ID. While I am getting the real data from each station (which is what I asked for), I am also getting missing data for each station outside its own range: every date I requested appears for both stations, padded with NAs wherever a station has no observations.

I'm not sure whether this behaviour for merging data across stations is intended or not. I could remove the padded dates manually (a rough sketch of what I mean follows the example below), but I might have to do some quality control on that. Suggestions?

Example:

Here are the stations for Port Hardy. Notice that Port Hardy A has two station IDs with two different ranges that don't overlap.

stations_search("Port Hardy", interval = "day")

Then I download those two stations:

portHardy_pg <- weather_dl(station_ids = c(202, 51319), start = "1975-01-01", end = "2018-12-31",
                           interval = "day", trim = TRUE, format = TRUE)

We can start to see the problem by looking at the temperatures for station 202 at the start and end of the range:

head(portHardy_pg[portHardy_pg$station_id==202,c(1,2,11,22:24)])
# A tibble: 6 x 6
  station_name station_id date       max_temp max_temp_flag mean_temp
  <chr>             <dbl> <date>        <dbl> <chr>             <dbl>
1 PORT HARDY A        202 1975-01-01      3.9 ""                  2  
2 PORT HARDY A        202 1975-01-02      6.1 ""                  3.1
3 PORT HARDY A        202 1975-01-03      3.9 ""                  2  
4 PORT HARDY A        202 1975-01-04      3.9 ""                  2.3
5 PORT HARDY A        202 1975-01-05      5   ""                  3.6
6 PORT HARDY A        202 1975-01-06      2.8 ""                  0.9
tail(portHardy_pg[portHardy_pg$station_id==202,c(1,2,11,22:24)])

Here we see the padded NAs for station 202 at the end of the range:

# A tibble: 6 x 6
  station_name station_id date       max_temp max_temp_flag mean_temp
  <chr>             <dbl> <date>        <dbl> <chr>             <dbl>
1 PORT HARDY A        202 2018-12-26       NA ""                   NA
2 PORT HARDY A        202 2018-12-27       NA ""                   NA
3 PORT HARDY A        202 2018-12-28       NA ""                   NA
4 PORT HARDY A        202 2018-12-29       NA ""                   NA
5 PORT HARDY A        202 2018-12-30       NA ""                   NA
6 PORT HARDY A        202 2018-12-31       NA ""                   NA

I get similar issues for station 51319 at the start and end of the range:

head(portHardy_pg[portHardy_pg$station_id==51319,c(1,2,11,22:24)]) # here we see the padded NAs for station 51319 at the beginning of the range
# A tibble: 6 x 6
  station_name station_id date       max_temp max_temp_flag mean_temp
  <chr>             <dbl> <date>        <dbl> <chr>             <dbl>
1 PORT HARDY A      51319 1975-01-01       NA ""                   NA
2 PORT HARDY A      51319 1975-01-02       NA ""                   NA
3 PORT HARDY A      51319 1975-01-03       NA ""                   NA
4 PORT HARDY A      51319 1975-01-04       NA ""                   NA
5 PORT HARDY A      51319 1975-01-05       NA ""                   NA
6 PORT HARDY A      51319 1975-01-06       NA ""                   NA
tail(portHardy_pg[portHardy_pg$station_id==51319,c(1,2,11,22:24)])
# A tibble: 6 x 6
  station_name station_id date       max_temp max_temp_flag mean_temp
  <chr>             <dbl> <date>        <dbl> <chr>             <dbl>
1 PORT HARDY A      51319 2018-12-26      5.5 ""                  2.8
2 PORT HARDY A      51319 2018-12-27      5.1 ""                  2.3
3 PORT HARDY A      51319 2018-12-28      4.4 ""                  4  
4 PORT HARDY A      51319 2018-12-29     10.8 ""                  7.4
5 PORT HARDY A      51319 2018-12-30      7   ""                  3.2
6 PORT HARDY A      51319 2018-12-31      5.2 ""                  2.1

Interestingly, if I download only one station but specify a "bad" range (one that extends beyond the station's records), the download trims itself to the observation period.

For example:

new_dl <- weather_dl(station_ids = 202, start = "1975-01-01", end = "2018-12-31",
                     interval = "day", trim = TRUE, format = TRUE)
tail(new_dl[new_dl$station_id==202,c(1,2,11,22:24)])
# A tibble: 6 x 6
  station_name station_id date       max_temp max_temp_flag mean_temp
  <chr>             <dbl> <date>        <dbl> <chr>             <dbl>
1 PORT HARDY A        202 2013-06-07     16.4 ""                 13.1
2 PORT HARDY A        202 2013-06-08     13.1 ""                 11.4
3 PORT HARDY A        202 2013-06-09     13.8 ""                 10.1
4 PORT HARDY A        202 2013-06-10     15.1 ""                 10.5
5 PORT HARDY A        202 2013-06-11     14.8 ""                 12.3
6 PORT HARDY A        202 2013-06-12     15.5 ""                 12.5
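
For reference, here is roughly the kind of manual clean-up I had in mind. It's just a sketch: it trims each station to its first and last non-missing max_temp, which assumes the padded rows are the ones on the ends and may not be safe if a station has long genuine gaps in its record.

library(dplyr)

# Rough sketch: keep, for each station, only the dates between its first and
# last non-missing max_temp observation, dropping the padded rows on the ends
portHardy_trimmed <- portHardy_pg %>%
  group_by(station_id) %>%
  filter(date >= min(date[!is.na(max_temp)]),
         date <= max(date[!is.na(max_temp)])) %>%
  ungroup()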

My Environment

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252    LC_MONETARY=English_Canada.1252
[4] LC_NUMERIC=C                    LC_TIME=English_Canada.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] weathercan_0.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3       rstudioapi_0.10  magrittr_1.5     tidyselect_0.2.5 R6_2.4.1         rlang_0.4.1     
 [7] fansi_0.4.0      stringr_1.4.0    httr_1.4.1       dplyr_0.8.3      tools_3.6.1      packrat_0.5.0   
[13] utf8_1.1.4       cli_1.1.0        ellipsis_0.3.0   assertthat_0.2.1 lifecycle_0.1.0  tibble_2.1.3    
[19] crayon_1.3.4     tidyr_1.0.0      purrr_0.3.3      vctrs_0.2.0      curl_4.2         zeallot_0.1.0   
[25] glue_1.3.1       stringi_1.4.3    compiler_3.6.1   pillar_1.4.2     backports_1.1.5  lubridate_1.7.4 
[31] pkgconfig_2.0.3 

steffilazerte commented 4 years ago

Hi @klwilson23!

This behaviour is deliberate because most stations are separate entities and the purpose is to create data frames with comparable time series. Some stations are continuations of each other, but it isn't clear to weathercan when that is the case.

Possibly we could consider adding a pad = FALSE argument (opposite of the trim argument) to avoid padding the time ranges.
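
Purely as an illustration, such a call might look something like the sketch below (hypothetical and not yet implemented, hence commented out):

# Hypothetical usage if a pad argument were added (it does not exist in the
# current release):
# weather_dl(station_ids = c(202, 51319),
#            start = "1975-01-01", end = "2018-12-31",
#            interval = "day", pad = FALSE)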

For now, you can either combine them, or filter out the NA values. Below is how to combine them (thus preserving NAs in the middle of the range):

library(weathercan)

# Download them separately for the whole time range
# (NAs on the ends will be trimmed, as you saw)
s1 <- weather_dl(station_ids = 202, start = "1975-01-01", end = "2018-12-31",
                 interval = "day")
s2 <- weather_dl(station_ids = 51319, start = "1975-01-01", end = "2018-12-31",
                 interval = "day")

# Bind the rows together
s <- rbind(s1, s2)

Then, to make sure there is no overlap between the stations, we can check visually:

library(ggplot2)
library(dplyr)

# Check the time range
ggplot(data = s, aes(x = date, y = max_temp, colour = factor(station_id))) +
  geom_point()

# Check the switch over
ggplot(data = filter(s, date > "2013-04-01", date < "2013-08-01"), 
       aes(x = date, y = max_temp, colour = factor(station_id))) +
  geom_point() +
  geom_line()
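
If you'd rather keep your single combined download, a rough sketch of the "filter out the NA values" option is below. Note the assumption: it drops every row with a missing max_temp, which removes the padded rows but also any genuinely missing days within a station's record.

# Sketch: remove padded rows (and genuinely missing days) from the combined
# download in your original post
portHardy_filtered <- filter(portHardy_pg, !is.na(max_temp))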

Does that address your problem?

klwilson23 commented 4 years ago

Howdy @steffilazerte

That'll work for me! This is a specific data grab that I only need to do a limited number of times. A padding TRUE/FALSE argument could be useful for future work (if it's feasible), in case we go for a bigger regional download of Vancouver Island stations. But there are clever solutions at the back end of the data grab anyway, in case you want to save yourself the headache.

Thanks for the great package!

steffilazerte commented 4 years ago

Glad it worked! I'm going to leave this issue up as a feature request for the padding argument. It wouldn't be difficult to implement.