ropensci / rnoaa

R interface to many NOAA data APIs
https://docs.ropensci.org/rnoaa

ghcnd_search() stopped pulling data after June 2018 #269

Closed kgmccann closed 5 years ago

kgmccann commented 6 years ago

Not sure if this is isolated to the county I specifically need, but the following code stops returning rows after 6/30/2018 (not even dates with NAs). I checked here just to be sure that the station was still updating, and it is; but even if it were offline, I would expect the function to return rows up to my date_max. Either way, this code is not behaving the way it has in the past.

library(rnoaa)
library(dplyr)
broward1 <- ghcnd_search(stationid = "US1FLBW0007", date_min = "2017-06-01", date_max = "2018-08-10", var = "prcp")
broward1$prcp %>% select(date, prcp) %>% arrange(desc(date))

System Info

setting   value
version   R version 3.5.0 (2018-04-23)
system    x86_64, mingw32
ui        RStudio (1.1.453)
language  (EN)
collate   English_United States.1252
tz        America/New_York
date      2018-08-13

sckott commented 6 years ago

thanks @kgmccann ! i'll have a look

sckott commented 6 years ago

p.s. when you share session info can you share the output of sessionInfo() after rnoaa is loaded so i can see what version of rnoaa is installed and what version of its dependencies

sckott commented 6 years ago

is this what you expect:

broward1$prcp %>% select(date,prcp) %>% arrange(desc(date))
#> # A tibble: 436 x 2
#>    date        prcp
#>    <date>     <int>
#>  1 2018-08-10    53
#>  2 2018-08-09     0
#>  3 2018-08-08     0
#>  4 2018-08-07    NA
#>  5 2018-08-06    NA
#>  6 2018-08-05    NA
#>  7 2018-08-04    NA
#>  8 2018-08-03    NA
#>  9 2018-08-02    NA
#> 10 2018-08-01     0

does seem like data is returned up to your max date.

kgmccann commented 6 years ago

Thanks, that is what i was looking for. maybe it's my version?

R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2 dplyr_0.7.5    rnoaa_0.7.0   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17     xml2_1.2.0       bindr_0.1.1      magrittr_1.5     rappdirs_0.3.1   tidyselect_0.2.4 munsell_0.4.3   
 [8] colorspace_1.3-2 R6_2.2.2         rlang_0.2.0      hoardr_0.2.0     stringr_1.3.1    httr_1.3.1       plyr_1.8.4      
[15] dplyr_0.7.5      tools_3.5.0      grid_3.5.0       gtable_0.2.0     digest_0.6.15    yaml_2.1.19      lazyeval_0.2.1  
[22] assertthat_0.2.0 tibble_1.4.2     bindrcpp_0.2.2   gridExtra_2.3    tidyr_0.8.1      purrr_0.2.4      ggplot2_3.0.0   
[29] glue_1.2.0       stringi_1.1.7    compiler_3.5.0   pillar_1.2.3     scales_0.5.0     XML_3.98-1.11    lubridate_1.7.4 
[36] jsonlite_1.5     pkgconfig_2.0.1 
sckott commented 6 years ago

Can you try installing from GitHub with remotes::install_github("ropensci/rnoaa") and try again?

kgmccann commented 6 years ago

Still stopping at 06-30.
The only thing I am not showing is how I am setting my token. it's like this:

api_token <- 'TOKENtokenToKENtokenToken'
options("noaakey" = api_token)

And the rest is:

> library(rnoaa)
> library(dplyr)
> broward1 <-ghcnd_search(stationid = "US1FLBW0007",date_min = "2017-06-01",date_max = "2018-08-10",var = "prcp")
> broward1$prcp %>% select(date,prcp) %>% arrange(desc(date))
# A tibble: 395 x 2
   date        prcp
   <date>     <int>
 1 2018-06-30    NA
 2 2018-06-29    NA
 3 2018-06-28    NA
 4 2018-06-27    NA
 5 2018-06-26    10
 6 2018-06-25     0
 7 2018-06-24    33
 8 2018-06-23    10
 9 2018-06-22   206
10 2018-06-21     0
# ... with 385 more rows
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2   dplyr_0.7.6      rnoaa_0.7.1.9326

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18     pillar_1.3.0     compiler_3.5.0   plyr_1.8.4       bindr_0.1.1      remotes_1.1.1   
 [7] tools_3.5.0      digest_0.6.15    jsonlite_1.5     lubridate_1.7.4  tibble_1.4.2     gtable_0.2.0    
[13] lattice_0.20-35  pkgconfig_2.0.1  rlang_0.2.1      Matrix_1.2-14    cli_1.0.0        rstudioapi_0.7  
[19] yaml_2.1.19      gridExtra_2.3    xml2_1.2.0       httr_1.3.1       stringr_1.3.1    rappdirs_0.3.1  
[25] grid_3.5.0       tidyselect_0.2.4 glue_1.3.0       R6_2.2.2         fansi_0.2.3      XML_3.98-1.15   
[31] hoardr_0.2.0     tidyr_0.8.1      ggplot2_3.0.0    purrr_0.2.5      magrittr_1.5     scales_1.0.0    
[37] assertthat_0.2.0 colorspace_1.3-2 utf8_1.1.4       stringi_1.1.7    lazyeval_0.2.1   munsell_0.5.0   
[43] crayon_1.3.4    
> 
sckott commented 6 years ago

can you run ghcnd_clear_cache() and then try your code again, let me know what happens.

kgmccann commented 6 years ago

It worked! thanks so much.

I have a follow-up question: I am planning on setting this up to run as an automated procedure to periodically update a weather table. Would you recommend using the clear cache function every time? Thanks!

sckott commented 6 years ago

At this point yes.

But i will see if i can cache the files including the date so that you then shouldn't have to clear the cache.
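
For a scheduled job, the interim advice above (clear the cache before every pull) could be sketched roughly as follows. This is only an illustration: `update_prcp()` is a hypothetical name, and its `clear` and `search` arguments are an assumption added purely so the function can be exercised with stand-ins instead of a live download. By default they point at the `ghcnd_clear_cache()` and `ghcnd_search()` calls used earlier in this thread.

```r
# Hypothetical helper for a scheduled update job (not part of rnoaa).
# `clear` and `search` default to the rnoaa functions used in this thread;
# they are arguments only so the sketch can be tried without network access.
update_prcp <- function(stationid, date_min, date_max,
                        clear = rnoaa::ghcnd_clear_cache,
                        search = rnoaa::ghcnd_search) {
  # Drop any previously cached station files so the next call re-downloads.
  clear()
  # Fetch the station data and restrict it to the requested dates.
  res <- search(stationid = stationid, date_min = date_min,
                date_max = date_max, var = "prcp")
  # Return only the precipitation table.
  res$prcp
}

# Usage sketch (needs rnoaa and network access, so not run here):
# prcp <- update_prcp("US1FLBW0007", "2017-06-01", "2018-08-10")
```

Note that clearing the cache on every run means every run re-downloads the full station file, which is the trade-off behind the "at this point" caveat.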

sckott commented 6 years ago

So I think this was the problem:

You requested data for station US1FLBW0007 at some earlier date X, and the file was cached on your machine. Then you made a subsequent request (your code at the top of this issue), which used the cached file. Enough time had passed that the cached file no longer covered the whole date range you requested (it was missing the most recent dates).

The issue is that the same file is downloaded for a station (e.g., US1FLBW0007) whether you restrict to dates or not. There's just no way to download the GHCND data from the FTP server by date; you get the whole thing or nothing at all 😄
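
To make the mechanism concrete, here is a small self-contained illustration in base R. It does not call rnoaa at all: `cached` is a toy stand-in for a station file that was downloaded on 2018-06-30, and `filter_dates()` is a hypothetical helper that mimics restricting the cached data to the requested range.

```r
# Toy stand-in for a cached station file last downloaded on 2018-06-30.
cached <- data.frame(
  date = seq(as.Date("2017-06-01"), as.Date("2018-06-30"), by = "day")
)
cached$prcp <- 0L

# Hypothetical helper: the date restriction is applied locally, to whatever
# rows the cached file happens to contain.
filter_dates <- function(df, date_min, date_max) {
  keep <- df$date >= as.Date(date_min) & df$date <= as.Date(date_max)
  df[keep, , drop = FALSE]
}

out <- filter_dates(cached, "2017-06-01", "2018-08-10")
nrow(out)       # 395
max(out$date)   # "2018-06-30", not the requested "2018-08-10"
```

The 395 rows match the stale result reported above; a file covering the full range through 2018-08-10 would give 436 rows, as in the fresh download.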

So if we cached files for every combination of stationid, min date and max date, that could lead to lots of files cached on your machine that are really all the same thing, unnecessarily taking up disk space.

So here's the approach we'll take: ghcnd() collects the file path and last modified date and prints them to the console, so users know where the file is and when it was last updated. ghcnd_search() (which wraps ghcnd(), so it also prints the file path and last modified date) additionally prints the min and max dates in the file. I also added a refresh parameter so you can force a re-download of the file based on that info. You can also programmatically get the file path and last modified date like:

x <- ghcnd("US1FLBW0007")
attr(x, "source")
attr(x, "file_modified")
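
One way an automated job might use that attribute is sketched below. `needs_refresh()` is a hypothetical helper, not part of rnoaa. It assumes the `file_modified` value can be coerced with `as.Date()`, and the commented usage assumes `ghcnd()` accepts the `refresh` argument described above.

```r
# Hypothetical helper: TRUE when the cached file was last modified on or
# before the last requested date, so it cannot be assumed to cover the range.
needs_refresh <- function(file_modified, date_max) {
  as.Date(file_modified) <= as.Date(date_max)
}

needs_refresh("2018-06-30", "2018-08-10")  # TRUE
needs_refresh("2018-08-13", "2018-08-10")  # FALSE

# Usage sketch (needs rnoaa and network access, so not run here):
# x <- ghcnd("US1FLBW0007")
# if (needs_refresh(attr(x, "file_modified"), "2018-08-10")) {
#   x <- ghcnd("US1FLBW0007", refresh = TRUE)
# }
```

Compared with clearing the whole cache on every run, this re-downloads a station file only when it looks too old for the requested range.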

reinstall with remotes::install_github("ropensci/rnoaa")

interested to hear your thoughts.

scoyoc commented 6 years ago

Thanks for this fix. I have been struggling with this same issue for a couple months now. My work computer would only download data through the 1st of the year, but I could download all the data through the current date on my machine at home. Now that I've read through this, it makes sense that clearing the cache fixes the issue.

Thanks again! MVS

sckott commented 6 years ago

@scoyoc great, glad this helps

sckott commented 5 years ago

AFAICT seems sorted