ropensci / rnoaa

R interface to many NOAA data APIs
https://docs.ropensci.org/rnoaa

unknown column: precipitation ISD #168

Closed rjbehnke closed 7 years ago

rjbehnke commented 8 years ago

Hi,

When I use the rnoaa package to get ISD data, I often get the warning message "unknown column 'precipitation' ". Is there a way to fix this? I am using this package to download the ISD data set for North American stations. I downloaded the isd station history, and am going through each station at a time.

Thank you, Ruben Behnke

sckott commented 8 years ago

Thanks for your message. Please paste in your sessionInfo() when you have rnoaa loaded

sckott commented 8 years ago

And any example usage of isd() when you get that warning

rjbehnke commented 8 years ago

Hi,

Here you go. Thank you!

R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.1.0 dplyr_0.5.0   plyr_1.8.4    rerddap_0.3.4 rnoaa_0.6.0  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7      xml2_1.0.0       magrittr_1.5     rappdirs_0.3.1   munsell_0.4.3    colorspace_1.2-6 R6_2.1.3         httr_1.2.1       tools_3.3.1      grid_3.3.1      
[11] data.table_1.9.6 gtable_0.2.0     DBI_0.5          assertthat_0.1   digest_0.6.10    tibble_1.2       gridExtra_2.2.1  ggplot2_2.1.0    tidyr_0.6.0      curl_1.2        
[21] ncdf4_1.15       mime_0.5         stringi_1.1.1    scales_0.4.0     XML_3.98-1.4     jsonlite_1.0     lubridate_1.5.6  chron_2.3-47    

Example CODE: (Note that the warning messages here come when the download failed, but I have seen it for successful downloads, as well).

[1] 1965
Error : download failed for
   ftp://ftp.ncdc.noaa.gov/pub/data/noaa/1966/690070-93217-1966.gz
In addition: Warning messages:
1: Unknown column 'precipitation' 
2: Unknown column 'precipitation' 
3: Unknown column 'precipitation' 

Ruben

rjbehnke commented 8 years ago

for (yr in 1901:2016) {
  try(
    assign(paste("data", yr, sep = ""),
           isd(isd_history$USAF[stn],
               isd_history$WBAN[stn], year = yr, path = "I:\\ISD",
               overwrite = TRUE, cleanup = TRUE)$data)
  )
  print(yr)
}

sckott commented 8 years ago

thanks, that warning comes from tibble: the output data.frame is a special kind of data.frame, of class tbl_df

it's just a warning, but I've just added suppressWarnings to the parsing code so that shouldn't show up anymore. reinstall devtools::install_github("ropensci/rnoaa") and try again
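For anyone who can't reinstall right away, here is a rough sketch of silencing only this warning on the calling side with base R's withCallingHandlers(), so other warnings still get through. The helper name muffle_unknown_column() and the message pattern are made up for illustration; they are not part of rnoaa, and noisy() is only a stand-in for a call like isd(...).

```r
# Hypothetical helper (not part of rnoaa): evaluate an expression and
# muffle only warnings whose message matches `pattern`.
muffle_unknown_column <- function(expr, pattern = "Unknown column") {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(pattern, conditionMessage(w))) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Stand-in for a call that warns and then returns a value
noisy <- function() {
  warning("Unknown column 'precipitation'")
  42
}

res <- muffle_unknown_column(noisy())
```

Unlike a blanket suppressWarnings(), this leaves unrelated warnings (for example about a failed download) visible.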

Can you share the code I need to reproduce the error above?

Error : download failed for
   ftp://ftp.ncdc.noaa.gov/pub/data/noaa/1966/690070-93217-1966.gz

that works for me, not sure why it doesn't for you, perhaps a path problem

rjbehnke commented 8 years ago

Here is the zipped R code I'm using. I get 'download failed' errors a lot. Maybe my code is just bad. I don't know. I'm not a super experienced programmer.

Get_ISD.zip

rjbehnke commented 8 years ago

I was also wondering if I can just use rnoaa to parse downloaded ISD .gz files. Is there a way to do this? I really appreciate your help.

Ruben

sckott commented 8 years ago

Thanks I'll take a look at your code and get back to you here

I was also wondering if I can just use rnoaa to parse downloaded ISD .gz files. Is there a way to do this? I really appreciate your help.

Not at the moment, but I can expose a function to do that, see #169

rjbehnke commented 8 years ago

It seems like I can already parse the data just by pointing the path to the directory where the files are located, but a specific function to do this would be great. I am currently downloading all the files.
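A minimal sketch of what parsing a folder of already-downloaded files could look like. It assumes a reader function that takes a single file path as its first argument (isd_read(), as discussed in #169, is assumed to work that way); the helper name read_local_isd() and its `reader` argument are hypothetical and only there so the loop does not download anything itself.

```r
# Hypothetical helper: parse every ISD .gz file already sitting in `dir`.
# `reader` is assumed to be a function that takes one file path, e.g.
# rnoaa::isd_read; it is passed in so nothing is downloaded here.
read_local_isd <- function(dir, reader) {
  files <- list.files(dir, pattern = "\\.gz$", full.names = TRUE)
  out <- lapply(files, function(f) {
    tryCatch(reader(f), error = function(e) {
      message("could not parse ", basename(f), ": ", conditionMessage(e))
      NULL
    })
  })
  names(out) <- basename(files)
  # drop files that failed to parse
  out[!vapply(out, is.null, logical(1))]
}

# Usage (assumes rnoaa is installed and I:/ISD holds the .gz files):
# parsed <- read_local_isd("I:/ISD", rnoaa::isd_read)
```

The result is a named list with one element per file that parsed, so a single bad file does not stop the whole run.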

sckott commented 8 years ago

@rjbehnke your script uses a file, isd-history.csv, that I don't have access to.

rjbehnke commented 8 years ago

Here's the isd-history file (contains only North American stations). The isd_read() function works great!

isd-history2.zip

rjbehnke commented 8 years ago

One other thing I can think of is the option to include/not include bad data in the output. There are a lot of different flags in the ISD data, and missing data is represented by different values for each variable, so I don't know how much automation you want to include in a function. But, for people who just want some nice output, perhaps some automation is ok. I have written QC routines for hourly data from ISD and other networks, but I am refining these routines (they need it before I feel comfortable making them available).

sckott commented 8 years ago

thanks for the file.

There are a lot of different flags in the ISD data, and missing data is represented by different values for each variable, so I don't know how much automation you want to include in a function. But, for people who just want some nice output, perhaps some automation is ok.

Correct. The data is pretty messy. Do you have code already to clean them up?
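As a starting point, the simplest kind of cleanup is probably turning the per-variable missing-value codes into NA. The sketch below shows one way to do that in base R. The column names and codes in the lookup are assumptions for illustration and should be checked against the ISD format document; codes_to_na() is a hypothetical helper, not something in rnoaa.

```r
# Hypothetical lookup of missing-value codes per column. The column names
# and codes below are assumptions for illustration; check them against
# the ISD format document before relying on them.
missing_codes <- list(
  temperature    = "+9999",
  wind_speed     = "9999",
  wind_direction = "999",
  air_pressure   = "99999"
)

# Replace each column's missing code with NA, leaving other columns alone.
codes_to_na <- function(df, codes = missing_codes) {
  for (col in intersect(names(codes), names(df))) {
    df[[col]][df[[col]] %in% codes[[col]]] <- NA
  }
  df
}

# Made-up observations, kept as character like the raw parsed fields
obs <- data.frame(
  temperature = c("+0123", "+9999"),
  wind_speed  = c("9999", "0031"),
  stringsAsFactors = FALSE
)
clean <- codes_to_na(obs)
```

Keeping the codes in one lookup means a variable can be added or corrected without touching the function; quality flags would need separate handling.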

rjbehnke commented 8 years ago

Scott,

I do have code, but it is not ready for 'production'. I am starting to refine it (which it desperately needs), but since I am also trying to graduate, it's not a fast process. You are welcome to take a look at it, and work with me in making it better (much better) if you want. Just let me know, and I can send you my code (whether or not it is understandable might be a different story:). There are some major things I want to change.

This code was used for QC of all kinds of sources of data, ranging from ISD to RAWS to many local/regional mesonets. So, it is generalized, and meant for hourly, not daily, data. It is also focused on humidity (specifically, dew point), but it does do general checks on RH and temperature. I would like to write an R package that users who collect their own data or download data from sources that do not do their own QC can use to perform QC. This is a BIG, challenging project, though. I will say that right now, I am likely removing more good data than I care to admit. But, for my work, I'm more concerned about the influence of even a couple bad data values.

Ruben



sckott commented 8 years ago

one thing to note is that I recently changed the output of isd() to a tibble (data.frame) instead of a data.frame nested in a list (see https://github.com/ropensci/rnoaa/commit/201ad62d7a9cae970426b5f54a4873dc196760cc)
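Scripts written against the older output (using $data) may need a small adjustment. Here is a sketch of a wrapper that accepts either shape; isd_data() is a hypothetical helper name, and the two example objects are stand-ins for what isd() returns before and after the change.

```r
# Hypothetical helper: return the observations whether `res` is the newer
# bare data.frame/tibble or the older list with a `data` element.
isd_data <- function(res) {
  if (is.data.frame(res)) {
    res
  } else if (is.list(res) && !is.null(res$data)) {
    res$data
  } else {
    stop("unexpected isd() result: ", class(res)[1])
  }
}

# Stand-ins for the old and new return shapes
old_style <- list(data = data.frame(temperature = "+0123"))
new_style <- data.frame(temperature = "+0123")

same <- identical(isd_data(old_style), isd_data(new_style))
```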

sckott commented 8 years ago

try it again after reinstalling devtools::install_github("ropensci/rnoaa")

here's a simpler version of your script, just focusing on making sure the file downloading/etc is working correctly. I think you shouldn't hit download fails anymore, though you might

library(dplyr)
library(rnoaa)

isd_history <- read.csv('~/Downloads/isd-history2.csv')
isd_history$CTRY <- as.character(isd_history$CTRY); isd_history$STATION.NAME <- as.character(isd_history$STATION.NAME)
isd_history <- subset(isd_history, isd_history$CTRY == 'US' | isd_history$CTRY == 'CA' | isd_history$CTRY == 'MX')
isd_history <- subset(isd_history, STATION.NAME != 'MOORED BUOY')

# zero-pad WBAN to 5 digits (e.g. 3017 -> "03017") to match the NOAA file names
isd_history$WBAN <- sprintf('%05d', as.integer(isd_history$WBAN))
isd_history$ID <- paste(isd_history$USAF, '-', isd_history$WBAN, sep = '')

for (stn in 1:10) {
  cat(stn, "\n")
  begin <- as.numeric(substr(isd_history$BEGIN[stn],1,4))
  end <- as.numeric(substr(isd_history$END[stn],1,4))

  for (yr in begin:end) {
    cat("  working on:", yr, "\n")
    res <- tryCatch(
      isd(isd_history$USAF[stn], isd_history$WBAN[stn], year = yr),
      error = function(e) e
    )
    if (inherits(res, "error")) {
      cat("failed on ", isd_history$USAF[stn], isd_history$WBAN[stn], yr, "\n")
    }
  }
}

sckott commented 8 years ago

Went through the first 6 or so rows of that history file, and it turns out there are some files that just don't exist on the NOAA FTP servers. For each of the stations below, some years worked fine but others failed; I looked on the FTP servers and those that failed just didn't have a file. So do use tryCatch() and just skip if the file is not found in your for loop. I'll add something to the docs about files not existing

## 621370-99999

failed on  621370 99999 2006 
failed on  621370 99999 2007 
failed on  621370 99999 2008 
failed on  621370 99999 2009 
failed on  621370 99999 2010 
failed on  621370 99999 2011 
failed on  621370 99999 2012 
failed on  621370 99999 2013 

## 690020-93218

failed on  690020 93218 1972 
failed on  690020 93218 1973 
failed on  690020 93218 1974 
failed on  690020 93218 1975 
failed on  690020 93218 1976 
failed on  690020 93218 1977 
failed on  690020 93218 1978 
failed on  690020 93218 1979 
failed on  690020 93218 1980 
failed on  690020 93218 1981 
failed on  690020 93218 1982 
failed on  690020 93218 1983 
failed on  690020 93218 1984 
failed on  690020 93218 1985 
failed on  690020 93218 1986 
failed on  690020 93218 1987 
failed on  690020 93218 1988 

## 690070-93217

failed on  690070 93217 1971 
failed on  690070 93217 1972 
failed on  690070 93217 1973 
failed on  690070 93217 1974 
failed on  690070 93217 1975 
failed on  690070 93217 1976 
failed on  690070 93217 1977 
failed on  690070 93217 1978 
failed on  690070 93217 1979 
failed on  690070 93217 1980 
failed on  690070 93217 1981 
failed on  690070 93217 1982 
failed on  690070 93217 1983 
failed on  690070 93217 1984 
failed on  690070 93217 1985 
failed on  690070 93217 1986 
failed on  690070 93217 1987 
failed on  690070 93217 1988 
failed on  690070 93217 1989 
failed on  690070 93217 1990 

## 690110-99999

failed on  690110 99999 1947 
failed on  690110 99999 1948 

rjbehnke commented 8 years ago

Thanks Scott. It just seemed strange that there were so many years missing from the middle of a station's time series. I guess it's just the way ISD is.



sckott commented 8 years ago

Right, I guess that's the way it is
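To see those gaps at a glance, the failed years collected in the loop can be collapsed into ranges. A small base R sketch; year_ranges() is a hypothetical helper, and the first input is the set of years reported as failed for 690070-93217 above.

```r
# Hypothetical helper: collapse a vector of years into "start-end" ranges,
# so that gaps in a station's record are easy to see.
year_ranges <- function(years) {
  years <- sort(unique(years))
  if (length(years) == 0) return(character(0))
  # start a new group wherever the step from the previous year is not 1
  grp <- cumsum(c(TRUE, diff(years) != 1))
  unname(vapply(split(years, grp), function(y) {
    if (length(y) == 1) as.character(y) else paste(min(y), max(y), sep = "-")
  }, character(1)))
}

failed <- 1971:1990               # failed years for 690070-93217
gap <- year_ranges(failed)
mixed <- year_ranges(c(1947, 1948, 1960))
```

One string per contiguous block of missing years is usually easier to scan than one line per failed download.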

rjbehnke commented 8 years ago

Scott,

The isd_read function works very well, but some errors arise when trying to read the csv files written out after using it. I assume these are probably associated with errors in the NCDC files. Here is a list of them. I would suggest that isd_read include functionality to look for these errors and either correct them or remove the rows they occur on (I have not seen any valid data on rows where these errors occur).

1) The columns 'total_chars', 'usaf_station', 'wban_station', 'date', and 'time' occasionally have bad values (or no data whatsoever) that look like "+0230" or "-0700", etc.

2) Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names (ex. "697774-99999")

3) Error in read.table(file = file, header = header, sep = sep, quote = quote, : duplicate 'row.names' are not allowed (ex. "467425-99999")

4) In addition: Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string

5) Error in as.POSIXlt.character(x, tz, ...) : character string is not in a standard unambiguous format
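Until something like that exists in the package, a rough sketch of the "remove the rows" option on the reading side: keep only rows whose date and time fields have the expected shape, and parse the timestamp with an explicit format so R does not have to guess. The column names date and time are taken from item 1; the helper drop_malformed_rows(), the sample file, and the YYYYMMDD / HHMM layouts are assumptions for illustration.

```r
# Hypothetical helper: keep only rows whose `date` looks like YYYYMMDD and
# whose `time` looks like HHMM, then build a timestamp with an explicit
# format so as.POSIXct() does not have to guess.
drop_malformed_rows <- function(df) {
  ok <- grepl("^[0-9]{8}$", df$date) & grepl("^[0-9]{4}$", df$time)
  df <- df[ok, , drop = FALSE]
  df$datetime <- as.POSIXct(paste0(df$date, df$time),
                            format = "%Y%m%d%H%M", tz = "UTC")
  df
}

# Read everything as character so codes such as "0030" keep their zeros;
# the file and its contents here are made up for the example.
tmp <- tempfile(fileext = ".csv")
writeLines(c("date,time,temperature",
             "19860101,0030,+0123",
             "+0230,-0700,+9999"), tmp)
raw <- read.csv(tmp, colClasses = "character")
good <- drop_malformed_rows(raw)
```

This only addresses items 1 and 5; the read.table errors in items 2 to 4 happen earlier, while the file is being read.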

Ruben Behnke



sckott commented 8 years ago

thanks @rjbehnke for this info. really helpful. It would be even more helpful if you could tell me which dataset requests lead to those errors, so I can quickly get examples that I can play with to sort these errors out.

rjbehnke commented 8 years ago

Scott,

Here's a document with info on the errors. I attached the script I'm using. Please let me know if you need something else.

Ruben



sckott commented 8 years ago

@rjbehnke didn't get the attachment. I think you have to use the github web interface maybe, or email it to me.

sckott commented 8 years ago

see file in #169

sckott commented 7 years ago

closing for now, let me know if there's anything we didn't sort out @rjbehnke