ropensci / rnoaa

R interface to many NOAA data APIs
https://docs.ropensci.org/rnoaa

GHCND values off by roughly one order of magnitude #264

Closed amoeba closed 6 years ago

amoeba commented 6 years ago

I think this could be user error but I haven't yet figured out my mistake. The only thing I can think of is that the API returns the data as integers to avoid issues with floating point numbers and, to do that, they multiply these values by 10.

library(rnoaa)
library(dplyr)

ncdc(datasetid = "GHCND", 
     stationid = 'GHCND:USW00026617', 
     startdate = '2018-04-01', 
     enddate = '2018-04-30', 
     limit = 1000)$data %>% 
  filter(datatype == "TAVG") %>% 
  summarize(tavg_mean = round(mean(value), 2))

Returns 31.5 when I expect a value around -3 (C). I ran the same code for other years and observed a similar result.

Session Info

```r
> devtools::session_info()
Session info ------------------------------------------------------------------------------------
 setting  value
 version  R version 3.5.0 (2018-04-23)
 system   x86_64, darwin15.6.0
 ui       RStudio (1.2.614)
 language (EN)
 collate  en_US.UTF-8
 tz       America/Juneau
 date     2018-05-07

Packages ----------------------------------------------------------------------------------------
 package    * version   date       source
 assertthat   0.2.0     2017-04-11 CRAN (R 3.5.0)
 base       * 3.5.0     2018-04-24 local
 bindr        0.1.1     2018-03-13 CRAN (R 3.5.0)
 bindrcpp   * 0.2.2     2018-03-29 CRAN (R 3.5.0)
 colorspace   1.3-2     2016-12-14 CRAN (R 3.5.0)
 compiler     3.5.0     2018-04-24 local
 curl         2.7       2018-05-02 Github (jeroen/curl@01e53c0)
 datasets   * 3.5.0     2018-04-24 local
 devtools     1.13.5    2018-02-18 CRAN (R 3.5.0)
 digest       0.6.15    2018-01-28 CRAN (R 3.5.0)
 dplyr      * 0.7.4     2017-09-28 CRAN (R 3.5.0)
 ggplot2      2.2.1     2016-12-30 CRAN (R 3.5.0)
 glue         1.2.0     2017-10-29 CRAN (R 3.5.0)
 graphics   * 3.5.0     2018-04-24 local
 grDevices  * 3.5.0     2018-04-24 local
 grid         3.5.0     2018-04-24 local
 gridExtra    2.3       2017-09-09 CRAN (R 3.5.0)
 gtable       0.2.0     2016-02-26 CRAN (R 3.5.0)
 hoardr       0.2.0     2017-05-10 CRAN (R 3.5.0)
 httr         1.3.1     2017-08-20 CRAN (R 3.5.0)
 jsonlite     1.5       2017-06-01 CRAN (R 3.5.0)
 lazyeval     0.2.1     2017-10-29 CRAN (R 3.5.0)
 lubridate    1.7.4     2018-04-11 CRAN (R 3.5.0)
 magrittr     1.5       2014-11-22 CRAN (R 3.5.0)
 memoise      1.1.0     2017-04-21 CRAN (R 3.5.0)
 methods    * 3.5.0     2018-04-24 local
 munsell      0.4.3     2016-02-13 CRAN (R 3.5.0)
 pillar       1.2.2     2018-04-26 CRAN (R 3.5.0)
 pkgconfig    2.0.1     2017-03-21 CRAN (R 3.5.0)
 plyr         1.8.4     2016-06-08 CRAN (R 3.5.0)
 purrr        0.2.4     2017-10-18 CRAN (R 3.5.0)
 R6           2.2.2     2017-06-17 CRAN (R 3.5.0)
 rappdirs     0.3.1     2016-03-28 CRAN (R 3.5.0)
 Rcpp         0.12.16   2018-03-13 CRAN (R 3.5.0)
 rlang        0.2.0     2018-02-20 CRAN (R 3.5.0)
 rnoaa      * 0.7.0     2017-05-06 CRAN (R 3.5.0)
 scales       0.5.0     2017-08-24 CRAN (R 3.5.0)
 stats      * 3.5.0     2018-04-24 local
 stringi      1.1.7     2018-03-12 CRAN (R 3.5.0)
 stringr      1.3.0     2018-02-19 CRAN (R 3.5.0)
 tibble       1.4.2     2018-01-22 CRAN (R 3.5.0)
 tidyr        0.8.0     2018-01-29 CRAN (R 3.5.0)
 tools        3.5.0     2018-04-24 local
 utils      * 3.5.0     2018-04-24 local
 withr        2.1.2     2018-03-15 CRAN (R 3.5.0)
 XML          3.98-1.11 2018-04-16 CRAN (R 3.5.0)
 xml2         1.2.0     2018-01-24 CRAN (R 3.5.0)
```

sckott commented 6 years ago

Thanks @amoeba, see also https://github.com/ropensci/rnoaa/issues/259

There's lots of data that we're pulling in with this pkg, so we aren't doing data transformations for the user, just trying to get the data.

For the variable TAVG we need to divide the values by 10, since it's reported in tenths of degrees C.
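As a rough sketch of that division, here is a base R version that runs without an API token. The `obs` data frame is a made-up stand-in for the `$data` element that `ncdc()` returns; only the `datatype` and `value` column names are taken from the code in the original report, and the numbers are invented for illustration.

```r
# Toy stand-in for ncdc(...)$data; the values are invented, not real observations.
obs <- data.frame(
  date     = as.Date("2018-04-01") + 0:3,
  datatype = c("TAVG", "TAVG", "TMAX", "TAVG"),
  value    = c(-50, 10, 44, -20),   # raw values, in tenths of degrees C
  stringsAsFactors = FALSE
)

# Keep only the TAVG rows, then convert tenths of degrees C to degrees C.
tavg <- obs[obs$datatype == "TAVG", ]
tavg$value_c <- tavg$value / 10

# Monthly-style summary on the converted values.
tavg_mean <- round(mean(tavg$value_c), 2)
print(tavg_mean)
```

With real data, the same `/ 10` step would slot into the dplyr pipeline between `filter()` and `summarize()`.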


Do you think it's a good idea to do these data conversions for users? Many or most of the conversions for GHCND are documented at ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt. Either way, we should definitely better document which conversions are or are not being done, and where to find info on how to convert (which is not always easy to find).

amoeba commented 6 years ago

Ah, sorry for the dupe. Search failed me.

> Do you think it's a good idea to do these data conversions for users?

This is the perennial problem with (tabular) data, isn't it: divorcing the data from the metadata, esp. units? Some folks are playing around with ways to avoid losing units information on tabular data in https://github.com/ropensci/EML.

The core issue I had as a user of the package was that the NOAA API doesn't document any of this or provide it over the API to a client such as rnoaa. I searched around the API documentation and didn't find confirmation of my hunch about the tenths of a degree. My programmer brain says "stick to the API" and my researcher brain says "don't help the user shoot themselves in the foot" (which is also a programmer brain thing).

Is there a way through the API to find a description such as:

    TAVG = Average temperature (tenths of degrees C)
           [Note that TAVG from source 'S' corresponds
           to an average for the period ending at
           2400 UTC rather than local midnight]

I thought this

ncdc_datatypes(stationid = nome_station_id, limit = 1000)

would get me some info, but that wasn't quite the ticket. If there were a good way to get the info users need through the API, the README, help docs, and vignettes could all encourage the same pattern: get metadata -> get data.
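Since the API doesn't expose units, one way to approximate the "get metadata" half of that pattern is a small local lookup table. The sketch below is only an illustration: `ghcnd_elements` and `describe_datatype()` are hypothetical names (not part of rnoaa), and the six rows are hand-transcribed from the GHCN daily readme, so they should be checked against that file before being relied on.

```r
# Hypothetical lookup table, hand-transcribed from the GHCN daily readme.
# It covers only a few core elements and is not provided by rnoaa or the API.
ghcnd_elements <- data.frame(
  datatype    = c("PRCP", "SNOW", "SNWD", "TMAX", "TMIN", "TAVG"),
  description = c("Precipitation", "Snowfall", "Snow depth",
                  "Maximum temperature", "Minimum temperature",
                  "Average temperature"),
  units       = c("tenths of mm", "mm", "mm",
                  "tenths of degrees C", "tenths of degrees C",
                  "tenths of degrees C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return a one-line description for a datatype id,
# or NA if the id is not in the local table.
describe_datatype <- function(id, elements = ghcnd_elements) {
  hit <- elements[elements$datatype == id, ]
  if (nrow(hit) == 0) {
    return(NA_character_)
  }
  sprintf("%s = %s (%s)", hit$datatype, hit$description, hit$units)
}

describe_datatype("TAVG")
# "TAVG = Average temperature (tenths of degrees C)"
```

A user would consult the table first and only then decide how to scale the `value` column of the downloaded data.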

sckott commented 6 years ago

Thanks for your feedback @amoeba

Yeah, it's a sticky situation. Ideally NOAA would provide richer metadata through the API, or even through FTP dumps of similar data, but they definitely do not. So it's up to us to sort out the best band-aid we can.

> Is there a way through the API to find a description such as

Nope, unfortunately not

I made an attempt at including documentation by basically making their PDF documentation into vignettes in the package, but there are so many different NOAA datasets we interact with that this isn't a great idea.

One approach could be as follows: e.g., in isd() https://github.com/ropensci/rnoaa/blob/master/R/isd.R the parsing is so awful that I made another pkg (isdparser) for the parsing. The isdparser pkg does have the ability to do conversions for the user, but they are not done automatically. We haven't brought this conversion bit into rnoaa yet. Anyway, the approach could be: implement conversions for each variable in each dataset, and let the user toggle whether they want conversions done or not. Maybe there'd be an option to convert during the main function's data gathering (e.g. ncdc(..., conversions = TRUE)) or after (e.g. x = ncdc(...); rnoaa_conversions(x))
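To make the "convert after" variant concrete, here is a minimal sketch of what such a toggle could look like. Nothing here exists in rnoaa: `convert_ghcnd_values()` and `ghcnd_scale` are hypothetical names, the `conversions` argument and `rnoaa_conversions()` above are only proposals from this thread, and the scale factors cover just a few GHCND elements taken from the GHCN daily readme.

```r
# Hypothetical scale factors (multiply the raw value to get degrees C or mm).
# Only elements stored in tenths are listed; anything else is left untouched.
ghcnd_scale <- c(TAVG = 0.1, TMAX = 0.1, TMIN = 0.1, PRCP = 0.1)

# Hypothetical post-hoc converter for a data frame with `datatype` and `value`
# columns, like the `$data` element returned by ncdc().
convert_ghcnd_values <- function(df, convert = TRUE, scale = ghcnd_scale) {
  if (!convert) {
    return(df)  # toggle off: hand back the data exactly as the API gave it
  }
  # Look up a factor per row; unknown datatypes come back NA and default to 1.
  factors <- unname(scale[as.character(df$datatype)])
  factors[is.na(factors)] <- 1
  df$value <- df$value * factors
  df
}

# Made-up raw values for illustration.
raw <- data.frame(
  datatype = c("TAVG", "SNOW", "PRCP"),
  value    = c(-31, 25, 102),
  stringsAsFactors = FALSE
)

converted <- convert_ghcnd_values(raw)
untouched <- convert_ghcnd_values(raw, convert = FALSE)
```

Keeping the conversion in a separate, opt-in step means the default output still matches what any other API client would see.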

Or maybe it's better to just document where all the NOAA docs can be found? And make it clear we don't do any conversions

amoeba commented 6 years ago

> Anyway, the approach could be: implement conversions for each variable in each dataset, and let the user toggle whether they want conversions done or not.

Is this one of those things that doesn't scale well in terms of developer time? I'm kind of a fan of API packages being as simple as possible so that (1) the user is least surprised most of the time and (2) the user gets the same result when they use the API with different clients (curl, Python, R, etc.) rather than encountering language-specific behaviors.

> Or maybe it's better to just document where all the NOAA docs can be found? And make it clear we don't do any conversions

I think this is the most sane way to go. My issue could have been avoided if I had been more diligent in searching for metadata. I was relying on https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USC00106681/detail which gave no information. It doesn't even list TAVG, which I'm pretty sure is part of the dataset. 🤷‍♂️

I'm happy to see this issue closed. The conversation has been fruitful and I appreciate it!

Related to this, I also noticed in the readme that it looks like this tenths-of-degrees issue crops up at https://github.com/ropensci/rnoaa#search-for-data where I see temperatures in the 600s. Also in https://github.com/ropensci/rnoaa/blob/master/vignettes/ncdc_vignette.Rmd

Maybe touching on this issue in those places would be helpful? I could work on that.

sckott commented 6 years ago

> Is this one of those things that doesn't scale well in terms of developer time?

probably yes

> Maybe touching on this issue in those places would be helpful? I could work on that.

If you're offering to help, please do. I guess we should open new issues for any new work, though.

amoeba commented 6 years ago

Yeah I for sure am. I'll try to do this relatively soon. Thanks!