Closed (amoeba closed this issue 6 years ago)
thanks @amoeba see also https://github.com/ropensci/rnoaa/issues/259
There's lots of data that we're pulling in with this pkg, so we aren't doing data transformations for the user, but rather just trying to get the data.
For the variable TAVG, we need to divide the values by 10 since it's in tenths of degrees C.
Do you think it's a good idea to do these data conversions for users? Many or most of the conversions for GHCND are described here: ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt. Either way, we should definitely document better which conversions are or are not being done and where to find info on how to convert (which is not always easy to find).
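For reference, the TAVG scaling described above amounts to a single division. A minimal sketch in base R, using made-up values rather than real API output (the data frame and its column names are assumptions for illustration):

```r
# Made-up TAVG readings in tenths of degrees C, the unit given in the GHCND readme.
obs <- data.frame(
  date  = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03")),
  value = c(-31, 15, 0)
)

# Convert tenths of degrees C to degrees C.
obs$tavg_c <- obs$value / 10

print(obs)
```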
Ah, sorry for the dupe. Search failed me.
> Do you think it's a good idea to do these data conversions for users?

This is the perennial problem with (tabular) data, isn't it: divorcing the data from the metadata, esp. units? Some folks are playing around a bit with this (that is, not losing units information on tabular data) in https://github.com/ropensci/EML.
The core issue I had as a user of the package was that the NOAA API doesn't document any of this or provide it over the API to a client such as rnoaa. I searched around the API documentation and didn't find the confirmation of my hunch about the tenths of a degree. My programmer brain says "stick to the API" and my researcher brain says "don't help the user shoot themselves in the foot" (which is also a programmer brain thing).
Is there a way through the API to find a description such as:
TAVG = Average temperature (tenths of degrees C)
[Note that TAVG from source 'S' corresponds
to an average for the period ending at
2400 UTC rather than local midnight]
I thought `ncdc_datatypes(stationid = nome_station_id, limit = 1000)` would get me some info, but that wasn't quite the ticket. If there were a good way to get the info users need through the API, the README, help docs, and vignettes could all reuse the same pattern (get metadata -> get data) to encourage it.
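Since the API does not seem to expose these descriptions, one stopgap for a "get metadata, then get data" pattern would be a small hand-maintained lookup transcribed from the GHCND readme. The sketch below is purely hypothetical: `ghcnd_units` and `describe_datatype()` are not part of rnoaa, and the table covers only a handful of core elements.

```r
# Hypothetical lookup, transcribed by hand from the GHCND readme (core elements only).
ghcnd_units <- c(
  PRCP = "Precipitation (tenths of mm)",
  SNOW = "Snowfall (mm)",
  SNWD = "Snow depth (mm)",
  TMAX = "Maximum temperature (tenths of degrees C)",
  TMIN = "Minimum temperature (tenths of degrees C)",
  TAVG = "Average temperature (tenths of degrees C)"
)

# Hypothetical helper: return the description for each datatype code,
# or a pointer to the readme when the code is not in the table.
describe_datatype <- function(datatype) {
  desc <- ghcnd_units[datatype]
  desc[is.na(desc)] <- "unknown; see the GHCND readme"
  unname(desc)
}

# "XXXX" is a deliberately made-up code to show the fallback.
describe_datatype(c("TAVG", "XXXX"))
```

A user would check the description first and only then request the data, so the units are known before any values are plotted or summarised.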
Thanks for your feedback @amoeba
Yeah, it's a sticky situation. Ideally NOAA would provide richer metadata through the API, or even through FTP dumps of similar data, but they definitely do not. So it's up to us to sort out the best band-aid we can.
> Is there a way through the API to find a description such as

Nope, unfortunately not
I made an attempt at including documentation by basically making their PDF documentation into vignettes in the package, but there are so many different NOAA datasets we interact with that it isn't a great idea.
One approach could be as follows. For example, in `isd()` (https://github.com/ropensci/rnoaa/blob/master/R/isd.R) the parsing is so awful that we made another pkg (`isdparser`) for the parsing. The `isdparser` pkg does have the ability to do conversions for the user, but they are not done automatically. We haven't brought this conversion bit into `rnoaa` yet. Anyway, the approach could be: implement conversions for each variable in each dataset, and let the user toggle whether they want conversions done or not. Maybe there'd be an option to convert during the main function data gathering (e.g. `ncdc(..., conversions = TRUE)`) or after (e.g. `x = ncdc(...); rnoaa_conversions(x)`).
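To make the opt-in idea concrete, here is a rough sketch of the "convert after" variant. Everything in it is hypothetical: `convert_ghcnd()` is not an rnoaa function, the input is assumed to be a data frame with `datatype` and `value` columns, and the scale factors cover only the tenths-based core elements listed in the GHCND readme.

```r
# Hypothetical scale factors: elements the GHCND readme reports in tenths of a unit.
ghcnd_scale <- c(TAVG = 0.1, TMAX = 0.1, TMIN = 0.1, PRCP = 0.1)

# Hypothetical helper: scale `value` by datatype, or return the input untouched
# when the user opts out with conversions = FALSE.
convert_ghcnd <- function(x, conversions = TRUE) {
  if (!conversions) return(x)
  scale_by <- ghcnd_scale[as.character(x$datatype)]
  scale_by[is.na(scale_by)] <- 1  # datatypes without a known factor are left as-is
  x$value <- x$value * unname(scale_by)
  x
}

# Made-up input rows for illustration.
raw <- data.frame(
  datatype = c("TAVG", "PRCP", "SNOW"),
  value    = c(-31, 25, 120)
)

converted <- convert_ghcnd(raw)
print(converted)
```

Keeping the default behaviour explicit matters here: with `conversions = FALSE` the result matches what any other API client would return, which speaks to the "least surprise" concern.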
Or maybe it's better to just document where all the NOAA docs can be found? And make it clear we don't do any conversions
> Anyway, the approach could be: implement conversions for each variable in each dataset, and let the user toggle whether they want conversions done or not.

Is this one of those things that doesn't scale well in terms of developer time? I'm kind of a fan of API packages being as simple as possible so that (1) the user is least surprised most of the time and (2) the user gets the same result when they use the API with different clients (curl, Python, R, etc.) rather than encountering language-specific behaviors.
> Or maybe it's better to just document where all the NOAA docs can be found? And make it clear we don't do any conversions

I think this is the most sane way to go. My issue could have been avoided if I was more diligent in searching for metadata. I was relying on https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USC00106681/detail which gave no information. It doesn't even list TAVG which I'm pretty sure is part of the dataset. 🤷♂️
I'm happy to see this issue closed. The conversation has been fruitful and I appreciate it!
Related to this, I also noticed in the readme that it looks like this tenths-of-degrees issue crops up at https://github.com/ropensci/rnoaa#search-for-data where I see temperatures in the 600s. Also in https://github.com/ropensci/rnoaa/blob/master/vignettes/ncdc_vignette.Rmd
Maybe touching on this issue in those places would be helpful? I could work on that.
> Is this one of those things that doesn't scale well in terms of developer time?

probably yes
> Maybe touching on this issue in those places would be helpful? I could work on that.

if you're offering to help ... please do - i guess we should open new issues for any new work though
Yeah I for sure am. I'll try to do this relatively soon. Thanks!
I think this could be user error but I haven't yet figured out my mistake. The only thing I can think of is that the API returns the data as integers to avoid issues with floating point numbers and, to do that, they multiply these values by 10.
Returns 31.5 when I expect a value around -3 (C). I ran the same code for other years and observed a similar result.
Session Info
```r
> devtools::session_info()
Session info ------------------------------------------------------------------------------------
 setting  value
 version  R version 3.5.0 (2018-04-23)
 system   x86_64, darwin15.6.0
 ui       RStudio (1.2.614)
 language (EN)
 collate  en_US.UTF-8
 tz       America/Juneau
 date     2018-05-07

Packages ----------------------------------------------------------------------------------------
 package    * version   date       source
 assertthat   0.2.0     2017-04-11 CRAN (R 3.5.0)
 base       * 3.5.0     2018-04-24 local
 bindr        0.1.1     2018-03-13 CRAN (R 3.5.0)
 bindrcpp   * 0.2.2     2018-03-29 CRAN (R 3.5.0)
 colorspace   1.3-2     2016-12-14 CRAN (R 3.5.0)
 compiler     3.5.0     2018-04-24 local
 curl         2.7       2018-05-02 Github (jeroen/curl@01e53c0)
 datasets   * 3.5.0     2018-04-24 local
 devtools     1.13.5    2018-02-18 CRAN (R 3.5.0)
 digest       0.6.15    2018-01-28 CRAN (R 3.5.0)
 dplyr      * 0.7.4     2017-09-28 CRAN (R 3.5.0)
 ggplot2      2.2.1     2016-12-30 CRAN (R 3.5.0)
 glue         1.2.0     2017-10-29 CRAN (R 3.5.0)
 graphics   * 3.5.0     2018-04-24 local
 grDevices  * 3.5.0     2018-04-24 local
 grid         3.5.0     2018-04-24 local
 gridExtra    2.3       2017-09-09 CRAN (R 3.5.0)
 gtable       0.2.0     2016-02-26 CRAN (R 3.5.0)
 hoardr       0.2.0     2017-05-10 CRAN (R 3.5.0)
 httr         1.3.1     2017-08-20 CRAN (R 3.5.0)
 jsonlite     1.5       2017-06-01 CRAN (R 3.5.0)
 lazyeval     0.2.1     2017-10-29 CRAN (R 3.5.0)
 lubridate    1.7.4     2018-04-11 CRAN (R 3.5.0)
 magrittr     1.5       2014-11-22 CRAN (R 3.5.0)
 memoise      1.1.0     2017-04-21 CRAN (R 3.5.0)
 methods    * 3.5.0     2018-04-24 local
 munsell      0.4.3     2016-02-13 CRAN (R 3.5.0)
 pillar       1.2.2     2018-04-26 CRAN (R 3.5.0)
 pkgconfig    2.0.1     2017-03-21 CRAN (R 3.5.0)
 plyr         1.8.4     2016-06-08 CRAN (R 3.5.0)
 purrr        0.2.4     2017-10-18 CRAN (R 3.5.0)
 R6           2.2.2     2017-06-17 CRAN (R 3.5.0)
 rappdirs     0.3.1     2016-03-28 CRAN (R 3.5.0)
 Rcpp         0.12.16   2018-03-13 CRAN (R 3.5.0)
 rlang        0.2.0     2018-02-20 CRAN (R 3.5.0)
 rnoaa      * 0.7.0     2017-05-06 CRAN (R 3.5.0)
 scales       0.5.0     2017-08-24 CRAN (R 3.5.0)
 stats      * 3.5.0     2018-04-24 local
 stringi      1.1.7     2018-03-12 CRAN (R 3.5.0)
 stringr      1.3.0     2018-02-19 CRAN (R 3.5.0)
 tibble       1.4.2     2018-01-22 CRAN (R 3.5.0)
 tidyr        0.8.0     2018-01-29 CRAN (R 3.5.0)
 tools        3.5.0     2018-04-24 local
 utils      * 3.5.0     2018-04-24 local
 withr        2.1.2     2018-03-15 CRAN (R 3.5.0)
 XML          3.98-1.11 2018-04-16 CRAN (R 3.5.0)
 xml2         1.2.0     2018-01-24 CRAN (R 3.5.0)
```