Closed by lwjohnst86 3 years ago
Thoughts on guiding principles:
A bit of a brain dump of ideas that have occurred to me over the last week or two:
I've built a couple of data exercises around this FiveThirtyEight article about visits to national parks. A package might include the data (which is a bit painful to get from the NPS), plus a function or two that take a park name and spit out a summary or plots.
A package with custom climate summaries. E.g. when I'm thinking about living in other places, I don't find a plot of the min temp, max temp and rainfall by month very useful, I want to know things like: how many days a year will I get wet on my commute, or during winter how many days can I expect to go without seeing the sun? We could provide data for one location, and have one or two specific summaries or plots to build, but there would also be lots of room for customization/expansion by instructors or students to other locations, or other summaries.
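A summary like that could stay quite small. Here is a minimal sketch, assuming hourly weather data with a datetime column and a rain amount column (both column names, and the function name, are hypothetical):

```r
# Sketch of one possible climate summary: how many days a year would I get
# wet on my commute? Assumes hourly data with `datetime` (POSIXct) and
# `rain_mm` columns -- invented names for illustration.
wet_commute_days <- function(weather, commute_hours = c(8, 17)) {
  hr <- as.integer(format(weather$datetime, "%H"))
  commute <- weather[hr %in% commute_hours & weather$rain_mm > 0, ]
  length(unique(as.Date(commute$datetime)))
}

# Tiny synthetic example: rain during the 8am commute on two different days
wx <- data.frame(
  datetime = as.POSIXct(c("2021-01-04 08:00", "2021-01-04 17:00",
                          "2021-01-05 08:00", "2021-01-06 12:00"), tz = "UTC"),
  rain_mm  = c(1.2, 0, 0.4, 5)
)
wet_commute_days(wx)  # 2
```

The commute_hours argument is exactly the kind of knob that gives students room to localize the summary to their own schedule.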
A very simple package for working with missing values. Functions might include simple versions of things like naniar::count_missing(), tidyr::replace_na(), and dplyr::na_if(). Data could be anything that has missing values (or should have missing values), primarily as a way to build and demo the functions.
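Simplified base-R versions of those helpers could look something like this (names and behavior are illustrative sketches, not the real naniar/tidyr/dplyr implementations):

```r
# Minimal sketches of the kinds of functions such a package could teach.
count_missing <- function(x) sum(is.na(x))

replace_missing <- function(x, replacement) {
  x[is.na(x)] <- replacement
  x
}

# Analogous in spirit to dplyr::na_if(): turn a sentinel value into NA.
# which() is used so existing NAs in x don't break the assignment.
make_missing <- function(x, value) {
  x[which(x == value)] <- NA
  x
}

v <- c(1, -99, 3, NA)
count_missing(v)                     # 1
count_missing(make_missing(v, -99))  # 2
replace_missing(v, 0)                # 1 -99 3 0
```

Each is only a line or two of body, which seems about the right difficulty for a first package.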
The book The Nature of Code explores "computer simulations of natural systems using Processing." I wonder if we could take one and build a package around it, e.g. one-dimensional cellular automata. The package might have a function for plotting a vector of zeros and ones, functions for evolving such a vector, functions that control evolution, etc.
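A one-step evolution function along those lines fits in a few lines of base R. This is a sketch with invented names, using rule 110 as an example Wolfram rule:

```r
# Evolve a vector of 0s and 1s one step under an elementary cellular
# automaton rule. The rule number's bits give the output for each of the
# eight possible three-cell neighborhoods (0..7); edges wrap around.
evolve <- function(cells, rule = 110) {
  rule_bits <- as.integer(intToBits(rule))[1:8]
  n <- length(cells)
  left  <- c(cells[n], cells[-n])
  right <- c(cells[-1], cells[1])
  idx <- left * 4 + cells * 2 + right  # neighborhood as a number 0..7
  rule_bits[idx + 1]
}

start <- c(0, 0, 0, 1, 0, 0, 0)
evolve(start, rule = 110)  # 0 0 1 1 0 0 0
```

Repeatedly calling evolve() and stacking the results gives the familiar triangle plots, which would make a nice plotting-function exercise.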
A package for "branded" analysis, e.g. data might be corporate/team/school colors, functions might include a custom ggplot2 theme.
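The data side of a branded package could be as small as a named vector of colors plus an accessor. A sketch, with an entirely invented palette:

```r
# Sketch of the "branded" idea: brand colors as package data, plus a small
# palette accessor. Colors and names here are invented examples.
brand_colors <- c(primary = "#DC4405", secondary = "#000000", light = "#F7F5E7")

brand_palette <- function(n = length(brand_colors)) {
  unname(brand_colors[seq_len(n)])
}

brand_palette(2)  # "#DC4405" "#000000"
```

A custom ggplot2 theme (say, a theme_brand() wrapping an existing theme with these colors) would be the natural companion function.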
Love these ideas, @cwickham! I was also thinking a weather package would be fun - relevant to everyone in the world, easy to customize to suit specific interests and locations. The package could focus on actually fetching, parsing, and cleaning data, or it could focus on building specific summaries. I think summaries are more interesting, personally, but the first part is useful too (and if we include it, we could provide as much hand-holding as we like). Here are some potential data sources:
A very quick Google search suggests that this is not a completely saturated problem - a cool, simple weather-analyzer package might actually be generally useful as well.
Really liking these ideas! Weather is something that everyone experiences so this is super general purpose!
Along similar lines would be something that makes summaries of "cost of living" in various places. E.g. Numbeo has a bunch of stuff related to that. Not sure how easy it is to get the data but something to think about as well.
Some updates from my exploration of the weather/climate package ideas and the potential data sources mentioned above.
Provides a 7-day forecast, along with a monthly climate summary, by hitting a URL.
library(wwis)
city_search("portland")
#> # A tibble: 3 x 3
#> country city cityid
#> <chr> <chr> <dbl>
#> 1 Australia Portland 1720
#> 2 United States of America Portland, Maine 809
#> 3 United States of America Portland, Oregon 810
portland <- city_id("Portland, Oregon")
forecast(portland)
#> # A tibble: 7 x 8
#> forecastDate wxdesc weather minTemp maxTemp minTempF maxTempF weatherIcon
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
#> 1 2021-01-14 "" Sunny "" 13 "" 55 2402
#> 2 2021-01-15 "" Sunny Perio… "7" 11 "45" 51 2201
#> 3 2021-01-16 "" Fog "3" 12 "37" 54 1601
#> 4 2021-01-17 "" Light Showe… "4" 11 "40" 51 1201
#> 5 2021-01-18 "" Sunny Perio… "4" 9 "39" 48 2201
#> 6 2021-01-19 "" Light Showe… "2" 9 "35" 48 1201
#> 7 2021-01-20 "" Mostly Clou… "3" 8 "38" 47 2302
climate(portland)
#> # A tibble: 12 x 10
#> month maxTemp minTemp meanTemp maxTempF minTempF meanTempF raindays rainfall
#> <int> <chr> <chr> <lgl> <chr> <chr> <lgl> <chr> <chr>
#> 1 1 8.3 2.1 NA 47.0 35.8 NA 18.0 124.0
#> 2 2 10.7 2.4 NA 51.3 36.3 NA 14.9 93.0
#> 3 3 13.7 4.2 NA 56.7 39.6 NA 17.6 93.5
#> 4 4 16.3 6.2 NA 61.4 43.1 NA 16.4 69.3
#> 5 5 20.0 9.2 NA 68.0 48.6 NA 13.6 62.7
#> 6 6 23.1 12.0 NA 73.5 53.6 NA 9.2 43.2
#> 7 7 27.0 14.3 NA 80.6 57.8 NA 4.1 16.5
#> 8 8 27.3 14.4 NA 81.1 58.0 NA 3.9 17.0
#> 9 9 24.3 11.7 NA 75.8 53.1 NA 6.7 37.3
#> 10 10 17.7 7.8 NA 63.8 46.0 NA 12.5 76.2
#> 11 11 11.6 4.7 NA 52.8 40.5 NA 19.0 143.0
#> 12 12 7.6 1.8 NA 45.6 35.2 NA 18.6 139.4
#> # … with 1 more variable: climateFromMemDate <chr>
A minimal version of the package could include just the city_ids data and the forecast() function. Adding easier ID search methods, accessing the climate summary, or better display of the forecast or climate summary could be extensions.
The guts of the package aren't as easily customized/localized; e.g. most people will end up building the exact same package.
I implemented it with the tidyverse, which leads to a lot of little dependency-management details to pass check() (e.g. using the .data pronoun and importing it from rlang).
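Concretely, the kind of detail this involves looks like the following package-code fragment (hot_days() is an invented example function; the roxygen tag is what generates the NAMESPACE import):

```r
# Unquoted column names inside dplyr verbs trigger "no visible binding for
# global variable" NOTEs in R CMD check, so package code refers to columns
# via the .data pronoun and imports it from rlang.
#' @importFrom rlang .data
hot_days <- function(weather) {
  dplyr::filter(weather, .data$maxTemp > 30)
}
```

It's a small thing, but it's exactly the sort of detail learners hit the first time they run check() on tidyverse-style code.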
Depends on the WWIS API staying constant, or we'll need to update whenever they do.
Provides monthly climate summaries. WWIS appears to use the most recent set of these for their climate summaries, but WWR has summaries for every decade back to 1921-30. The data will require some effort in parsing, but we could provide it parsed and build a package around doing something with the data.
It's not obvious what the functionality should or could be; some options:
Has an API for forecast data, and limited (5-day) historical data. Building a package around the API is probably too much for this audience. It also has a "Bulk History" download (for a fee), which provides hourly historical data for a location that we could build a package around. I played with this for Corvallis.
library(corvweather)
weather
#> # A tibble: 377,245 x 22
#> datetime year month temp feels_like temp_min temp_max pressure
#> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1978-12-31 16:00:00 1978 12 267. 260. 266. 269. 1034
#> 2 1978-12-31 17:00:00 1978 12 267. 259. 266. 269. 1033
#> 3 1978-12-31 18:00:00 1978 12 265. 260. 264. 269. 1035
#> 4 1978-12-31 19:00:00 1978 12 263. 257. 263. 264. 1035
#> 5 1978-12-31 20:00:00 1978 12 263. 257. 263. 264. 1035
#> 6 1978-12-31 21:00:00 1978 12 262. 256. 261. 264. 1036
#> 7 1978-12-31 22:00:00 1978 12 262. 256. 261. 263. 1036
#> 8 1978-12-31 23:00:00 1978 12 262. 256. 261. 263. 1036
#> 9 1979-01-01 00:00:00 1979 1 261. 254. 261. 263. 1037
#> 10 1979-01-01 01:00:00 1979 1 262. 256. 261. 264. 1037
#> # … with 377,235 more rows, and 14 more variables: sea_level <lgl>,
#> # grnd_level <lgl>, humidity <dbl>, wind_speed <dbl>, wind_deg <dbl>,
#> # rain_1h <dbl>, rain_3h <dbl>, snow_1h <dbl>, snow_3h <dbl>,
#> # clouds_all <dbl>, weather_id <dbl>, weather_main <chr>,
#> # weather_description <chr>, weather_icon <chr>
Plus some functions that do some kind of summary of this data. (I need to think about these).
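One candidate, sketched on a synthetic stand-in for the hourly data above (column names follow the corvweather printout; the function name is invented): count the days per month with any recorded rain.

```r
# Count days per month with any recorded rain (rain_1h > 0), using the
# datetime, month, and rain_1h columns from the hourly weather data.
rainy_days_by_month <- function(weather) {
  wet <- weather[!is.na(weather$rain_1h) & weather$rain_1h > 0, ]
  days <- unique(data.frame(month = wet$month, day = as.Date(wet$datetime)))
  table(days$month)
}

# Synthetic example: two rainy days in January, none in February
hourly <- data.frame(
  datetime = as.POSIXct(c("1979-01-01 03:00", "1979-01-01 09:00",
                          "1979-01-02 12:00", "1979-02-10 08:00"), tz = "UTC"),
  month    = c(1, 1, 1, 2),
  rain_1h  = c(0.5, 1.1, 0.2, NA)
)
rainy_days_by_month(hourly)
```

Similar one-liners (mean daily max temperature by month, first/last frost dates, etc.) would give a family of functions at roughly the same difficulty.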
Using a different geographical location requires a USD $10 investment. Might be OK for an instructor to do once, but we don't want individual learners having to do this.
I need to work through what some functions might look like. Will these be too complex?
A possible source of daily weather data: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND
Like the idea above, I imagine we'd provide data for one location and the package would revolve around building summaries of that data (i.e. the monthly climate summary WWIS returns).
We'll discuss this during the meeting, but I'm adding it so it's here. Here are my thoughts about the climate data source.
I've looked over some of the options and I think the GHCN option is the best for our needs. Instructors and learners can search and download data for their area with https://www.ncdc.noaa.gov/cdo-web/. If they want a bit more of a challenge, they can even write or use simple API requests with https://www.ncei.noaa.gov/support/access-data-service-api-user-documentation, so the difficulty can be modified as needed. We'd just have to write detailed instructions for instructors/self-learners on how to use that API properly, since the website documentation doesn't seem that well written. There's also the FTP link: ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/. Anyway, that's my thought.
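For reference, here is roughly what such a request could look like as a URL builder (parameter names are from my reading of the Access Data Service docs linked above and are worth double-checking; the station ID shown is only illustrative):

```r
# Build a GHCND daily-summaries request URL for the NCEI Access Data Service.
# Parameter names (dataset, stations, startDate, endDate, format) are my
# reading of the linked API docs -- verify against the documentation.
ghcnd_url <- function(station, start, end) {
  paste0(
    "https://www.ncei.noaa.gov/access/services/data/v1",
    "?dataset=daily-summaries",
    "&stations=", station,
    "&startDate=", start,
    "&endDate=", end,
    "&format=csv"
  )
}

# The result could then be fetched with e.g. read.csv(ghcnd_url(...))
ghcnd_url("USW00024221", "2020-01-01", "2020-12-31")
```

Keeping the URL construction in one small function like this would let learners swap in their own station with a single argument change.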
There is also the package rnoaa for getting GHCN data - I've played with it a bit and it will greatly simplify getting clean data (for us and for other people to customize the assignment).
If we provide a concrete example, what location(s) would we want to use?
This is basically done from the previous phase.
Leave comments, ideas, thoughts here!