ropensci / unconf17

Website for 2017 rOpenSci Unconf
http://unconf17.ropensci.org

Alternatives to mtcars #91

Closed (sfirke closed this issue 7 years ago)

sfirke commented 7 years ago

(shoutout to https://twitter.com/tamaramunzner/status/743857012476280833)

Does anyone have a go-to versatile sample data set, or favorite package of data sets? I don't find that mtcars or any of the other built-ins like ToothGrowth or CO2 meet my needs. I know nycflights13::flights and ggplot2::diamonds but would like a data.frame with:

  1. Content area/topic that is as universally accessible and intuitive as possible, should not require explaining
  2. Has a mix of all kinds of column classes, including character vectors
  3. Has some dirtiness to it (maybe even a clean and dirty version of the same thing?)
  4. Generally optimized for teaching and StackOverflow-type examples or slideshow-type demos

Nice if it could also be used for machine learning (that's less important to me but would increase general usefulness). I'd also consider using it for unit tests in packages (I use mtcars in places but it's limited).

Does such an all-purpose demo data set exist that I'm missing?
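
A quick way to see the gap, as a minimal sketch using only base R and its bundled datasets (which checks to run is my own choice here, not a standard test of a "good" demo data set):

```r
# Column classes in mtcars: tabulate how many columns have each class
table(sapply(mtcars, class))

# iris has one factor column (Species) but no character columns
sapply(iris, class)

# Neither built-in contains missing values, so there is nothing to clean
anyNA(mtcars)
anyNA(iris)
```

Running the same few calls on a candidate data set gives a rough read on criteria 2 and 3 above.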

sckott commented 7 years ago

possibly have a look at https://github.com/rudeboybert/fivethirtyeight (many datasets) via @rudeboybert and @ismayc

ismayc commented 7 years ago

Many worked-out examples of analyses of these datasets by my students are here. (The source R Markdown files are on GitHub here.) Further analysis and examples are in my free DataCamp course here and as vignettes for our fivethirtyeight package here.

batpigandme commented 7 years ago

> Content area/topic that is as universally accessible and intuitive as possible, should not require explaining.

@sfirke For me, at least, sports data fits the bill here, but I'm pretty sure that's not universal. (Then again, plenty of people don't know the difference between SOHC and DOHC, and that doesn't stop them from using mtcars.)

cboettig commented 7 years ago

Of course there's the excellent gapminder data package from @jennybc; the README points to some good examples of teaching the tidyverse from her course.

karthik commented 7 years ago

This exact idea was an unconference topic two years ago. Mine Çetinkaya-Rundel, Eli Bessert, and one other person had the goal of creating a collection of interesting datasets (spanning various domains and appropriateness for teaching topic x). It even had a website (I can try to dig up any work they did, including past links).

The titanic dataset (https://www.kaggle.com/c/titanic/data) is fun for basic data manipulation and data viz classes, and it's relatable to a broad audience. (iris, on the other hand, is not particularly exciting, even for biologists, though it helps that it is a nice even dataset with 50 rows per species and comes baked into base R.)

This is the best archive I know of and use quite regularly → https://github.com/caesar0301/awesome-public-datasets

rudeboybert commented 7 years ago

The quickest way to browse thru the fivethirtyeight package data sets is via the "Data Sets" section of the package vignette. If you have thoughts on ways to improve it, contact me, @ismayc, or @jchunn. Thanks!

sfirke commented 7 years ago

Between the wealth of examples (thanks all!) and this having been tackled at a past unconference, I'm closing this ✅