ropensci / unconf15

rOpenSci's San Francisco hackathon/unconf 2015
http://unconf.ropensci.org

A contemporary Iris dataset package #21

Open vsbuffalo opened 9 years ago

vsbuffalo commented 9 years ago

This has been something on my todo list for a while: an R package that's a collection of useful teaching datasets — in a way, a contemporary Fisher's Iris dataset. The idea is to have a bunch of carefully chosen different datasets that facilitate learning how to work with large data and its issues. For example:

In general, the datasets in this package should be a hotbed of problems the average R user will encounter. I think such a package would greatly help in R teaching sessions. Ideally the package would have a vignette of solutions — and maybe even multiple vignettes that show how to solve problems with different tools.

Myfanwy commented 9 years ago

I second this

andeek commented 9 years ago

+1, this sounds like a great idea.

lmullen commented 9 years ago

I think this would be very worthwhile, even more so if the package included datasets of interest to multiple disciplines. I started doing something related for history with the historydata package.

karthik commented 9 years ago

This would be amazing! @jennybc did a great job with the gapminder package. This would be super useful for Data Carpentry and SWC as well. Cc @tracykteal

jordansread commented 9 years ago

:+1:

richfitz commented 9 years ago

:+1: - always a need for these! Another potential source of interesting and available datasets: public transport usage (e.g., Sydney). Pretty sure there will be other transport nerds around.

ledell commented 9 years ago

:+1: Great idea. The UCI Machine Learning Repository is a decent source of datasets. Most of these are already clean, but some could still be used to demonstrate common data transformation tasks, if that fits the scope of your package.

kellieotto commented 9 years ago

Very keen on this idea

sckott commented 9 years ago

@ledell somebody was (and may still be?) planning to contribute ucipp to rOpenSci (placeholder: https://github.com/ropensci/ucipp); their repo: https://github.com/lpfgarcia/ucipp

karthik commented 9 years ago

They've dropped off the radar but that was the plan for a while.

ledell commented 9 years ago

@sckott and @karthik - Thanks for the heads up about the ucipp repo. I know that CRAN will not allow packages above 100MB; does rOpenSci have an opinion on the size of R packages? I'm curious about the limitations for including a bunch of datasets in one package.

richfitz commented 9 years ago

If the data is too big for the package, we could instead ship scripts that download the datasets into place (using github.com/hadley/rappdirs to keep things organised/platform-independent). So we could ship pointers to big datasets, keeping the main download light.
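A minimal sketch of that download-and-cache pattern, using the rappdirs package mentioned above. The function name, package name, and CSV assumption here are all hypothetical, just to show the shape of the idea:

```r
# Sketch: download a dataset on first use and cache it locally,
# using rappdirs to pick a platform-appropriate data directory.
# fetch_dataset(), "amazingdata", and the CSV format are hypothetical.
library(rappdirs)

fetch_dataset <- function(name, url) {
  cache_dir <- user_data_dir("amazingdata")
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, paste0(name, ".csv"))
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")  # only hit the network once
  }
  read.csv(dest, stringsAsFactors = FALSE)
}
```

Subsequent calls for the same dataset would read straight from the local cache, so the package itself stays tiny.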

karthik commented 9 years ago

I think another approach might be to just break down these data packages either by domain or application. It's not a huge burden for a course participant to install a few data packages, e.g. install.packages(c('foo1', 'foo2', ...)), as necessary.

The problem with downloading from a location is that we are expecting it to be persistent and reachable. At least with CRAN we can expect that a fast local mirror (or rstudio) will have a copy readily available.

richfitz commented 9 years ago

Is that harder than

install.packages("amazing_data")
amazing_data::fetch(c("domain1", "domain2"))

This approach would also have the advantage of keeping the git repo small and light.

karthik commented 9 years ago

This approach would also have the advantage of keeping the git repo small and light.

Didn't say it was harder. Aren't we expecting to push this to CRAN? Where will the data reside?

richfitz commented 9 years ago

If the data already have canonical sources, then presumably we don't need to re-host them? Though for small datasets (<1MB) probably anything is OK.

tracykteal commented 9 years ago

This is a great idea and would be very useful to Data Carpentry, Software Carpentry and other courses and workshops, like the Reproducible Research one. I agree with @karthik to break down the data packages, so only the relevant ones can be installed. Could we host them at some public data repository, with the added benefit that we would be showing how to work with public datasets (using for example the rOpenSci packages dataone or rfigshare) and they could be used more easily in non-R curriculum?

stephaniehicks commented 9 years ago

great idea!

dholstius commented 9 years ago

Check out https://github.com/holstius/promisedat for an alternative to the pkg::fetch_data() approach ... it turns out it's legitimate to bundle promises in lieu of "actual" package data. The examples in the promisedat pkg are just promises to read CSV files from inst/extdata, but they could just as easily be promises to download & parse large datasets. Curious to get feedback! (And apologies for crashing the party.)
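For context, the promise trick can be sketched with base R's delayedAssign(): the object is bound lazily and only materialized the first time it is touched. The file and package names below are hypothetical:

```r
# Sketch of the promise idea: "mydata" is created lazily, so the file
# is only read (or downloaded) on first access, not at load time.
# "mydata.csv" and "mypkg" are hypothetical names.
delayedAssign("mydata", {
  path <- system.file("extdata", "mydata.csv", package = "mypkg")
  read.csv(path, stringsAsFactors = FALSE)
})
```

Until something actually evaluates `mydata`, no I/O happens, which is what keeps the package light while still feeling like ordinary bundled data.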

ebressert commented 9 years ago

Seriously an awesome idea. We're definitely in need of something like this.

noamross commented 9 years ago

Two thoughts:

srvanderplas commented 9 years ago

I have a dataset of OKCupid profile data (no names, just profile answers, etc.) which we've used in R courses to teach plyr and reshape. It was obtained by scraping profile data, but might be possible to get OKC to release a small sample dataset. Students seem to enjoy using it - there are enough interesting fields (orientation, gender, height, social values, etc.) and data cleaning opportunities that it makes a very interesting dataset for teaching R use.

karthik commented 9 years ago

@srvanderplas That would be awesome! I think it would appeal to a diverse crowd, the same way Jenny's gapminder did.

srvanderplas commented 9 years ago

The problem is that the dataset I have somewhat possibly maybe violates the user agreement? I would want to make sure that we're aboveboard ethically by including it.

I also have a dataset of craigslist ads, but it's a bit sparse as they got better IP banning filters about halfway into my script optimization :).


ebressert commented 9 years ago

Another interesting data resource: Everpix. It was a startup that failed for various reasons, and they published the nitty-gritty details of their company. The data formats vary in cleanliness as well.

karthik commented 9 years ago

The problem is that the dataset I have somewhat possibly maybe violates the user agreement? I would want to make sure that we're aboveboard ethically by including it.

Good point. I totally forgot that it would easily violate their TOS (but I didn't see anything in a quick scan https://www.okcupid.com/legal/terms).

@ebressert The everpix data looks amazing!

drisso commented 9 years ago

I didn't read the whole thread so maybe somebody already mentioned this, but Sandrine just pointed out these packages on CRAN: http://cran.r-project.org/web/packages/mlbench/index.html http://cran.r-project.org/web/packages/ElemStatLearn/index.html

Might be worth having a look at these.

kellieotto commented 9 years ago

I really like this dataset for comparing causal inference/matching methods. It's already clean.

http://sekhon.berkeley.edu/matching/GerberGreenImai.html

vsbuffalo commented 9 years ago

@gmbecker, @ebressert, I and a few others were discussing the criteria for dataset inclusion. Lots of folks have suggested great datasets, but I think we should try to avoid making this a dump of all datasets out there, and instead more carefully curate its contents.

I think a good criterion is that every dataset should teach the user one specific R topic. For example, the built-in UCBAdmissions dataset teaches Simpson's paradox. Similarly, the datasets in this package should teach similar concepts, both statistical and "data science" oriented (data cleaning, visualization, etc.).
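As a concrete illustration of that UCBAdmissions example (it ships with base R as a 3-way Admit x Gender x Dept contingency table):

```r
# Pooled over departments, men appear to be admitted at a higher rate:
pooled <- margin.table(UCBAdmissions, c(1, 2))  # Admit x Gender
prop.table(pooled, margin = 2)                  # admission rate by gender

# But conditioning on department tells a different story: within most
# individual departments the gap disappears or reverses. That reversal
# under aggregation is Simpson's paradox.
prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]
```

The paradox arises because women applied disproportionately to the departments with the lowest admission rates, which is exactly the kind of twist a teaching dataset should have.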

Some ideas that come to mind:

benmarwick commented 9 years ago

For those interested in the history of science, there's the HistData package. It "provides a collection of small data sets that are interesting and important in the history of statistics and data visualization. The goal of the package is to make these available, both for instructional use and for historical research. Some of these present interesting challenges for graphics or analysis in R."

Includes:

- Arbuthnot: Arbuthnot's data on male and female birth ratios in London from 1629-1710
- Bowley: Bowley's data on values of British and Irish trade, 1855-1899
- Cavendish: Cavendish's 1798 determinations of the density of the earth
- ChestSizes: Quetelet's data on chest measurements of Scottish militiamen
- CushnyPeebles: Cushny-Peebles data: Soporific effects of scopolamine derivatives
- Dactyl: Edgeworth's counts of dactyls in Virgil's Aeneid
- DrinksWages: Elderton and Pearson's (1910) data on drinking and wages
- Fingerprints: Waite's data on Patterns in Fingerprints
- Galton: Galton's data on the heights of parents and their children
- GaltonFamilies: Galton's data on the heights of parents and their children, by family
- Guerry: Data from A.-M. Guerry, "Essay on the Moral Statistics of France"
- Jevons: W. Stanley Jevons' data on numerical discrimination
- Langren: van Langren's data on longitude distance between Toledo and Rome
- Macdonell: Macdonell's data on height and finger length of criminals, used by Gosset (1908)
- Michelson: Michelson's 1879 determinations of the velocity of light
- Minard: Data from Minard's famous graphic map of Napoleon's march on Moscow
- Nightingale: Florence Nightingale's data on deaths from various causes in the Crimean War
- OldMaps: Latitudes and Longitudes of 39 Points in 11 Old Maps
- PearsonLee: Pearson and Lee's 1896 data on the heights of parents and children classified by gender
- PolioTrials: Polio Field Trials Data on the Salk vaccine
- Prostitutes: Parent-Duchatelet's time-series data on the number of prostitutes in Paris
- Pyx: Trial of the Pyx
- Quarrels: Statistics of Deadly Quarrels
- Snow: John Snow's map and data on the 1854 London Cholera outbreak
- Wheat: Playfair's data on wages and the price of wheat
- Yeast: Student's (1906) Yeast Cell Counts
- ZeaMays: Darwin's Heights of Cross- and Self-fertilized Zea Mays Pairs

jennybc commented 9 years ago

To push a bit on @vsbuffalo's "one dataset:one topic" proposal … it can be very useful to have one dataset that allows you to teach multiple topics. There's a payoff from getting to know one specific dataset and then working with it for a while, i.e. multiple sessions within a two-day workshop or across several weeks of a course. Totally agree that it's useful to articulate what the dataset is good for. If that list has more than one topic, then we rejoice.

I think @noamross's idea of a Task View for data packages is a great one. It allows for this curation discussed above and provides a clickable annotated listing of good datasets for different purposes.

ebressert commented 9 years ago

I agree with everyone's points. Following on from @jennybc's and @vsbuffalo's comments, we may want to organize the data hierarchically. For example, messy and clean data serve very different purposes when being considered for ML.

messy data -- exploratory data analysis -- modeling
                    \                   \-- descriptive stats
                     \-- graphics

clean data -- exploratory data analysis -- modeling
                                        \-- descriptive stats

If we wanted to teach someone R and modeling, then having them munge through messy data would be beside the main objective. But if we wanted to teach someone data science, then munging and modeling messy data is perfect.

vsbuffalo commented 9 years ago

Agree with @jennybc — there should definitely be at least one specific topic per dataset. The more topics the merrier; each little dataset should have its own twists and turns!

@ebressert I really like this idea — but maybe instead of hierarchy we could do trait tags of data? That way the user could ask for combinations, like messy dataset + regression, or maybe machine learning + visualization.

mine-cetinkaya-rundel commented 9 years ago

@andeek and I are working on a package with data from Rotten Tomatoes and IMDb. The goal is to include both messy and cleaner versions of the data, as well as code for scraping and matching, so people can use it off the shelf or recreate it as movies get updated. Teaching multiple linear regression is one example of how the dataset can be used, but lots more can be done with it as well.

ebressert commented 9 years ago

The new organization for contemporary datasets fit for modern-day data science is now called Data Alloy and can be found here.