A repo of interesting datasets

karthik / mozfest-data-lessons

Repo for the data carpentry session at Mozfest

karthik.github.io/mozfest-data-lessons

A repo of interesting datasets #1

Open karthik opened 10 years ago

karthik commented 10 years ago

Although this isn't quite a lesson, it's often a challenge to teach with interesting data. The widely used iris dataset (packaged with R) and diamonds (packaged with ggplot2) aren't that interesting. @jennybc brought the Gapminder dataset into the SWC material, and that has been fantastic.

So this might be a bit meta, but some folks could spend a bit of time compiling datasets that are both interesting and fun to munge in {language-of-choice}.

jennybc commented 10 years ago

I'm about to turn that Gapminder excerpt into a proper R package and document the (unholy) cleaning it went through. My grad course is providing serious deadlines for both the cleaning and package-ization, so I have no way to really back out or delay indefinitely on this.

I think it would be its own small data package (?).

Or are you proposing a meta-package holding multiple datasets?

karthik commented 10 years ago

> I'm about to turn that Gapminder excerpt into a proper R package and document the (unholy) cleaning it went through.

That would be fantastic.

> Or are you proposing a meta-package holding multiple datasets?

I would love for gapminder to be its own data package. In this situation I thought we could compile and clearly document a bunch of different but useful datasets that one could simply install and work with. Many courses need this. One example is John Myles White's RDatasets.jl (standard R datasets as a Julia package: https://github.com/johnmyleswhite/RDatasets.jl).

Jenny: If you have ideas for datasets please suggest here.

karthik commented 10 years ago

Note: We can easily add the gapminder data package as a dependency to this one.

jennybc commented 10 years ago

This is not (yet) an R data package, but I really like this Lord of the Rings data:

https://github.com/jennybc/lotr

originally from here:

http://www-958.ibm.com/software/data/cognos/manyeyes/datasets/words-spoken-by-character-race-scene/versions/1.txt

karthik commented 10 years ago

Awesome! The lotr dataset looks great.

The following came from Kyle Cranmer:

Here is a nice list: http://rs.io/100-interesting-data-sets-for-statistics/

CERN is about to release some open data related to the LHC, but the portal is not quite ready: http://opendata.cern.ch

All the best,

Kyle

jennybc commented 10 years ago

The basic gapminder R package now exists:

https://github.com/jennybc/gapminder

I still want to turn the cleaning code into compiled notebooks, but the commented code is already there, along with all the intermediates.

Here is where my grad class STAT 545 has been collecting links to interesting datasets or lists thereof:

https://github.com/STAT545-UBC/Discussion/issues/39