ropensci / unconf15

rOpenSci's San Francisco hackathon/unconf 2015
http://unconf.ropensci.org

So You Think You Can Data #17

Closed karthik closed 8 years ago

karthik commented 9 years ago

At the unconf we're going to have a friendly data challenge (previously discarded names include Iron Data and Data With the Stars) between Hadley Wickham (@hadley) and Wes McKinney (@wesm). This session will be moderated by GitHub's own Arfon Smith (@arfon) and broadcast at a later time.

Actual time TBD: either Thursday night (03/26) or sometime on Friday (03/27).

In this issue we'd like to solicit ideas for potential datasets and challenges.

cc @gvwilson @tracykteal

wesm commented 9 years ago

Sure to embarrass all involved =) looking forward.

Ironholds commented 9 years ago

The Netflix Challenge? ;)

benmarwick commented 9 years ago

How about this for a potential challenge & dataset: https://github.com/ropensci/unconf/issues/13 ?

brianckeegan commented 9 years ago

Something around long vs. wide data?

jordansread commented 9 years ago

How about all the state and federal data for phosphorus (commonly the limiting nutrient for harmful algal blooms)?

CamDavidsonPilon commented 9 years ago

Some interesting datasets, questions TBD:

rhiever commented 9 years ago

Excited to see more R vs. Python competitions popping up!

I've been kicking around the idea of making a 20 questions game bot powered by a decision tree. Acquiring the data set is the tough one there.

hadley commented 9 years ago

Some thoughts:

jcheng5 commented 9 years ago

I suppose "Data Idol" was too on the nose? ;)

tracykteal commented 9 years ago

We didn't actually think of that one! Maybe that will be the spin off.

tracykteal commented 9 years ago

Since it will be hosted at GitHub, how about something with the GitHub Archive data https://www.githubarchive.org (although maybe too close to the GitHub Data Challenge)? One idea might be to correlate repo title/keywords with number of contributors. Do different domains/topics have more contributors? More contributors from different geographic locations? Diversity, referring back to #13? I'm not sure the archive has information on what programming language the repo is in, but domain/topic could also be correlated with language, and maybe filtered for domains of interest like 'ecology', 'genomics', 'archaeology', 'astronomy', 'economics'.

tracykteal commented 9 years ago

There's also this https://github.com/caesar0301/awesome-public-datasets

gvwilson commented 9 years ago

Arfon and I talked about using GitHub data - problem is, it's already pretty clean. Now, if it was stuff from CVS repositories, we'd have a game...

dgrtwo commented 9 years ago

I'll propose this United Nations voting data, containing all votes from the history of the United Nations General Assembly (since 1946). I've used this to teach dplyr/tidyr to students before and it's gone very well.

Pros:

Cons:

wesm commented 9 years ago

Couple notes

hadley commented 9 years ago

I'm not that great at stat modelling either :)

I like @dgrtwo's idea