Closed by karthik 8 years ago
Sure to embarrass all involved =) looking forward.
The Netflix Challenge? ;)
How about this for a potential challenge & dataset: https://github.com/ropensci/unconf/issues/13 ?
Something around long vs. wide data?
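For anyone unfamiliar with the distinction, here is a minimal sketch of the two layouts in plain Python. The records, column names, and helper functions (`long_to_wide`, `wide_to_long`) are all made up for illustration and are not a proposed challenge dataset.

```python
from collections import defaultdict

# Hypothetical "long" records: one row per (country, year, variable) observation.
long_rows = [
    {"country": "A", "year": 2014, "variable": "population", "value": 10},
    {"country": "A", "year": 2014, "variable": "gdp", "value": 5},
    {"country": "B", "year": 2014, "variable": "population", "value": 20},
    {"country": "B", "year": 2014, "variable": "gdp", "value": 8},
]


def long_to_wide(rows, id_keys, name_key, value_key):
    """Spread name/value pairs into one column per distinct name."""
    wide = defaultdict(dict)
    for row in rows:
        key = tuple(row[k] for k in id_keys)
        wide[key][row[name_key]] = row[value_key]
    # One output row per unique combination of id columns.
    return [dict(zip(id_keys, key), **cols) for key, cols in wide.items()]


def wide_to_long(rows, id_keys, name_key, value_key):
    """Gather every non-id column back into name/value pairs."""
    return [
        {**{k: row[k] for k in id_keys}, name_key: col, value_key: val}
        for row in rows
        for col, val in row.items()
        if col not in id_keys
    ]


wide_rows = long_to_wide(long_rows, ["country", "year"], "variable", "value")
round_trip = wide_to_long(wide_rows, ["country", "year"], "variable", "value")
```

The wide form has one row per country and year with `population` and `gdp` as columns; gathering it again recovers the original long rows. A challenge along these lines would presumably use tidyr on the R side and pandas on the Python side rather than hand-rolled helpers like these.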
How about all the state and federal data for phosphorus (commonly the limiting nutrient for harmful algal blooms)?
Some interesting datasets, questions TBD:
Excited to see more R vs. Python competitions popping up!
I've been kicking around the idea of making a 20 questions game bot powered by a decision tree. Acquiring the data set is the tough one there.
Some thoughts:
I suppose "Data Idol" was too on the nose? ;)
We didn't actually think of that one! Maybe that will be the spin off.
Since it will be hosted at GitHub, how about something with the GitHub Archive data, https://www.githubarchive.org (although maybe too close to the GitHub Data Challenge)? One idea might be to correlate repo title/keywords with number of contributors. Do different domains/topics have more contributors? More contributors from different geographic locations? Diversity, referring back to #13? I'm not sure the archive has information on what programming language the repo is in, but domain/topic could also be correlated with language, and maybe filtered for domains of interest like 'ecology', 'genomics', 'archaeology', 'astronomy', 'economics'.
There's also this https://github.com/caesar0301/awesome-public-datasets
Arfon and I talked about using GitHub data - problem is, it's already pretty clean. Now, if it was stuff from CVS repositories, we'd have a game...
I'll propose this United Nations voting data, containing all votes from the history of the United Nations General Assembly (since 1946). I've used this to teach dplyr/tidyr to students before and it's gone very well.
Pros:
Cons:
Couple notes
I'm not that great at stat modelling either :)
I like @dgrtwo's idea
At the unconf we're going to have a friendly data challenge (previously discarded names include Iron Data and Data With the Stars) between Hadley Wickham (@hadley) and Wes McKinney (@wesm). This session will be moderated by GitHub's own Arfon Smith (@arfon) and broadcast at a later time.
Actual time TBD: Either Thursday night (03/26) or sometime on Friday (03/27).
In this issue we'd like to solicit ideas for potential datasets and challenges.
cc @gvwilson @tracykteal