ropensci / ozunconf17

Website for 2017 rOpenSci Ozunconf
http://ozunconf17.ropensci.org/

predictive modelling competitions #22

Open goldingn opened 7 years ago

goldingn commented 7 years ago

Kaggle is a company that runs predictive modelling competitions on behalf of organisations. Competitors are given a dataset with covariates and response variables which they can use to train a model; they then use the model to make predictions for a new dataset (for which they only have the predictors, not the response variables) and submit these predictions to a web platform. The web platform compares the predictions with the withheld data and posts a score on a leaderboard. At the end of the competition, the winner gets a prize and the organisation gets the model and the code that produced it. It's pretty cool.

Predictive modelling competitions are also really useful to organisations and research communities that don't have the funds to use Kaggle or similar commercial platforms, e.g. for resolving disputes about methodology (something I want to do with zoon), or for education (I have run something similar in an undergrad practical session). An R package that makes it easy to set up simple, free, self-hosted competitions like this could be really handy.

The main technical requirement is setting up a server (just an R session running on a web-connected computer) to host the hidden validation dataset, calculate the evaluation score for each new submission, and serve a leaderboard on the web. The package could use plumber (or jug, OpenCPU, or something similar) to create the submission API, a shiny app to display the leaderboard and host the training data for download, and streamlined functions that let users submit predictions.
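
As a rough sketch of the scoring step only (no web server), assuming base R and numeric predictions: the helper names `rmse()`, `score_submission()` and `update_leaderboard()` are hypothetical and do not come from any existing package, and the weights are made-up toy values.

```r
# Hypothetical helpers; none of these names come from an existing package.

# Root mean squared error between predictions and the withheld truth.
rmse <- function(predicted, truth) {
  sqrt(mean((predicted - truth)^2))
}

# Score one submission against the hidden validation labels.
score_submission <- function(predicted, truth, metric = rmse) {
  stopifnot(is.numeric(predicted), length(predicted) == length(truth))
  metric(predicted, truth)
}

# Append a scored submission and return the leaderboard sorted best-first
# (lower is better for RMSE).
update_leaderboard <- function(leaderboard, user, score) {
  leaderboard <- rbind(leaderboard,
                       data.frame(user = user, score = score,
                                  stringsAsFactors = FALSE))
  leaderboard[order(leaderboard$score), , drop = FALSE]
}

# Toy values standing in for the secret guinea pig weights.
secret_weights <- c(900, 1100, 1000, 950)
leaderboard <- data.frame(user = character(0), score = numeric(0),
                          stringsAsFactors = FALSE)

leaderboard <- update_leaderboard(
  leaderboard, "nick",
  score_submission(c(910, 1090, 1000, 950), secret_weights))
leaderboard <- update_leaderboard(
  leaderboard, "alice",
  score_submission(c(900, 1100, 1000, 950), secret_weights))
print(leaderboard)
```

In a real deployment the server would call something like `score_submission()` from inside the submission endpoint, so the secret labels never leave the server.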

So the organiser might do something like:

run_competition(title = "predict the weights of these guinea pigs",
                description = "build a model that predicts the weights of these loveable balls of
                               fluff from some metadata about them",
                training_data = "train_guinea_pig_features_weights.Rdata",
                test_data = "test_guinea_pig_features.rds",
                secret_test_labels = "test_guinea_pig_weights.csv",
                metric = "RMSE")
#> Your competition and leaderboard is live and hosted at:
#>   http://128.250.4.119/8000

Competitors could also use the package to submit their predictions to the leaderboard:

submit_prediction(predicted_weights,
                  website = "http://128.250.4.119/8000",
                  user = "nick",
                  password = "averysecurepassword1")
goldingn commented 7 years ago

I'm imagining users would be registered manually, since setting up a safe and secure automatic registration system would be a whole other can of worms.

dicook commented 7 years ago

Yihui wrote me a little system almost 10 years ago, before Kaggle in Class was available. It worked beautifully in class. I don't know that I could find the code again. I think the hard part was hiding the true solution, so anyone with a bit of hacking skill could cheat. It seems possible with a shiny app, and doesn't seem too difficult to code.

goldingn commented 7 years ago

Oh cool, that code would be helpful!

Yeah, I thought about ways of doing this without a web service. The only other option I can think of (that would effectively hide the data) is distributing compiled code. And that doesn't sound like a good idea!

dicook commented 7 years ago

I think the simplest approach is to compare predictions with the true values, using one of a collection of provided metrics. But you'd want to be able to split the test data into public and private subsets, so that only performance on the public subset is reported until the end of the competition.
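
A minimal sketch of that public/private split in base R, under the assumption that the split is a simple random one fixed by the organiser. The names `split_public_private()`, `metrics` and `score_split()` are hypothetical, and the two metrics are only examples of what a collection might contain.

```r
# Hypothetical helpers; not part of any existing package.

# Randomly flag a fraction of the test rows as "public".
split_public_private <- function(n, public_fraction = 0.5) {
  is_public <- rep(FALSE, n)
  is_public[sample.int(n, size = floor(n * public_fraction))] <- TRUE
  is_public
}

# A small collection of metrics the organiser could choose from.
metrics <- list(
  rmse = function(predicted, truth) sqrt(mean((predicted - truth)^2)),
  mae  = function(predicted, truth) mean(abs(predicted - truth))
)

# Score a submission separately on the public and private rows.
score_split <- function(predicted, truth, is_public, metric = "rmse") {
  f <- metrics[[metric]]
  stopifnot(!is.null(f),
            length(predicted) == length(truth),
            length(truth) == length(is_public))
  c(public  = f(predicted[is_public],  truth[is_public]),
    private = f(predicted[!is_public], truth[!is_public]))
}

# Toy values standing in for the withheld test labels.
truth <- c(900, 1100, 1000, 950, 1050, 980)

set.seed(1)  # fixed seed so the organiser can reproduce the split
is_public <- split_public_private(length(truth))

scores <- score_split(truth + 10, truth, is_public, metric = "mae")
scores[["public"]]   # shown on the leaderboard during the competition
scores[["private"]]  # revealed only when the competition closes
```

Keeping `is_public` on the server alongside the true values means competitors cannot tell which rows drive the visible score.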