mlandry22 / rain-part2

How Much Did it Rain Pt 2 - Kaggle Competition

Getting Started #1

Open mlandry22 opened 8 years ago

mlandry22 commented 8 years ago

As I mentioned, I use these issues mainly to keep communication in a place with enhanced features (markdown: code, tables, etc.). You'll probably get emails whenever I update one, but in case things render poorly, the issue directly on GitHub will likely have better formatting.

mlandry22 commented 8 years ago

Getting my driver script ready. After thinking through versions that would make sense for the future, I'm just trying to get it done simply, using a CSV.

On one hand, I don't expect this one to come down to heavy hyperparameter tuning. On the other, all models will probably be mediocre, so maybe a big ensemble will do a good job. Either way, here are some initial thoughts on things we'd want to experiment with. Feel free to add to or update this list directly.

Surely more.

Also we can try predicting the outliers themselves. I don't have a lot of hope there, but it seems reasonable to give it a shot.
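
A minimal sketch of what a CSV-driven driver could look like in base R follows; the column names and the `trainOne` stub are hypothetical stand-ins, not the actual driver code.

```r
## Sketch of a CSV-driven experiment driver. Each CSV row is one model
## configuration; trainOne() is a stand-in for the real fitting call.
paramsFile <- tempfile(fileext = ".csv")
writeLines(c("id,trees,learn,depth",
             "1,200,0.05,5",
             "2,200,0.05,10"), paramsFile)

params <- read.csv(paramsFile)

## Stand-in: fit a model with this row's settings, return a score row.
trainOne <- function(row) {
  ## In the real driver this would fit e.g. gbm() with these settings.
  list(id = row$id, val = row$depth * 0.1)  ## dummy score
}

## Run every configuration and stack the result rows.
results <- do.call(rbind, lapply(seq_len(nrow(params)), function(i) {
  as.data.frame(trainOne(params[i, ]))
}))
results
```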

mlandry22 commented 8 years ago

Also, I haven't put in code to run these in parallel yet, but it's possible in R. Here is an example:

```r
library(doSNOW)

## Start 4 local worker processes; replace 4 with more if desired.
cl <- makeCluster(rep("localhost", 4))
registerDoSNOW(cl)

## Run 10 zones in parallel and row-bind the results.
a <- foreach(i = 1:10, .combine = rbind) %dopar%
  runZone_andOutputParallel(i, taskNum)

stopCluster(cl)
```

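
A roughly equivalent pattern using only the built-in parallel package is sketched below; `runOne` here is a dummy stand-in for the real per-zone work, not the actual function from the driver.

```r
## Same idea with only base R's parallel package: run independent tasks
## across local worker processes and row-bind the results.
library(parallel)

cl <- makeCluster(2)  ## two local workers; raise as desired

## Dummy stand-in for the per-zone work; returns one result row.
runOne <- function(i) data.frame(zone = i, score = i * 0.1)

## parLapply ships the function to the workers and collects a list.
results <- do.call(rbind, parLapply(cl, 1:10, runOne))
stopCluster(cl)

nrow(results)  ## 10
```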
JohnM-TX commented 8 years ago

Added my thoughts to the main comment here.

mlandry22 commented 8 years ago

Cool. Yes, some good additions. The initial test of my driver seems to be working the way I want it to, so I am adding the parallel piece and then will post that code. Nothing too extreme, but it should allow for using a CSV to drive iterations. It won't look much different from a grid search (e.g. caret), but I will spend the next several days connecting most/all of the options above so we can try various preprocessing measures. It won't be pretty, but hopefully it will be simple enough.

ThakurRajAnand commented 8 years ago

Driver thing sounds cool. I would like to share my experience of using Spearmint for parameter search. It works really well for Random Forest and Extra Trees. I can share how to use it in case any of you are interested.

mlandry22 commented 8 years ago

Oh, great. Yes, Bayesian optimization is often where people go. That's a great way to keep the computer busy, too.

I'm familiar with it from here: http://fastml.com/tuning-hyperparams-automatically-with-spearmint/ but have never used it. We discuss it quite often at H2O.

Related is this paper, which we've been looking over at H2O: http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf

Cool stuff, Thakur. It would be great if you could use scikit's GBM (or anything else in scikit) since it supports MAE.
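
For reference, the metric behind the MAE discussion is just mean absolute error, which is one line in R (the vectors below are made-up examples). R's gbm gets the same effect via `distribution = "laplace"`, which minimizes absolute error.

```r
## Mean absolute error: the reason MAE-capable learners matter here.
mae <- function(actual, predicted) mean(abs(actual - predicted))

## Made-up example: errors are 0.5, 0, and 1, so the MAE is 0.5.
mae(c(1, 2, 3), c(1.5, 2.0, 2.0))  ## 0.5
```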

mlandry22 commented 8 years ago

Also related: somebody interviewing at H2O used this, which allows you to drive scikit-learn with a config file, similar to what I'm trying: https://skll.readthedocs.org/en/latest/ It's surely worth trying these sorts of things. I'll probably stick with mine for the moment, since I'm running 6 CSV entries in parallel right now and it is looking good. But alternatives might prove very useful.

ThakurRajAnand commented 8 years ago

Actually you have to use 2 files by default for Spearmint. Attached are the example files from a Random Forest Spearmint run (uploaded with .txt extensions; their real names follow the arrows):

config.txt --- config.json
rf_spearmint.txt --- rf_spearmint.py


ThakurRajAnand commented 8 years ago

You need to have MongoDB installed to use Spearmint. I will put together a small step-by-step document on Spearmint and post it tomorrow.

mlandry22 commented 8 years ago

Thanks for sharing. This is great to see as I'm trying to create my own; it makes the idea come to life a lot more. So you could run Spearmint over more decisions than just the hyperparameters, since they're just being connected in your rf_spearmint.py: you just expose more things on both sides. That's effectively how mine works.

And I get the MongoDB part, too. I'm just writing out text files. Initially I had it continually updating the same CSV, but once I went parallel that was no longer a good idea. The characteristic I want from having it all together, rather than separated into different files, is being able to analyze the results efficiently. For me a CSV is easy. A Mongo query is surely easy too, but it's a fairly large overhead, unfortunately.
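
One way to keep the CSV idea safe under parallelism (a sketch, not the actual driver) is to give each worker its own results file and row-bind them at analysis time. File names and columns below are illustrative.

```r
## Sketch: each parallel worker writes its own CSV, and the files are
## combined afterward for analysis, avoiding concurrent writes.
outDir <- tempfile()
dir.create(outDir)

## Write one worker's result rows to a worker-specific file.
writeResult <- function(workerId, df) {
  write.csv(df, file.path(outDir, sprintf("results_%02d.csv", workerId)),
            row.names = FALSE)
}

## Pretend two workers each logged a run.
writeResult(1, data.frame(id = 1, val = 2.33))
writeResult(2, data.frame(id = 2, val = 2.31))

## Combine all per-worker files into one frame for easy querying.
files <- list.files(outDir, pattern = "^results_.*\\.csv$", full.names = TRUE)
allResults <- do.call(rbind, lapply(files, read.csv))
allResults
```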

As I get closer to the final vision of my simple thing, I'll talk more about why I want what I want, as far as how that might differ from Spearmint.

mlandry22 commented 8 years ago

My first run's results:

```
id      val train model trees learn depth minObs rowSample colSample distribution
 1 2.331730    NA r-gbm   200  0.05     5     10       0.7        NA      laplace
 2 2.319348    NA r-gbm   200  0.05    10     10       0.7        NA      laplace
 3 2.336913    NA r-gbm   200  0.05     5      1       0.7        NA      laplace
 4 2.318312    NA r-gbm   200  0.05    10      1       0.7        NA      laplace
 5 2.308039    NA r-gbm   200  0.05    15     10       0.7        NA      laplace
 6 2.305827    NA r-gbm   200  0.05    15      5       0.7        NA      laplace
```
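
Once rows like these are in a data frame they are easy to query; for example, sorting by val shows the depth-15 runs scoring best (values re-typed from the table above, keeping only the columns that vary).

```r
## The first run's results, re-entered as a data frame so they can be sorted.
res <- data.frame(
  id    = 1:6,
  val   = c(2.331730, 2.319348, 2.336913, 2.318312, 2.308039, 2.305827),
  depth = c(5, 10, 5, 10, 15, 15)
)

## Best (lowest MAE) first: ids 6 and 5, both with depth 15.
res[order(res$val), ]
```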