mlandry22 / rain-part2

How Much Did it Rain Pt 2 - Kaggle Competition

Getting Started #1

Open mlandry22 opened 8 years ago

mlandry22 commented 8 years ago

As I mentioned, I use these issues mainly to keep communication in a place with enhanced features (markdown: code, tables, etc.). You'll probably get emails whenever I update one, but in case things render poorly, the issue directly on GitHub will likely have better formatting.

mlandry22 commented 8 years ago

Getting my driver script ready. After thinking through versions that would make sense for the future, I'm just trying to get it done simply, using a CSV.

On one hand, I don't expect this one to come down to heavy hyperparameter tuning. On the other, all models will probably be mediocre, so maybe a big ensemble will do a good job. Either way, here are some initial thoughts on things we'd want to experiment with. Feel free to add to or update this list directly.

Surely more.

Also we can try predicting the outliers themselves. I don't have a lot of hope there, but it seems reasonable to give it a shot.
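
A minimal sketch of what a CSV-driven driver could look like in base R follows; the column names and the `trainOne` stub are hypothetical stand-ins, not the actual driver code.

```r
## Sketch of a CSV-driven experiment driver. Each CSV row is one model
## configuration; trainOne() is a stand-in for the real fitting call.
paramsFile <- tempfile(fileext = ".csv")
writeLines(c("id,trees,learn,depth",
             "1,200,0.05,5",
             "2,200,0.05,10"), paramsFile)

params <- read.csv(paramsFile)

## Stand-in: fit a model with this row's settings, return a score row.
trainOne <- function(row) {
  ## In the real driver this would fit e.g. gbm() with these settings.
  list(id = row$id, val = row$depth * 0.1)  ## dummy score
}

## Run every configuration and stack the result rows.
results <- do.call(rbind, lapply(seq_len(nrow(params)), function(i) {
  as.data.frame(trainOne(params[i, ]))
}))
results
```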

mlandry22 commented 8 years ago

Also, I haven't put in code to run these in parallel yet, but it's possible in R. Here is an example:

```r
library(doSNOW)

## Start 4 local worker processes; replace 4 with more if desired.
cl <- makeCluster(rep("localhost", 4))
registerDoSNOW(cl)

## Run 10 zones in parallel and row-bind the results.
a <- foreach(i = 1:10, .combine = rbind) %dopar%
  runZone_andOutputParallel(i, taskNum)

stopCluster(cl)
```

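
A roughly equivalent pattern using only the built-in parallel package is sketched below; `runOne` here is a dummy stand-in for the real per-zone work, not the actual function from the driver.

```r
## Same idea with only base R's parallel package: run independent tasks
## across local worker processes and row-bind the results.
library(parallel)

cl <- makeCluster(2)  ## two local workers; raise as desired

## Dummy stand-in for the per-zone work; returns one result row.
runOne <- function(i) data.frame(zone = i, score = i * 0.1)

## parLapply ships the function to the workers and collects a list.
results <- do.call(rbind, parLapply(cl, 1:10, runOne))
stopCluster(cl)

nrow(results)  ## 10
```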
JohnM-TX commented 8 years ago

Added my thoughts to the main comment here.

mlandry22 commented 8 years ago

Cool. Yes, some good additions. The initial test of my driver seems to be working the way I want it to, so I am adding the parallel piece and then will post that code. Nothing too extreme, but it should allow for using a CSV to drive iterations. It won't look much different from a grid search (e.g. caret), but I will spend the next several days connecting most/all of the options above so we can try various preprocessing measures. It won't be pretty, but hopefully it will be simple enough.

ThakurRajAnand commented 8 years ago

Driver thing sounds cool. I would like to share my experience of using Spearmint for parameter search. It works really well for Random Forest and Extra Trees. I can share how to use it in case any of you are interested.

mlandry22 commented 8 years ago

Oh, great. Yes, Bayesian optimization is often where people go. That's a great way to keep the computer busy, too.

I'm familiar with it from here: http://fastml.com/tuning-hyperparams-automatically-with-spearmint/ but have never used it. We discuss it quite often at H2O.

Related is this paper, which we've been looking over at H2O: http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf

Cool stuff, Thakur. It would be great if you could use scikit's GBM (or anything else in scikit) since it supports MAE.
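
For reference, the metric behind the MAE discussion is just mean absolute error, which is one line in R (the vectors below are made-up examples). R's gbm gets the same effect via `distribution = "laplace"`, which minimizes absolute error.

```r
## Mean absolute error: the reason MAE-capable learners matter here.
mae <- function(actual, predicted) mean(abs(actual - predicted))

## Made-up example: errors are 0.5, 0, and 1, so the MAE is 0.5.
mae(c(1, 2, 3), c(1.5, 2.0, 2.0))  ## 0.5
```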

mlandry22 commented 8 years ago

Also related: somebody interviewing at H2O used this, which allows you to drive scikit-learn with a config file, similar to what I'm trying: https://skll.readthedocs.org/en/latest/ It's surely worth trying these sorts of things. I'll probably stick with mine for the moment, since I'm running 6 CSV entries in parallel right now and it is looking good. But alternatives might prove very useful.

ThakurRajAnand commented 8 years ago

Actually you have to use 2 files by default for Spearmint. Attached are the example files from a Random Forest Spearmint run (uploaded with .txt extensions; their real names follow the arrows):

config.txt --- config.json
rf_spearmint.txt --- rf_spearmint.py


ThakurRajAnand commented 8 years ago

You need to have MongoDB installed to use Spearmint. I will put together a small step-by-step document on Spearmint and post it tomorrow.

mlandry22 commented 8 years ago

Thanks for sharing. This is great to see as I'm trying to create my own; it makes the idea come to life a lot more. So you could run Spearmint over more decisions than just the hyperparameters, since they're just being connected in your rf_spearmint.py: you just expose more things on both sides. That's effectively how mine works.

And I get the MongoDB part, too. I'm just writing out text files. Initially I had it continually updating the same CSV, but once I went parallel that was no longer a good idea. The characteristic I want from having it all together, rather than separated into different files, is being able to analyze the results efficiently. For me a CSV is easy. A Mongo query is surely easy too, but it's a fairly large overhead, unfortunately.
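
One way to keep the CSV idea safe under parallelism (a sketch, not the actual driver) is to give each worker its own results file and row-bind them at analysis time. File names and columns below are illustrative.

```r
## Sketch: each parallel worker writes its own CSV, and the files are
## combined afterward for analysis, avoiding concurrent writes.
outDir <- tempfile()
dir.create(outDir)

## Write one worker's result rows to a worker-specific file.
writeResult <- function(workerId, df) {
  write.csv(df, file.path(outDir, sprintf("results_%02d.csv", workerId)),
            row.names = FALSE)
}

## Pretend two workers each logged a run.
writeResult(1, data.frame(id = 1, val = 2.33))
writeResult(2, data.frame(id = 2, val = 2.31))

## Combine all per-worker files into one frame for easy querying.
files <- list.files(outDir, pattern = "^results_.*\\.csv$", full.names = TRUE)
allResults <- do.call(rbind, lapply(files, read.csv))
allResults
```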

As I get closer to the final vision of my simple thing, I'll talk more about why I want what I want, as far as how that might differ from Spearmint.

mlandry22 commented 8 years ago

My first run's results:

```
id      val train model trees learn depth minObs rowSample colSample distribution
 1 2.331730    NA r-gbm   200  0.05     5     10       0.7        NA      laplace
 2 2.319348    NA r-gbm   200  0.05    10     10       0.7        NA      laplace
 3 2.336913    NA r-gbm   200  0.05     5      1       0.7        NA      laplace
 4 2.318312    NA r-gbm   200  0.05    10      1       0.7        NA      laplace
 5 2.308039    NA r-gbm   200  0.05    15     10       0.7        NA      laplace
 6 2.305827    NA r-gbm   200  0.05    15      5       0.7        NA      laplace
```
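
Once rows like these are in a data frame they are easy to query; for example, sorting by val shows the depth-15 runs scoring best (values re-typed from the table above, keeping only the columns that vary).

```r
## The first run's results, re-entered as a data frame so they can be sorted.
res <- data.frame(
  id    = 1:6,
  val   = c(2.331730, 2.319348, 2.336913, 2.318312, 2.308039, 2.305827),
  depth = c(5, 10, 5, 10, 15, 15)
)

## Best (lowest MAE) first: ids 6 and 5, both with depth 15.
res[order(res$val), ]
```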