mlandry22 opened this issue 8 years ago
I'm currently struggling a bit with the difference in what my GBM models are doing vs the median compared to how the public leaderboard shows those same stats. My GBMs take a median of about 2.6 down to 0.2. The public leaderboard takes a median of 24.17106 down to 23.73444.
Yes, many outliers were removed, but the basis movement should be fairly stable. Removing outliers does take the count of records down, so moving 2.4 units will get diluted when we just add a bunch of immovable points. Still, I don't quite get why 2.4 units on training becomes 0.44 on the public board.
Ha, it's because I trained on the log of the target. Yay for diagnostics, even if it takes me three hours to understand.
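For anyone hitting the same diagnostic, here's a minimal sketch (hypothetical numbers, and Python/numpy rather than the R used in the thread) of why an error measured on the log of the target looks tiny compared to the same error on the raw scale:

```python
import numpy as np

# Illustrative only: training on log1p(target) makes errors look small
# until predictions are mapped back with expm1.
y = np.array([0.5, 1.0, 5.0, 24.0, 100.0])   # made-up rain amounts
pred_log = np.log1p(y) + 0.1                  # a model off by 0.1 in log space

# Error measured in log space looks tiny...
mae_log = np.mean(np.abs(pred_log - np.log1p(y)))

# ...but back on the original scale the same error is much larger,
# especially for large targets.
pred = np.expm1(pred_log)
mae_raw = np.mean(np.abs(pred - y))
```

A constant 0.1 offset in log space inflates into an error proportional to the target on the raw scale, which is exactly the kind of mismatch between training diagnostics and leaderboard numbers described above.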
Tried some probability matching without success. The result basically "downgraded" the values of ~900 in the blended model (which I think came from the RF) to the 30-50 range or so. Since it did worse, maybe that means the RF model or blend had successfully identified some of the outliers?
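For reference, probability matching as described can be sketched like this: rank the blended predictions and replace each one with the value at the same quantile of a reference distribution (e.g. the training targets). The function and data below are hypothetical illustrations, not the code actually used:

```python
import numpy as np

def probability_match(pred, reference):
    """Map `pred` onto the empirical distribution of `reference` by rank:
    the k-th smallest prediction receives the value at the same quantile
    of the reference sample. Hypothetical helper for illustration."""
    ranks = np.argsort(np.argsort(pred))        # 0-based rank of each prediction
    quantiles = (ranks + 0.5) / len(pred)
    return np.quantile(reference, quantiles)

# A huge outlier prediction keeps its rank but gets pulled down
# toward the reference distribution's upper values.
blend = np.array([900.0, 2.0, 0.5, 30.0])
train_targets = np.array([0.3, 0.8, 1.5, 45.0])
matched = probability_match(blend, train_targets)
```

This preserves the ordering of the predictions while forcing their marginal distribution to match the reference, which is why the ~900 values get "downgraded" hard.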
To Mark's earlier point on our score gap, it doesn't seem like it should be so hard to get a little 0.02 bump.
This is the wrong time to try new things out, but with the Marshall-Palmer being deterministic, it seems possible to do this, so I am going to try:
It's easy to do the straightforward one, at least. Perhaps that will help with some diversity.
Might as well try it. I'm going to try one more thing with probability matching and see if there are any gains.
So with 6 more submissions remaining, any priority on how to use them?
I kept screwing up my folds just in my own code, but I did finally get those out. I want to run both of your code as well, but I'm really making a lot of mistakes getting through this. I haven't done the MP full-data thing yet.
I should also run John's features through my R code. If I'm reading the XGBoost code right, the only use of MAE is in printing the output: feval is the final evaluation, the part that prints and controls early stopping. objective = "reg:linear" is the part that calculates the gradient the trees actually fit, and that's squared error, not absolute loss (which isn't trivial to implement either).
R's gbm does compute the gradient of MAE, so with the same features I should be able to get the R model closer to what XGBoost does. It will lose some accuracy from not column-sampling on each tree, but it seems I could get closer than what we have now.
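To sketch the point above: making gradient boosting chase MAE requires a (sub)gradient of absolute loss, whose zero second derivative is what makes it awkward for XGBoost-style solvers. A common smooth stand-in is the pseudo-Huber loss, whose gradient saturates toward sign(residual). This is an illustrative custom-objective shape, not the code from the competition:

```python
import numpy as np

def pseudo_huber_obj(preds, labels, delta=1.0):
    """Smooth approximation to MAE for use as a custom boosting objective
    (hypothetical sketch; reg:linear's gradient is just the residual).
    Returns per-example (gradient, hessian) as XGBoost-style custom
    objectives expect."""
    d = preds - labels
    scale = np.sqrt(1.0 + (d / delta) ** 2)
    grad = d / scale          # ~= d near zero, saturates to sign(d) for large |d|
    hess = 1.0 / scale ** 3   # positive everywhere, unlike true MAE
    return grad, hess

g, h = pseudo_huber_obj(np.array([0.0, 10.0]), np.array([0.0, 0.0]))
```

Near zero this behaves like squared error; for large residuals the gradient is bounded near ±1, so outliers stop dominating the fit, which is the property MAE training buys you.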
Am running my best R gbm settings against John's data (as per the one uploaded here, at least) right now. John, what sort of range should I expect, for MAE? It seems the median on this is about 2.1703 on the training set. If what I have does well, I'd like to submit it to the leaderboard in the morning. If one of these four things happens, I won't need the 5th-to-last submission (our second "today")
So if you haven't heard from me by about noon Pacific (deadline minus 4 hours), assume I'm not getting anything in and somebody else can take that second submission.
Finished that, but the results are a little suspicious. Low tree counts, around 100-400, perform best; performance worsens up to 1000 trees, then improves again until stopping at 1300. But 1300 is still worse than the simpler models on the validation set. That's not very reassuring.

Holdout scores are around 22.95-22.98, against a median baseline of 23.34. If that drop from the median is mirrored on the test set, it would land around the range of the best public script, so not great. Worse, the test set as I calculated it looks quite a bit different from the train set. It's possible that's just because the full-NA rows are removed, in which case there's nothing to worry about. But there doesn't seem to be a compelling reason to submit this today. I'll package it up and email it, in case somebody can submit it if we have nothing better at the deadline.

Mark
Just now seeing this. I've recently been seeing MAEs from 22.1 to 22.8 on the validation set, which has the Ref=NAs removed. The range has not corresponded to the test set closely enough for me to use it reliably. In other words, I would change something, see a drop of 0.1, submit, and then see a rise of 0.1 or more.
If you're still able to email something I can move it forward.
Sent it, but it's outside that range and it just doesn't seem right that it's more accurate on the soft side. Strange curve I can't remember seeing before. I blended the high end and the low end of trees.
So again, if anybody has anything else, use it. Else at least we have one for today. Will be working on this offline tonight hopefully.
It 'only' got 23.77. I noticed the min value is 0.254, which is higher than most of our models, and the max is ~25 compared to 35 for the xgboost. There's also a spike between 0.64 and 0.9 with about half of all values in that range. I don't know if that's good or not, just more than I would expect.
Not much at all, but it got us another 0.004 and 2 places. I started with xgbens-11-04, which upon inspection had a dip in the probability-density curve that probably shouldn't be there. So I fit it to a gamma distribution, except at the tail end, where the gamma started really clipping, I kept the original values. Then I blended it 1:2 with our previous best submission, 90-04-03-03 (which probably uses the same xgb).
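A rough numpy-only sketch of that gamma re-fit, assuming a moment-matched fit and rank-based remapping (the function and parameter names are hypothetical, and empirical gamma quantiles stand in for a proper inverse CDF):

```python
import numpy as np

def gamma_remap(pred, tail_q=0.99, n_ref=100_000, seed=0):
    """Remap predictions onto a moment-matched gamma distribution by rank,
    keeping the original values in the upper tail where the gamma would
    clip them. Sketch of the idea described above, not the actual code."""
    pred = np.asarray(pred, dtype=float)
    m, v = pred.mean(), pred.var()
    shape, scale = m * m / v, v / m            # method-of-moments gamma fit
    ref = np.random.default_rng(seed).gamma(shape, scale, n_ref)
    ranks = (np.argsort(np.argsort(pred)) + 0.5) / len(pred)
    remapped = np.quantile(ref, ranks)
    cutoff = np.quantile(pred, tail_q)
    return np.where(pred < cutoff, remapped, pred)

# The 1:2 blend with the previous best would then be simply:
#   final = (gamma_remap(xgb_ens) + 2.0 * previous_best) / 3.0
```

Everything below the chosen tail quantile gets smoothed onto the gamma shape (removing dips in the density), while the tail keeps the original values, matching the "kept the original values where the gamma started clipping" description.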
Yes, it was intentionally conservative, both on the high and low end. Locally, using a floor of 0.254 beat out 0.01 on both models and the blend I used, so I opted for the conservative model, especially being worried about the volatility of the model from the odd complexity shape.
And about the 0.64 - 0.9, not sure why that would be. I noticed that your XGBoost submission has the highest of just about everything on a summary: mean, median, quartiles. MAE isn't something you can cleanly shift around to reach an optimal number, so it likely isn't the case that if we just shifted my and Thakur's models first, we'd see better performance. And there aren't enough submissions left to tinker in that way either. But chances are good the XGBoost values are decent ones; I just can't quite think of a way to take advantage of that this late.
Very interesting. I noticed the gamma-distribution idea, and I really ought to see what H2O can do there, since it can fit a gamma distribution directly.
Since the model with the highest complexity was starting to improve its accuracy, I am adding 300 more trees to see what happens. If it gets better, at least I'd have something new to try. I might also pair that with trying to aim for the most common values near each prediction. As usual, if anybody else has anything better, go ahead and take the submission.
Nothing else here.
Well the model was better. But...it got in 11 seconds too late. So it's 23.76221, which isn't too bad, but a costly submission deadline mistake.
Costly, costly, as this is the deadline. So our final best answer is all we have left. Thoughts?
No great ideas here. I suppose it would either be another ensemble for maybe a 0.005 gain or something bolder (I don't know what) that has a shot. Maybe run a long deep model on h2o gbm and dilute it slightly with our current best?
Well, I have an interesting gamma model. It appears to test pretty well locally. And something about it seems correct, but that's probably the part of me that hopes it's our 10-spot jump, rather than the logical side that knows something done on the last day is unlikely to be too useful.
Nonetheless, I think that I am going to put a heavy emphasis on this one, just in case.
I'll line up the distributions, and a handful of points and see if I can make heads or tails of it. Starting that now. Plan is a 5-way blend, and we'll get to see the score before we choose it blindly, of course.
Here are the six summaries:
```r
> summary(eAll[,3:8])
      XGB                gamma               DL                RF              rGbm1            rGbm2
 Min.   :   0.1718   Min.   : 0.04505   Min.   : 0.2383   Min.   : 0.3600   Min.   : 0.010   Min.   : 0.2540
 1st Qu.:   1.0682   1st Qu.: 0.66955   1st Qu.: 0.6901   1st Qu.: 0.8407   1st Qu.: 0.813   1st Qu.: 0.7748
 Median :   1.8392   Median : 1.09184   Median : 0.8700   Median : 1.1379   Median : 1.436   Median : 0.8460
 Mean   :   2.1688   Mean   : 1.81431   Mean   : 1.4045   Mean   : 1.5196   Mean   : 1.600   Mean   : 1.3383
 3rd Qu.:   2.4303   3rd Qu.: 2.20905   3rd Qu.: 1.2667   3rd Qu.: 1.4932   3rd Qu.: 1.706   3rd Qu.: 1.2100
 Max.   :1100.0000   Max.   :39.41440   Max.   :66.3600   Max.   :33.5551   Max.   :29.287   Max.   :29.6695
```
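For what it's worth, the blend itself is just a weighted average over model columns like these; a toy sketch (the weights and values here are illustrative, not the ones actually used):

```python
import numpy as np

# Illustrative weighted blend over five of the model columns
# (toy predictions; the real weights are not in the thread).
preds = {
    "XGB":   np.array([2.2, 1.8]),
    "gamma": np.array([1.8, 1.1]),
    "DL":    np.array([1.4, 0.9]),
    "RF":    np.array([1.5, 1.1]),
    "rGbm2": np.array([1.6, 1.4]),
}
weights = {"XGB": 0.3, "gamma": 0.2, "DL": 0.2, "RF": 0.15, "rGbm2": 0.15}
assert abs(sum(weights.values()) - 1.0) < 1e-9   # weights sum to one

blend = sum(w * preds[name] for name, w in weights.items())
```

Note from the summary that the XGB column's max of 1100 will dominate any weighted average on those rows, which matters later.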
It wasn't a terrible idea. Who knows why, but we bumped up a couple spots. Just a couple, but I guess it's something. I'll just choose the top two scores, I suppose.
Ouch. We fell further than anybody else in our range, from 27th down to 45th. Sorry about that. We still got a top 10% out of it, though. It isn't what we were hoping for, to be sure, but I believe it's John's best finish, so we can be happy there. And it's certainly fair to say John did most of the work, so nice job, John!
They've been closing contests fairly quickly lately, but I'll submit some files after the deadline to see what the culprit might have been. If nothing else, I'll get private scores for the 6 models I used in the ensemble and post those.
Thanks for the experience, guys! I was psyched to see the small rise up two spots, and likewise disappointed to see the drop, but as Mark mentioned top 10% is a first for me and I hope to make that the bar going forward. Hope to work with you both again sometime!
It looks like the predicted outliers are doing most of the damage. The main reason that final model helped was that I dialed down the XGBoost contribution to the model.
But it's more than that. With a submission that caps the XGBoost model at 40 and keeps the gamma model down (it wasn't a good model), we would have gotten 34th. Oh well. Should have, could have, would have. But what we did isn't bad. At least it doesn't seem a super model was within our grasp, so we can be content with that.
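That cap is a one-liner; a sketch using the 40 and 0.254 values mentioned in the thread (the data and variable names are made up):

```python
import numpy as np

# Post-hoc fix described above: clamp the XGBoost predictions so the
# 1100-scale outliers can't dominate the MAE.
xgb_pred = np.array([0.1, 2.4, 35.0, 1100.0])
capped = np.clip(xgb_pred, 0.254, 40.0)
```

Under MAE, capping a wildly wrong 1100 down to 40 saves over a thousand units of error on that one row while costing almost nothing when the true value really is large.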
John, if you want to see how your models did before they turn over the leaderboard, you can see using this: https://www.kaggle.com/c/how-much-did-it-rain-ii/leaderboard?submissionId=2269690 Only the person who submitted can see it, so I can't see the XGBoost individual models or Thakur's models.
Well, time to put this one to rest. Sorry it took me so long to get going at the end. The models I ran at the end did OK, but the public/private feedback was misleading, so we wouldn't have known which was which. It is frustrating dealing with such variance across local vs. public vs. private. Learn and move on, right? Not a bad finish at all. Good work; I think we benefited from everybody, so that is nice.
I think I will have to dial down the teams for a little while, though. I became quite unreliable for a big chunk of this competition, and that was also the case for the Deloitte one, and the Rossmann one. Essentially, once H2O World geared up, I became unavailable and have had trouble getting back in it. Which I'm fine with, but I feel guilty being part of a team. So when I get back into doing team ones (hopefully when AutoML is really working for Kaggle well), I'll try and reach out to see if we can do another. Good luck both of you on the current round! Thanks, again! Mark
p.s. I'll probably either remove this repository or make it public. It only takes one "no" vote to keep it private, so if you don't want this repository made public, let me know. I'll probably do one more round of inquiries before I really go through with either one.
If you didn't see it, the winner posted a fantastic write up. Recurrent Neural Networks, complete with a great and easy to digest explanation. Code coming soon, too.
Besides the novelty of what he was doing with features (pivoting and gap-filling, rather than aggregating), this is an interesting thing to note, complete with rationale (20-day/10-day split):
> I began by splitting off 20% of the training set into a stratified (with respect to the number of radar observations) validation holdout set. I soon began to distrust my setup as some models were severely overfitting on the public leaderboard despite improving local validation scores. By the end of the competition I was training my models using the entire training set and relying on the very limited number of public test submissions (two per day) to validate the models, which is exactly what one is often discouraged from doing! Due to the nature of the training and test sets in this competition (see above), I believe it was the right thing to do.
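The stratified holdout the winner describes (splitting on the number of radar observations per gauge) can be sketched roughly like this; everything here is a guess at the setup, not his code:

```python
import numpy as np

def stratified_holdout(n_obs_per_gauge, frac=0.2, seed=0):
    """Split gauge indices into train/validation sets, stratified on the
    number of radar observations per gauge, so both sets see the same
    mix of short and long radar sequences. Hypothetical sketch."""
    rng = np.random.default_rng(seed)
    n_obs = np.asarray(n_obs_per_gauge)
    val_idx = []
    for k in np.unique(n_obs):                     # one stratum per obs count
        ids = np.flatnonzero(n_obs == k)
        rng.shuffle(ids)
        val_idx.extend(ids[: max(1, int(len(ids) * frac))])
    val = np.array(sorted(val_idx))
    train = np.setdiff1d(np.arange(len(n_obs)), val)
    return train, val
```

Sampling within each observation-count stratum keeps the validation set representative of the sequence-length mix, which matters here because gauges with few radar scans behave very differently from gauges with many.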
Thread to discuss submission strategy. With 3 members and 2 submissions per day, it won't be too obvious how to go about this.