mlandry22 opened this issue 8 years ago
I'm currently struggling a bit with the difference in what my GBM models are doing vs the median compared to how the public leaderboard shows those same stats. My GBMs take a median of about 2.6 down to 0.2. The public leaderboard takes a median of 24.17106 down to 23.73444.
Yes, many outliers were removed, but the basis movement should be fairly stable. Removing outliers does take the count of records down, so moving 2.4 units will get diluted when we just add a bunch of immovable points. Still, I don't quite get why 2.4 units on training becomes 0.44 on the public board.
Ha, it's because I trained on the log of the target. Yay for diagnostics, even if it takes me three hours to understand.
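For anyone hitting the same diagnostic, here's a minimal sketch (hypothetical numbers, and Python/numpy rather than the R used in the thread) of why an error measured on the log of the target looks tiny compared to the same error on the raw scale:

```python
import numpy as np

# Illustrative only: training on log1p(target) makes errors look small
# until predictions are mapped back with expm1.
y = np.array([0.5, 1.0, 5.0, 24.0, 100.0])   # made-up rain amounts
pred_log = np.log1p(y) + 0.1                  # a model off by 0.1 in log space

# Error measured in log space looks tiny...
mae_log = np.mean(np.abs(pred_log - np.log1p(y)))

# ...but back on the original scale the same error is much larger,
# especially for large targets.
pred = np.expm1(pred_log)
mae_raw = np.mean(np.abs(pred - y))
```

A constant 0.1 offset in log space inflates into an error proportional to the target on the raw scale, which is exactly the kind of mismatch between training diagnostics and leaderboard numbers described above.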
Tried some probability matching without success. The result basically "downgraded" the values of ~900 in the blended model (which I think came from the RF) to the 30-50 range or so. Since it did worse, maybe that means the RF model or blend had successfully identified some of the outliers?
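For reference, probability matching as described can be sketched like this: rank the blended predictions and replace each one with the value at the same quantile of a reference distribution (e.g. the training targets). The function and data below are hypothetical illustrations, not the code actually used:

```python
import numpy as np

def probability_match(pred, reference):
    """Map `pred` onto the empirical distribution of `reference` by rank:
    the k-th smallest prediction receives the value at the same quantile
    of the reference sample. Hypothetical helper for illustration."""
    ranks = np.argsort(np.argsort(pred))        # 0-based rank of each prediction
    quantiles = (ranks + 0.5) / len(pred)
    return np.quantile(reference, quantiles)

# A huge outlier prediction keeps its rank but gets pulled down
# toward the reference distribution's upper values.
blend = np.array([900.0, 2.0, 0.5, 30.0])
train_targets = np.array([0.3, 0.8, 1.5, 45.0])
matched = probability_match(blend, train_targets)
```

This preserves the ordering of the predictions while forcing their marginal distribution to match the reference, which is why the ~900 values get "downgraded" hard.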
To Mark's earlier point on our score gap, it doesn't seem like it should be so hard to get a little 0.02 bump.
This is the wrong time to try new things out, but with the Marshall-Palmer being deterministic, it seems possible to do this, so I am going to try:
It's easy to do the straightforward one, at least. Perhaps that will help with some diversity.
Might as well try it. I'm going to try one more thing with probability matching and see if there are any gains.
So with 6 more submissions remaining, any priority on how to use them?
I kept screwing up my folds just in my own code, but I did finally get those out. I want to run both of your code as well, but I'm really making a lot of mistakes getting through this. I haven't done the MP full-data thing yet.
I should also run John's features through my R code. If I'm reading the XGBoost code right, the only use of MAE is in printing the output: feval is the final evaluation, the part that prints and controls early stopping. objective = "reg:linear" is the part that calculates the gradient the trees actually fit, and that's squared error, not absolute loss (which isn't trivial to implement either).
R's gbm does compute the gradient of MAE, so with the same features I should be able to get the R model closer to what XGBoost does. It will lose some accuracy from not column-sampling on each tree, but it seems I could get closer than what we have now.
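To sketch the point above: making gradient boosting chase MAE requires a (sub)gradient of absolute loss, whose zero second derivative is what makes it awkward for XGBoost-style solvers. A common smooth stand-in is the pseudo-Huber loss, whose gradient saturates toward sign(residual). This is an illustrative custom-objective shape, not the code from the competition:

```python
import numpy as np

def pseudo_huber_obj(preds, labels, delta=1.0):
    """Smooth approximation to MAE for use as a custom boosting objective
    (hypothetical sketch; reg:linear's gradient is just the residual).
    Returns per-example (gradient, hessian) as XGBoost-style custom
    objectives expect."""
    d = preds - labels
    scale = np.sqrt(1.0 + (d / delta) ** 2)
    grad = d / scale          # ~= d near zero, saturates to sign(d) for large |d|
    hess = 1.0 / scale ** 3   # positive everywhere, unlike true MAE
    return grad, hess

g, h = pseudo_huber_obj(np.array([0.0, 10.0]), np.array([0.0, 0.0]))
```

Near zero this behaves like squared error; for large residuals the gradient is bounded near ±1, so outliers stop dominating the fit, which is the property MAE training buys you.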
Am running my best R gbm settings against John's data (as per the one uploaded here, at least) right now. John, what sort of range should I expect, for MAE? It seems the median on this is about 2.1703 on the training set. If what I have does well, I'd like to submit it to the leaderboard in the morning. If one of these four things happens, I won't need the 5th-to-last submission (our second "today")
So if you haven't heard from me by about noon Pacific (deadline minus 4 hours), assume I'm not getting anything in and somebody else can take that second submission.
Finished that, but the results are a little suspicious. Low tree counts, around 100-400, perform best; performance worsens up to 1000 trees, then improves again until stopping at 1300. But 1300 is still worse than the simpler models on the validation set. That's not very reassuring.

Holdout scores are around 22.95-22.98, against a median baseline of 23.34. If that drop from the median is mirrored on the test set, it would land around the range of the best public script, so not great. Worse, the test set as I calculated it looks quite a bit different from the train set. It's possible that's just because the full-NA rows are removed, in which case there's nothing to worry about. But there doesn't seem to be a compelling reason to submit this today. I'll package it up and email it, in case somebody can submit it if we have nothing better at the deadline.

Mark
Just now seeing this. I've recently been seeing MAEs from 22.1 to 22.8 on the validation set, which has the Ref=NAs removed. The range has not corresponded to the test set closely enough for me to use it reliably. In other words, I would change something, see a drop of 0.1, submit, and then see a rise of 0.1 or more.
If you're still able to email something I can move it forward.
Sent it, but it's outside that range and it just doesn't seem right that it's more accurate on the soft side. Strange curve I can't remember seeing before. I blended the high end and the low end of trees.
So again, if anybody has anything else, use it. Else at least we have one for today. Will be working on this offline tonight hopefully.
It 'only' got 23.77. I noticed the min value is 0.254, which is higher than most of our models, and the max is ~25 compared to 35 for the xgboost. There's also a spike between 0.64 and 0.9 with about half of all values in that range. I don't know if that's good or not, just more than I would expect.
Not much at all, but it got us another 0.004 and 2 places. I started with xgbens-11-04, which upon inspection had a dip in the probability-density curve that probably shouldn't be there. So I fit it to a gamma distribution, except at the tail end, where the gamma started really clipping, I kept the original values. Then I blended it 1:2 with our previous best submission, 90-04-03-03 (which probably uses the same xgb).
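A rough numpy-only sketch of that gamma re-fit, assuming a moment-matched fit and rank-based remapping (the function and parameter names are hypothetical, and empirical gamma quantiles stand in for a proper inverse CDF):

```python
import numpy as np

def gamma_remap(pred, tail_q=0.99, n_ref=100_000, seed=0):
    """Remap predictions onto a moment-matched gamma distribution by rank,
    keeping the original values in the upper tail where the gamma would
    clip them. Sketch of the idea described above, not the actual code."""
    pred = np.asarray(pred, dtype=float)
    m, v = pred.mean(), pred.var()
    shape, scale = m * m / v, v / m            # method-of-moments gamma fit
    ref = np.random.default_rng(seed).gamma(shape, scale, n_ref)
    ranks = (np.argsort(np.argsort(pred)) + 0.5) / len(pred)
    remapped = np.quantile(ref, ranks)
    cutoff = np.quantile(pred, tail_q)
    return np.where(pred < cutoff, remapped, pred)

# The 1:2 blend with the previous best would then be simply:
#   final = (gamma_remap(xgb_ens) + 2.0 * previous_best) / 3.0
```

Everything below the chosen tail quantile gets smoothed onto the gamma shape (removing dips in the density), while the tail keeps the original values, matching the "kept the original values where the gamma started clipping" description.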
Yes, it was intentionally conservative, both on the high and low end. Locally, using a floor of 0.254 beat out 0.01 on both models and the blend I used, so I opted for the conservative model, especially being worried about the volatility of the model from the odd complexity shape.
And about the 0.64 - 0.9, not sure why that would be. I noticed that your XGBoost submission has the highest of just about everything on a summary: mean, median, quartiles. MAE isn't something you can cleanly shift around to reach an optimal number, so it likely isn't the case that if we just shifted my and Thakur's models first, we'd see better performance. And there aren't enough submissions left to tinker in that way either. But chances are good the XGBoost values are decent ones; I just can't quite think of a way to take advantage of that this late.
Very interesting. I noticed the gamma-distribution idea, and I really ought to see what H2O can do there, since it can fit a gamma distribution directly.
Since the model with the highest complexity was starting to improve its accuracy, I am adding 300 more trees to see what happens. If it gets better, at least I'd have something new to try. I might also pair that with trying to aim for the most common values near each prediction. As usual, if anybody else has anything better, go ahead and take the submission.
Nothing else here.
Well the model was better. But...it got in 11 seconds too late. So it's 23.76221, which isn't too bad, but a costly submission deadline mistake.
Costly, costly, as this is the deadline. So our final best answer is all we have left. Thoughts?
No great ideas here. I suppose it would either be another ensemble for maybe a 0.005 gain or something bolder (I don't know what) that has a shot. Maybe run a long deep model on h2o gbm and dilute it slightly with our current best?
Well, I have an interesting gamma model. It appears to test pretty well locally. And something about it seems correct, but that's probably the part of me that hopes it's our 10-spot jump, rather than the logical side that knows something done on the last day is unlikely to be too useful.
Nonetheless, I think that I am going to put a heavy emphasis on this one, just in case.
I'll line up the distributions, and a handful of points and see if I can make heads or tails of it. Starting that now. Plan is a 5-way blend, and we'll get to see the score before we choose it blindly, of course.
Here are the six summaries:
```r
> summary(eAll[,3:8])
      XGB                gamma               DL                RF              rGbm1            rGbm2
 Min.   :   0.1718   Min.   : 0.04505   Min.   : 0.2383   Min.   : 0.3600   Min.   : 0.010   Min.   : 0.2540
 1st Qu.:   1.0682   1st Qu.: 0.66955   1st Qu.: 0.6901   1st Qu.: 0.8407   1st Qu.: 0.813   1st Qu.: 0.7748
 Median :   1.8392   Median : 1.09184   Median : 0.8700   Median : 1.1379   Median : 1.436   Median : 0.8460
 Mean   :   2.1688   Mean   : 1.81431   Mean   : 1.4045   Mean   : 1.5196   Mean   : 1.600   Mean   : 1.3383
 3rd Qu.:   2.4303   3rd Qu.: 2.20905   3rd Qu.: 1.2667   3rd Qu.: 1.4932   3rd Qu.: 1.706   3rd Qu.: 1.2100
 Max.   :1100.0000   Max.   :39.41440   Max.   :66.3600   Max.   :33.5551   Max.   :29.287   Max.   :29.6695
```
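For what it's worth, the blend itself is just a weighted average over model columns like these; a toy sketch (the weights and values here are illustrative, not the ones actually used):

```python
import numpy as np

# Illustrative weighted blend over five of the model columns
# (toy predictions; the real weights are not in the thread).
preds = {
    "XGB":   np.array([2.2, 1.8]),
    "gamma": np.array([1.8, 1.1]),
    "DL":    np.array([1.4, 0.9]),
    "RF":    np.array([1.5, 1.1]),
    "rGbm2": np.array([1.6, 1.4]),
}
weights = {"XGB": 0.3, "gamma": 0.2, "DL": 0.2, "RF": 0.15, "rGbm2": 0.15}
assert abs(sum(weights.values()) - 1.0) < 1e-9   # weights sum to one

blend = sum(w * preds[name] for name, w in weights.items())
```

Note from the summary that the XGB column's max of 1100 will dominate any weighted average on those rows, which matters later.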
It wasn't a terrible idea. Who knows why, but we bumped up a couple spots. Just a couple, but I guess it's something. I'll just choose the top two scores, I suppose.
Ouch. We fell further than anybody else in our range, from 27th down to 45th. Sorry about that. We still got a top 10% out of it, though. It isn't what we were hoping for, to be sure, but I believe it's John's best finish, so we can be happy there. And it's certainly fair to say John did most of the work, so nice job, John!
They've been closing contests fairly quickly lately, but I'll submit some files after the deadline to see what the culprit might have been. If nothing else, I'll get private scores for the 6 models I used in the ensemble and post those.
Thanks for the experience, guys! I was psyched to see the small rise up two spots, and likewise disappointed to see the drop, but as Mark mentioned top 10% is a first for me and I hope to make that the bar going forward. Hope to work with you both again sometime!
It looks like the predicted outliers are doing most of the damage. The main reason that final model helped was that I dialed down the XGBoost contribution to the model.
But it's more than that. With a submission that caps the XGBoost model at 40 and keeps the gamma model down (it wasn't a good model), we would have gotten 34th. Oh well. Should have, could have, would have. But what we did isn't bad. At least it doesn't seem a super model was within our grasp, so we can be content with that.
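That cap is a one-liner; a sketch using the 40 and 0.254 values mentioned in the thread (the data and variable names are made up):

```python
import numpy as np

# Post-hoc fix described above: clamp the XGBoost predictions so the
# 1100-scale outliers can't dominate the MAE.
xgb_pred = np.array([0.1, 2.4, 35.0, 1100.0])
capped = np.clip(xgb_pred, 0.254, 40.0)
```

Under MAE, capping a wildly wrong 1100 down to 40 saves over a thousand units of error on that one row while costing almost nothing when the true value really is large.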
John, if you want to see how your models did before they turn over the leaderboard, you can see using this: https://www.kaggle.com/c/how-much-did-it-rain-ii/leaderboard?submissionId=2269690 Only the person who submitted can see it, so I can't see the XGBoost individual models or Thakur's models.
Well, time to put this one to rest. Sorry it took me so long to get going at the end. The models I ran at the end did OK, but the public/private feedback was misleading, so we wouldn't have known which was which. It is frustrating dealing with such variance across local vs. public vs. private. Learn and move on, right? Not a bad finish at all. Good work; I think we benefited from everybody, so that is nice.
I think I will have to dial down the teams for a little while, though. I became quite unreliable for a big chunk of this competition, and that was also the case for the Deloitte one, and the Rossmann one. Essentially, once H2O World geared up, I became unavailable and have had trouble getting back in it. Which I'm fine with, but I feel guilty being part of a team. So when I get back into doing team ones (hopefully when AutoML is really working for Kaggle well), I'll try and reach out to see if we can do another. Good luck both of you on the current round! Thanks, again! Mark
p.s. I'll probably either remove this repository or make it public. It only takes one "no" vote to keep it private, so if you don't want this repository made public, let me know. I'll probably do one more round of inquiries before I really go through with either one.
If you didn't see it, the winner posted a fantastic write up. Recurrent Neural Networks, complete with a great and easy to digest explanation. Code coming soon, too.
Besides the novelty of what he was doing with features (pivoting and gap-filling, rather than aggregating), this is an interesting thing to note, complete with rationale (20-day/10-day split):
> I began by splitting off 20% of the training set into a stratified (with respect to the number of radar observations) validation holdout set. I soon began to distrust my setup as some models were severely overfitting on the public leaderboard despite improving local validation scores. By the end of the competition I was training my models using the entire training set and relying on the very limited number of public test submissions (two per day) to validate the models, which is exactly what one is often discouraged from doing! Due to the nature of the training and test sets in this competition (see above), I believe it was the right thing to do.
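The stratified holdout the winner describes (splitting on the number of radar observations per gauge) can be sketched roughly like this; everything here is a guess at the setup, not his code:

```python
import numpy as np

def stratified_holdout(n_obs_per_gauge, frac=0.2, seed=0):
    """Split gauge indices into train/validation sets, stratified on the
    number of radar observations per gauge, so both sets see the same
    mix of short and long radar sequences. Hypothetical sketch."""
    rng = np.random.default_rng(seed)
    n_obs = np.asarray(n_obs_per_gauge)
    val_idx = []
    for k in np.unique(n_obs):                     # one stratum per obs count
        ids = np.flatnonzero(n_obs == k)
        rng.shuffle(ids)
        val_idx.extend(ids[: max(1, int(len(ids) * frac))])
    val = np.array(sorted(val_idx))
    train = np.setdiff1d(np.arange(len(n_obs)), val)
    return train, val
```

Sampling within each observation-count stratum keeps the validation set representative of the sequence-length mix, which matters here because gauges with few radar scans behave very differently from gauges with many.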
Thread to discuss submission strategy. With 3 members and 2 submissions per day, it won't be too obvious how to go about this.