mlandry22 opened this issue 9 years ago
Thread to discuss submission strategy. With 3 members and 2 submissions per day, it won't be too obvious how to go about this.
I think for the default strategy we can try this: John and Thakur get 1/day, and if I am paying attention as the deadline rolls around (it will soon be 4pm Pacific, a good time on weekdays) I will claim one.
I think that default strategy will work pretty well because my plan for this competition is to have a large variety of attempts available, so I'll probably try and post what the holdout scores are as I create them, and then we can decide if we ever want to use one. I won't be able to submit everything I generate anyway, so I will allow you guys to submit what you want.
So we can discuss/post here to try and plan better, but unless it's clear that we have something interesting going on, you each can have one a day, I think.
Sounds good to me.
It works for me. I'll try to give a heads up if not using one.
Lately I've been sending a submission a few hours after the previous day's deadline. So, I have already submitted for "today" and we have one remaining.
I have had one ready for whenever there was a free slot. It's a semi-useless blend, but something I wanted to try. But I see you have since bumped up the performance of the XGB contribution, so I'll probably take the (deadline - 10 minutes) slot for some kind of blend there, unless either of you has something specific you want to test that will gain us more insight. If so, by all means, go ahead.
Nice going, John. I put your best and my single best together at 70/30 and we bumped up 2 spots.
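For reference, a minimal sketch of that kind of weighted blend of two submission files (the file names and the assignment of the 70/30 weights below are placeholders, not our actual submissions):
library(data.table)
sub_a <- fread("john_best.csv")   # placeholder file names
sub_b <- fread("mark_best.csv")
setkey(sub_a, Id); setkey(sub_b, Id)
blend <- sub_a[sub_b]             # join on Id; sub_b's prediction column becomes i.Expected
blend[, Expected := 0.7 * Expected + 0.3 * i.Expected]
write.csv(blend[, .(Id, Expected)], "blend_70_30.csv", row.names = FALSE)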
We are "leading the league" in at least one category: most submissions! In fact, we are probably just under the max allowable for teaming up. The competition has been running since 9/17, so the max submissions is around 82 or 84, and we are at 79.
Glad we made it under the limit and popped up a couple spots. I have some new developments that should be good. Will tie them into the main thread and post code in the next 24 hours.
Didn't get us a better position tonight, but improved the XGB score to 23.7687, just .0015 below our current best.
Nice. That makes it our leading single model, so surely we'll improve on any combination. But I will resist the urge to do such a thing until we're next out of ideas. I'll try and have something ready from R, in case Thakur is still working on his by the deadline tomorrow. Thakur, if you have something, that 2nd one is all yours.
John, are you using some sort of validation set to gauge local improvement as well? No problem if you aren't. We'll want to eventually, but it's still plenty early.
I will set up sklearn GBM + spearmint to run, and whatever best parameters spearmint is able to find before the deadline, I will submit that. Since I will do 5-fold, I might be able to report CV scores also.
Any suggestions for doing proper validation, or anything you have found that matches closely with the LB?
Yes I have a basic validation set in use. I carved out 1/5 for validation and use the other 4/5 for training (keeping ID integrity). It seems to be directionally correct and is in the neighborhood of agreement with the public LB (low 20's).
I find that if I run CV on xgboost, those scores are much lower (between 2-3). I think it's because outliers are removed before running the model, and a big part of the MAE comes from those mysterious outliers.
@JohnM914 I am still at the office and won't reach home for another hour. Not sure if I will be able to generate a submission in time. Feel free to use the slot if you have something ready.
Anyone have anything good to submit? I just started playing around and can try submitting an H2O NN if you guys don't have something good to test.
I think it's all yours Thakur. John got his in last night, so take your shot. I will let you know when I have something very useful. My scripting is working, but I need to deploy it to a system where I'm comfortable letting it go for hours on end. So far it's on my everyday laptop. Step by step. Will have something I like when it's all done.
LB Score : 23.82631
I would say not bad at all, since NN was very simple. Below are the parameters I have used.
I will save the code for each of my submissions in a new file and keep the naming convention of the submission and code file the same, e.g. T0001.R and T0001.csv.
Awesome! I'll make sure and add some h2o.deeplearning configs in the thing that runs all weekend so we can see if that configuration can be beat. Don't let that stop you from doing the same. But that's a perfect utilization of what I want, and will be nice to let some H2O servers run all weekend.
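A rough sketch of the kind of h2o.deeplearning run that could go into the weekend batch; the frame names, predictor list, and parameters below are placeholders, not Thakur's actual configuration:
library(h2o)
h2o.init(nthreads = -1)
train_h2o <- as.h2o(tr)     # tr: collapsed per-Id training features (placeholder name)
valid_h2o <- as.h2o(val)    # val: held-out fold
predictors <- setdiff(colnames(tr), c("Id", "expected"))
dl <- h2o.deeplearning(x = predictors, y = "expected",
                       training_frame = train_h2o,
                       validation_frame = valid_h2o,
                       activation = "Rectifier",
                       hidden = c(128, 128),
                       epochs = 20,
                       l1 = 1e-5)
preds <- as.data.frame(h2o.predict(dl, valid_h2o))
mean(abs(preds$predict - val$expected))   # validation MAE to compare against the LB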
Guess who's in the top 10, boys? Us!! Code is deposited in the folder. I'll be taking a break until Sunday night for family time and won't be submitting for a few days. Plus, it's time for me to take a break from the terminal and view the problem from a distance for a short while. Have a great weekend!
Awesome!! Ok, I said awesome twice. But adding deep learning to the mix and whatever John just did to get us up 12 spots is great. Will take a look at what it was.
WOW
Have a great weekend, John. I've seen the code and now I see the following notes in your pair of submissions:
Fri, 30 Oct 2015 04:16:50
more features 95/5 blend with sample non-tuned model
xgb-10-29-4.csv
23.74293
...
Thu, 29 Oct 2015 03:47:52
added new features 78/22 blend with sample
xgb-10-27-4.csv
23.76872
So this is looking great. We have an R gbm, XGBoost GBM, and H2O deep learning, all closing in on similar independent scores, so we are in good shape for some nice variance in our models. At some point, we'll probably want to look into stacking these: learning weights, rather than the guess-and-check simple average method we're doing (which is fine for now).
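On the stacking point, a minimal sketch of learning a blend weight on a validation fold instead of guess-and-check; p_xgb, p_dl, and y are hypothetical vectors of validation predictions and the true Expected values for the same Ids:
mae_at <- function(w) mean(abs(w * p_xgb + (1 - w) * p_dl - y))   # blend MAE as a function of the weight
best <- optimize(mae_at, interval = c(0, 1))                      # 1-D search over w in [0, 1]
best$minimum    # learned weight on the xgb predictions
best$objective  # validation MAE at that weight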
Hi guys - I made a new submission tonight with better local CV but it did not do well on the LB. It was the same XGb model, with one bug fix to a calculated field, and blended 85/15 or so with all zeros (essentially scaling down the values.) I thought it would do better, but no.
I'll look back to features and see if there are gains to be made there unless there is another avenue we need to explore.
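On the 85/15 blend with all zeros above: that is equivalent to rescaling the predictions, so the same submission could be produced directly (file names are placeholders):
library(data.table)
sub <- fread("xgb_submission.csv")   # placeholder file name
sub[, Expected := 0.85 * Expected]   # 0.85*pred + 0.15*0 == 0.85*pred
write.csv(sub, "xgb_scaled_085.csv", row.names = FALSE)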
Did some work on features, cutpoints, etc. and got .10-.15 gain locally but nothing on the LB. Finally got some small gain with an ensemble of the two best xgb models. I'm thinking this branch of development is about played out and will spend my next effort trying to identify some of those outliers that contribute to the bulk of MAE.
Hi guys. Do either of you have plans for the second submission today? If not, I'll use it but no problem waiting til later either.
No plan from my side
Go ahead, John. Thanks for pushing us forward. I'll catch up once we get our conference over next week :-)
No problem. Unfortunately my last two submissions were bricks! But, I might be onto something. Will post it as an issue and maybe you guys can build/ course correct when things settle down.
Cool. Please post it whenever you are ready. I recently left my job and was busy with knowledge transfer, but now I am completely free. I will go through all the features you and Mark have created and will start running some models.
Trying deep learning out now. Bigger models than the one Thakur listed here are not working at all. So I'm going back and starting with just what you had running. Could also be that my features, as they are, aren't good for it.
I tried more tinkering with features but couldn't get any improvements. Also ran through with a h2o RF model but still in the same range. I'll go back to looking for outliers to see if we have any luck there. In this strangely-placed post https://www.kaggle.com/c/how-much-did-it-rain-ii/forums/t/17317/just-curious-what-s-the-mouse-over-number-on-scores the comp admin seems to indicate that it's not possible to find the outliers in the test data.
So what do we want as an overall strategy for the last week? I made some (very) slight progress with the outlier detection tonight and can pursue that. I've tried more feature engineering with rainfall prediction, but hit many dead ends. One of the main challenges I've had is flaky validation. I'm doing a simple leave-out-20% of the data, which sometimes works and sometimes doesn't. Open to suggestions...
Anyway, how should we proceed? I think with just some ensembling and simple tweaks we can move up a few spots and who knows, maybe we pull it together and jump back to top 10?
The logistic regression isn't off to a great start. It bottomed out shortly after. I'm trying to figure out if it can still be informative, particularly when stacked with something like nlopt for MAE. But let's not hold our breath.
Therefore, back to John's question about the overall strategy. And I like the two concepts mentioned.
For flaky validation, I can provide some horsepower to run 5-fold, rather than a single 20%. We can compare validation scores across each and see how that works out.
That will also give us some ensembling direction. And with what we have, that's probably a good idea, though I know it didn't work out too well on the first attempt (again, I have had that same experience before, too).
Other ideas?
As it stands, I think we should share a list of folds so that we are all using the same validation sets. That way we can all start from the same first fold, if that's all we want to use at first, and keep it consistent.
Yes, I agree sharing validation sets would be a good move. Some interesting traffic today on the forum: https://www.kaggle.com/c/how-much-did-it-rain-ii/forums/t/16680/cross-validating Several others have had trouble with classification carrying over from train to test. There's something different about those days that throws us off. For outlier detection, I tried to stay away from "rainfall-related" variables and look at "site specific" variables like radardist, sd(Zdr) and such but still no luck.
Thakur - any insight from outlier detection or other pursuits? I could use some inspiration!
Ok, so John, I'm going to use this as a basis for creating the folds, so that scores should be on par with what you're using, and then we can get 4 other sets to start using:
# create an interim validation set
idnumsv <- unique(trraw[, Id])                           # all Ids in the raw training data
validx <- sample(1:length(idnumsv), length(idnumsv)/5)   # hold out ~1/5 of the Ids
valraw <- trraw[Id %in% validx, ]                        # raw rows for the held-out Ids
trraw <- trraw[!Id %in% validx, ]                        # remaining rows stay in training
val <- collapsify(valraw)                                # collapse to one row per Id (collapsify defined elsewhere)
val <- val[!is.na(val[, wref]), ]                        # drop Ids where wref is missing
setDF(val)
That comes with a seed set up top which should ensure the sampling is consistent. I'll do that and post a 2-column table of IDs and folds. Then I'll try and adapt our best code and run them all on 5-fold.
Speaking of, I'll try and catch up with everything to answer this question myself, but our leading models are something like
Others I've overlooked as I was distracted for a few weeks?
Sorry for the delay, but I am back. I have been a bit busy with some other stuff. I will pick up the features created by you and John, start improving the NN, and try to optimize sklearn's GBM directly on MAE.
Since I was away, I will be happy if you guys assign me something.
@JohnM914 Instead of trying to find the outliers directly, did you try generating classification CV probabilities and using them as features in the regression model? I am going to give auto-encoders a try for outlier detection after I leave the office today.
I briefly went down the path with k-means but looking back probably didn't do it right. I drew up a schematic earlier today that I can try with classification feeding into regression. In theory it should work but it looks like many others have tried and failed. Probably because they're not as smart as us, right? Joking of course. I'll give it a go in the next 24 hours and see anyway. Brain is wearing out this evening - midnight my time...
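A minimal sketch of that classification-into-regression idea, under assumptions not stated in the thread: tr is a collapsed per-Id data frame with the expected target and the shared fold column, "outlier" is crudely taken as Expected > 50, and glm stands in for whatever classifier we would actually use:
tr$isout <- as.integer(tr$expected > 50)   # crude outlier label (threshold is an assumption)
tr$p_out <- NA_real_
for (k in unique(tr$fold)) {
  fit <- glm(isout ~ . - expected - Id - fold - p_out,
             data = tr[tr$fold != k, ], family = binomial)
  tr$p_out[tr$fold == k] <- predict(fit, tr[tr$fold == k, ], type = "response")
}
# p_out is now an out-of-fold probability column the regression model can use as a feature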
Here is my latest on features that might predict outliers, which by my way of thinking are different than features to predict rainfall. I tried to focus on features that would be specific to a site: radar calibration, geography, local interference, timing intervals, etc.
# collapse the raw per-reading rows to one row per Id with "site-specific" summary
# features (bigflag, negflag and timespans are assumed to be added upstream)
collapsify2 <- function(dt) {
dt[, .(expected = mean(Expected, na.rm = T)
, bigflag = mean(bigflag, na.rm = T)
, negflag = sum(negflag, na.rm = T)
, records = .N
, timemean = mean(timespans, na.rm = T)
, timesum = sum(timespans, na.rm = T)
, timemin = min(timespans, na.rm = T)
, timemax = max(timespans, na.rm = T)
, timesd = sd(timespans,na.rm = T)
, minssum = sum(minutes_past, na.rm = T)
, minsmax = max(minutes_past, na.rm = T)
, minssd = sd(minutes_past, na.rm = T)
, zdrmax = max(Zdr, na.rm = T)
, zdrmin = min(Zdr, na.rm = T)
, zdrsd = sd(Zdr, na.rm = T)
, kapsd = sd(Kdp*radardist_km, na.rm = T)
, rhosd = sd(RhoHV, na.rm = T)
, rhomin = min(RhoHV, na.rm = T)
, rd = mean(radardist_km, na.rm = T)
, refcdivrd = max((RefComposite-Ref)/radardist_km, na.rm = T)
, c1 = max(Zdr/Ref, na.rm = T)
, c2 = max(RefComposite/Ref, na.rm = T)
, c3missratio = sum(is.na(RhoHV))/.N
, refmissratio = sum(is.na(Ref))/.N
, refcmissratio = sum(is.na(RefComposite))/.N
), Id]
}
Here's what I got for feature importance. As mentioned, this did well locally but not with the test set.
John, on that same idea, we might try to compute peculiar single readings: for example, if rhomin is way out of range while everything else is in the normal range. That's a hard feature for trees to find, despite what people claim about interactions. We'd probably want normalized values of each reading from the pre-collapsed set, (reading - colMean)/colSD, and then something like the max value per ID divided by the mean value per ID (to accent the outlier). There are a couple of things to look for, and playing around with the numbers can probably yield a few such features.
Things you don't want this to flag:
Things you would want this to flag:
Likely an important part about making this work efficiently is to make a single flag for all features; either:
Another thing that seems silly but worthwhile, if it's there, is whether the readings are internally inconsistent: if the 10th percentile is greater than the 50th, or the 50th greater than the 90th. Not likely the case, but if it happens, it might be interesting to look at.
...I think this verifies that no silly mistakes are out there.
t<-fread("train.csv")
t[!is.na(Ref_5x5_10th) & !is.na(Ref_5x5_50th),sum(Ref_5x5_10th>Ref_5x5_50th)]
t[!is.na(Ref_5x5_90th) & !is.na(Ref_5x5_50th),sum(Ref_5x5_50th>Ref_5x5_90th)]
t[!is.na(RefComposite_5x5_10th) & !is.na(RefComposite_5x5_50th),sum(RefComposite_5x5_10th>RefComposite_5x5_50th)]
t[!is.na(RefComposite_5x5_90th) & !is.na(RefComposite_5x5_50th),sum(RefComposite_5x5_50th>RefComposite_5x5_90th)]
t[!is.na(RhoHV_5x5_10th) & !is.na(RhoHV_5x5_50th),sum(RhoHV_5x5_10th>RhoHV_5x5_50th)]
t[!is.na(RhoHV_5x5_90th) & !is.na(RhoHV_5x5_50th),sum(RhoHV_5x5_50th>RhoHV_5x5_90th)]
t[!is.na(Zdr_5x5_10th) & !is.na(Zdr_5x5_50th),sum(Zdr_5x5_10th>Zdr_5x5_50th)]
t[!is.na(Zdr_5x5_90th) & !is.na(Zdr_5x5_50th),sum(Zdr_5x5_50th>Zdr_5x5_90th)]
t[!is.na(Kdp_5x5_10th) & !is.na(Kdp_5x5_50th),sum(Kdp_5x5_10th>Kdp_5x5_50th)]
t[!is.na(Kdp_5x5_90th) & !is.na(Kdp_5x5_50th),sum(Kdp_5x5_50th>Kdp_5x5_90th)]
And then here is a simple way to get started with looking for the weird stuff:
library(data.table)
t<-fread("train.csv")
# initialize the z-score columns to 0 so rows with a missing reading contribute z = 0
t[,zRef:=minutes_past*0]
t[,zRefComposite:=minutes_past*0]
t[,zRhoHV:=minutes_past*0]
t[,zZdr:=minutes_past*0]
t[,zKdp:=minutes_past*0]
# z-score each reading over its non-missing rows
t[!is.na(Ref),zRef:=scale(Ref, center = TRUE, scale = TRUE)[,1]]
t[!is.na(RefComposite),zRefComposite:=scale(RefComposite, center = TRUE, scale = TRUE)[,1]]
t[!is.na(RhoHV),zRhoHV:=scale(RhoHV, center = TRUE, scale = TRUE)[,1]]
t[!is.na(Zdr),zZdr:=scale(Zdr, center = TRUE, scale = TRUE)[,1]]
t[!is.na(Kdp),zKdp:=scale(Kdp, center = TRUE, scale = TRUE)[,1]]
# per-row aggregates of the z-scores across the five readings
t[,maxZ:=pmax(zRef,zRefComposite,zRhoHV,zZdr,zKdp)]
t[,meanZ:=(zRef+zRefComposite+zRhoHV+zZdr+zKdp)/5]
t[,meanNonZeroZ:=(zRef+zRefComposite+zRhoHV+zZdr+zKdp)/(1+ifelse(zRef==0,0,1)+ifelse(zRefComposite==0,0,1)+ifelse(zRhoHV==0,0,1)+ifelse(zZdr==0,0,1)+ifelse(zKdp==0,0,1))]
t[,maxAbsZ:=pmax(abs(zRef),abs(zRefComposite),abs(zRhoHV),abs(zZdr),abs(zKdp))]
t[,meanAbsZ:=(abs(zRef)+abs(zRefComposite)+abs(zRhoHV)+abs(zZdr)+abs(zKdp))/5]
t[,meanAbsNonZeroZ:=(abs(zRef)+abs(zRefComposite)+abs(zRhoHV)+abs(zZdr)+abs(zKdp))/(1+ifelse(zRef==0,0,1)+ifelse(zRefComposite==0,0,1)+ifelse(zRhoHV==0,0,1)+ifelse(zZdr==0,0,1)+ifelse(zKdp==0,0,1))]
# note: ratioMaxAbs_meanAbs (the per-row ratio of the max |z| to the mean |z|) is used
# below; its construction didn't make it into this snippet
t2<-t[,.(maxRatio=max(ratioMaxAbs_meanAbs),Expected=mean(Expected)),Id]
t2[,.(median=median(Expected),mean=mean(Expected),meanCapped=mean(pmin(Expected,50)),.N),round(maxRatio,1)][order(round)]
But doing all that doesn't seem convincing at first:
round median mean meanCapped N
1: 0.0 0.7620004 326.2158641 12.6169167 410096
2: 1.3 0.5080003 0.7257147 0.7257147 7
3: 1.4 0.5080003 87.5228872 8.8886983 109
4: 1.5 0.5080003 172.9487308 7.5987424 3490
5: 1.6 0.7620004 63.0267645 4.9785835 25942
6: 1.7 1.0160005 56.0237884 4.9542465 16917
7: 1.8 1.0160005 47.6335496 4.8691934 15032
8: 1.9 1.0160005 48.6061699 5.0389237 10765
9: 2.0 0.7620004 75.2349484 5.7090180 36310
10: 2.1 1.0160005 41.9078220 4.9771725 18475
11: 2.2 1.0160005 41.8893130 4.8625946 19920
12: 2.3 1.0160005 40.0369639 4.7955911 19967
13: 2.4 0.7620004 39.2674235 4.3239085 19865
14: 2.5 1.0160005 37.2729321 4.5524655 34591
15: 2.6 1.0750006 27.6358265 4.5071043 37173
16: 2.7 1.2700007 23.6880710 4.6437057 48376
17: 2.8 1.2700007 23.7996848 4.7349457 44136
18: 2.9 1.2700007 24.2110894 4.6691191 66051
19: 3.0 1.2700007 18.1062962 4.4207018 37647
20: 3.1 1.2700007 20.9359522 4.2279012 32788
21: 3.2 1.2700007 19.5954574 4.2232663 31539
22: 3.3 1.2700007 18.2001923 4.1939458 30618
23: 3.4 1.2700007 15.8351303 3.8901064 29082
24: 3.5 1.2700007 17.1309199 3.9404645 27212
25: 3.6 1.2700007 16.4194614 3.8531756 24567
26: 3.7 1.0160005 18.8579310 3.9361624 22868
27: 3.8 1.2700007 12.7421878 3.8746778 19899
28: 3.9 1.0160005 12.8914775 3.7394227 17744
29: 4.0 1.2700007 9.4873132 3.7121392 14757
30: 4.1 1.2700007 14.7490870 3.7254200 12581
31: 4.2 1.2700007 12.2639715 3.5607097 10848
32: 4.3 1.2700007 14.3822789 3.6191568 9120
33: 4.4 1.2700007 18.9320663 4.0976536 7676
34: 4.5 1.2650007 18.3044052 3.7352882 6296
35: 4.6 1.2700007 14.6096468 3.6498686 4871
36: 4.7 1.0160005 16.4558360 3.7568519 3906
37: 4.8 1.2700007 16.3944929 3.8032243 2927
38: 4.9 1.2700007 12.0857711 4.0340287 2205
39: 5.0 1.2700007 5.3350087 3.4537775 1540
40: 5.1 1.2700007 17.3513730 4.2926483 1088
41: 5.2 1.0750006 38.6896190 4.4416716 761
42: 5.3 1.2700007 23.8237763 4.3680016 537
43: 5.4 1.0160005 3.8097636 3.0975536 344
44: 5.5 1.5240008 7.0909679 4.4143251 167
45: 5.6 1.0160005 92.3979459 5.2137943 87
46: 5.7 0.6350003 1.3112734 1.3112734 44
47: 5.8 3.5560020 3.9793354 3.9793354 3
48: 5.9 0.2540001 0.2540001 0.2540001 1
round median mean meanCapped N
Not an elegant way of getting it done, but here is a way to get folds that should be consistent with what John has been doing. John's data should check out with fold 0.
I'll try and get the CSV zipped and posted, but might email it. Here is R code to get it.
library(data.table)
library(readr)
library(dplyr)
set.seed(333)                                     # fixed seed so the folds are reproducible
trraw <- fread("train.csv", select = selection)   # `selection` = columns to read, defined elsewhere
idnumsh <- unique(trraw[, Id])
s <- sample(1:length(idnumsh))                    # random permutation of the Id positions
s2 <- floor(s/((1+length(idnumsh))/5))            # map the permutation to folds 0-4
foldTable <- as.data.frame(idnumsh)
foldTable$fold <- s2
colnames(foldTable)[1] <- "Id"
write.csv(foldTable, "foldTable.csv", row.names = F)
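And a sketch of how any of us could consume that table once it's posted; per the note above, fold 0 should line up with John's current holdout (variable names are placeholders):
foldTable <- read.csv("foldTable.csv")
trraw <- merge(trraw, foldTable, by = "Id")   # attach the fold label to every raw row
valraw <- trraw[trraw$fold == 0, ]            # validation = fold 0
trnraw <- trraw[trraw$fold != 0, ]            # train on the other four folds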
I'm running my code through those five folds now. I won't have an opportunity to submit by today's deadline. But I ought to have all five models done with plenty of time for tomorrow's.
Cool. I may have something in time. Setting up my outliers detector and classifier now.
It's nice that you've been pushing the best single model. I was going to get an ensemble set up in case we didn't have two for the deadline.
Are either of you using the Marshall Palmer directly? It's likely not a bad feature, but I don't think I'm using it.
Well I went ahead and tried the ensemble given that it seems like we have either 0 or 1 other submissions today. It helped, but not much. As expected, mostly.
At least it's some progress. I ran the classifier based on flags along the lines of what Mark described. It worked very well on the validation set but as before, it did not do any good on the test set.
I'll finally leave this avenue and go back to trying probability matching. I didn't get any boost fitting to a standard gamma distribution, but an ensemble approach might work. The woman who wrote this paper seems to think it has merit. Ebert_2001_PME.pdf
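One way to read the probability-matching idea, as a very rough sketch rather than the paper's actual method: remap the predictions so their distribution matches the training Expected distribution (pred and train_y are hypothetical vectors):
# quantile mapping: each prediction is replaced by the training-target quantile at its rank
matched <- as.numeric(quantile(train_y, probs = rank(pred) / (length(pred) + 1), type = 8))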
Mark - I used rates calculated by Marshall-Palmer as a feature for xgboost and it was ranked #1 for feature importance. I used this modified version after trying different parameters in a direct calculation of MAE on the training data.
ratemm = ((10^(Ref/10))/170) ^ (1/2)
precipmm = sum(timespans * ratemm, na.rm = T) # grouped by Id
where timespans is the fraction of the hour from the previous reading to the current reading associated with the Ref value.
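A sketch of that feature as a per-Id data.table aggregation, assuming timespans has already been computed on the raw table t:
t[, ratemm := ((10^(Ref/10)) / 170)^(1/2)]                               # John's modified Marshall-Palmer rate
mp <- t[, .(precipmm = sum(timespans * ratemm, na.rm = TRUE)), by = Id]  # hourly precipitation per Id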
OK, got the folds run last night. But I might have underestimated the variance in what we are each aiming to predict. The Id fold table I sent should be universal, so no exclusions. But John capped Expected at 50, I think I saw, and I had previously capped at 70. So we'll be covering slightly different spaces. I suppose if we do any stacking, we'll only do it on Ids where we have full predictions, and that's likely the best approach anyway.
So that said, here is a statement regarding the variance of my folds, which is fairly large, I think:
Fold 1: 0.2035, fold 2: 0.2067, fold 3: 0.2001, fold 4: 0.2031, fold 5: 0.2014
So it's shifting at the third decimal place, which isn't too bad: we just bumped our score up by roughly the size of that variance and it was only worth one spot.
Tough to know what to make of it, but I can provide predictions for the full holdout set I was using and then the test set.
For a little inspiration, the amount we moved on our last submission is almost 1/4 of what we need to get 10th, as it stands. So it's certainly achievable. Now we just need some clue of how to do it 4.5 more times ;-)