Okay guys,
I did find something to at least bring to your attention. In cookie_basic.csv, there appears to be a small cluster all by itself in the bottom-left corner. I'm not sure whether this cluster is just nuisance data or whether there might be something here.
I basically looked at anonymous_5, 6, and 7, since these were readily numerical, and applied PCA to them. When you start to carefully evaluate the points in that cluster, most of them have 0 in anonymous_7. I'm going to upload a csv file that has the index position of the rows in question along with the drawbridge_id and cookie_id.
I'm going to look back on our submissions and see what the prediction was for these IDs.
Rob
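A minimal sketch of the kind of PCA check described above, assuming cookie_basic.csv has numeric anonymous_5/6/7 columns plus drawbridge_handle and cookie_id (file and column names are assumptions, not the exact code used):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Assumed layout: one row per cookie with numeric anonymous_5/6/7 columns
cookies = pd.read_csv("cookie_basic.csv")
X = cookies[["anonymous_5", "anonymous_6", "anonymous_7"]].astype(float).values

# First two principal-component scores; the outlying cluster shows up in a scatter of these
scores = PCA(n_components=2).fit_transform(X)
cookies["pc1"], cookies["pc2"] = scores[:, 0], scores[:, 1]

# Eyeball the extreme corner of the score plot
print(cookies.sort_values(["pc1", "pc2"]).head(20)[["drawbridge_handle", "cookie_id", "anonymous_7"]])
```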
I grabbed our latest submission and checked whether any of the cookie_ids from the lower left were present... none were. Going to walk the drawbridge id to device to see if anything shakes loose.
Okay, I used the drawbridge ids from the lower left and linked them up with the training set. There were 42 common drawbridge ids present. Basically, it looks like our model failed to train on these. Since our submission didn't have any of the values in the lower left, we could pick up a few positions if we were to maybe build a model for these and then submit?
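A rough sketch of that check, assuming the cluster rows were saved with a drawbridge_handle column and the training devices live in dev_train_basic.csv (both names assumed):

```python
import pandas as pd

cluster = pd.read_csv("lower_left_cluster.csv")   # the uploaded cluster file (name assumed)
train = pd.read_csv("dev_train_basic.csv")        # training devices (name assumed)

common = set(cluster["drawbridge_handle"]) & set(train["drawbridge_handle"])
print(len(common))   # the thread reports 42 ids in common
```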
Nice job, Sudalai. As you expected, we bumped up a spot.
11 ↑15 H2O.ai & SRK Team 0.853283
You improved on your best score by 0.002802.
You just moved up 1 position on the leaderboard.
I've been working on crosswalking the unknown devices to other devices with a known drawbridge ID. I'm almost done, but I didn't want to risk not getting a reasonable submission in. There are a ton of devices that share IPs, so choosing 1 or more will not be easy. I'm planning on trying it by count of shared IPs to get something in place. Maybe it will be useful, but there seem to be too many choices available for it to be easy to get right without a proper model.
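A sketch of the shared-IP counting idea, assuming a long-format (id, ip) table like id_all_ip.csv and placeholder id sets (all names assumed):

```python
import pandas as pd

ip_pairs = pd.read_csv("id_all_ip.csv")[["device_or_cookie_id", "ip"]]  # column names assumed

unknown_devices = {"id_1000001"}   # placeholder: devices with no known drawbridge handle
known_devices = {"id_2000002"}     # placeholder: devices with a known handle

unknown = ip_pairs[ip_pairs["device_or_cookie_id"].isin(unknown_devices)]
known = ip_pairs[ip_pairs["device_or_cookie_id"].isin(known_devices)]

# Count shared IPs for every unknown/known pair that has at least one IP in common
pairs = unknown.merge(known, on="ip", suffixes=("_unknown", "_known"))
counts = (pairs.groupby(["device_or_cookie_id_unknown", "device_or_cookie_id_known"])
               .size().reset_index(name="shared_ips"))

# Keep the known device with the most shared IPs for each unknown device
best = counts.sort_values("shared_ips", ascending=False).drop_duplicates("device_or_cookie_id_unknown")
```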
Rob, your graph looks cool. Those stand out a ton. Can you explain what the takeaway is?
Here's my read, but correct me if I'm wrong: PCA against cookies showed some obvious points that don't look anything like the majority of the data. Our submission included none of these cookies. But some of those cookies have a drawbridge handle. What would "fail to train" mean? I don't think we know whether they are "claimed" by devices in the train set. Do they share IPs with any devices in the test set?
I'm interested, I just don't know what the action is. Always cool to see clear outliers like that. Do you know if there are particular fields that drive them there?
Mark,
I'll try my best to explain. Since I was striking out with the supervised clustering, I went with an unsupervised approach. I intentionally focused on the cookies because that would allow me to bridge over and check the drawbridge id against the devices.
When I plotted the first 2 scores, the lower-bottom cluster just jumped out, meaning that there is something in the data (columns anonymous_5-7) that is not acting like the rest of the population. I was curious, so I looked closely at anonymous_7, and it appears that the values in the lower-left cluster all have 0 in that position. Very different from the others. There might be something else driving it, but that is the one that looked most prominent to me. Basically, you are tracking correctly with your first few statements.
I then got really curious about whether any of these points from the lower cluster were included in any of our submissions, and none of them were. To make things even more intriguing, I referenced the drawbridge ids against the validation set, and 42 of them were in the validation set.
So depending on how we created our training/validation splits, we might have missed this little cluster at the bottom. It is also possible that the current model has actually discarded that entire split in the decision tree because it is so different from the rest of the population. I suspect that since this little cluster is so highly segregated, we could use it to build another model and try to squeak out a few more points.
I am not sure if these are claimed by devices in the train set; however, what would be interesting is if any of you guys have the latest XGBoost model up, to run the index (i.e. row number) through it and see what probability is returned.
Did that make sense?
Rob
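For the probability check Rob asks about, something along these lines would work, assuming the first-level booster was saved to disk and a scored feature matrix is available as a csv (all names are assumptions):

```python
import pandas as pd
import xgboost as xgb

bst = xgb.Booster(model_file="xgb_first_level.model")   # saved first-level model (name assumed)
val = pd.read_csv("val_features.csv")                   # feature matrix for the rows to score (assumed)

rows = [0, 1, 2]   # replace with the index positions from the uploaded cluster csv
preds = bst.predict(xgb.DMatrix(val.iloc[rows].values))
print(list(zip(rows, preds)))
```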
Thanks Mark and Rob.
Rob, the cluster is so different that if we could include those points in our submissions, I think it would be very useful.
Today I am stuck with some unexpected personal work until evening, after which I need to travel back to Chennai tonight, so I don't think I can do anything today. I am extremely sorry. I know it is the most important time in the competition, but I have no option. Very sorry, guys. Please accept my apologies!
So at the end of the day, we have to determine a match versus a non match. How do you see this cluster helping with that? How would we know which device they go to?
I think it makes sense from a linear perspective that PCA essentially didn't cover these records. The normal number space is nice and neat: 135-204, with every number covered. Volume per number is about 10k-16k until you get into the upper records. So PCA probably fit some fairly large coefficients to that feature, covering all but this 0.02%, and 0 doesn't play nicely in the linear world.
anonymous_7 | cookie count (or range)
---|---
0 | 596
135-143 | 10-22
144-182 | 10,000-16,000
183-199 | 19,000-40,000
200 | 48,530
201 | 66,343
202 | 116,380
203 | 213,460
204 | 802,620
They are also all 0 in anonymous_6, but that number space is about the same: 1-204, and the heaviest bucket there is 1, so our 0's are closer to the bulk. The distribution of anonymous_6 looks awfully suspicious. It's backwards from anonymous_7: 1 is where most records are. But the distribution is quite strange from 4-81: roughly every 7 values we get about 20,000 records, until 81, where the number space becomes continuous up to 204 again, but with decreasing volume.
Also, this cluster is all missing country (-1), but that's more common.
So now I'm intrigued by the meaning of 204. It seems like a familiar number.
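The counts above come from a straightforward tabulation; a sketch, with the file name as used earlier in the thread and the column layout assumed:

```python
import pandas as pd

cookies = pd.read_csv("cookie_basic.csv")  # layout assumed

# anonymous_7: 0 holds the 596 outliers, with the mass piling up toward 204
print(cookies["anonymous_7"].value_counts().sort_index())

# anonymous_6: reversed shape, heaviest bucket at 1, odd plateaus between 4 and 81
print(cookies["anonymous_6"].value_counts().sort_index())
```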
Mark and Sudalai,
Yeah, the clusters are strange. I can envision this cluster playing out in two ways now: 1) it truly is a unique and significant cluster that plays into the scoring; 2) it's a distractor meant to add noise to the model, which impacts performance.
Mark, you're right about the linear properties of PCA: it's trying to find a linear manifold that best explains all the data. Here we find several different linear manifolds, with that one unique one in the lower left.
Can you guys see the new folder I created that has a CSV file in it? One way to at least test this cluster is to remove the cluster values and rerun. I have marked all the rows that belong to that cluster.
Chaining a response... I am making the assumption that XGBoost returns a probability like logistic regression does... if not, that is me being naive about how the method works. One last thing: please don't let this be a distraction if you guys are solidly working on another alternative.
None of those cookies generated a hit in the ip table. So we have no connection between them and a device. Don't think it will be possible to use them, for good or bad.
Mark,
Maybe I am not able to follow what you did, but when I used the points within the cluster, I was able to find their IP associations by taking the csv file I uploaded and doing a join against id_all_ip.csv.
row | id_dict_key | dev_or_cook_indic | ip | freq_count | idxcip_c1 | idxcip_c2 | idxcip_c3 | idxcip_c4 | idxcip_c5 | index | drawbridge_handle
---|---|---|---|---|---|---|---|---|---|---|---
0 | id_85107 | 1 | [ip20288984] | [1] | [0] | [1] | [1] | [0] | [1] | 2099700 | handle_2038552
1 | id_4391583 | 1 | [ip3384628] | [1] | [0] | [1] | [1] | [1] | [1] | 1723700 | handle_250430
2 | id_4708279 | 1 | [ip21003511] | [1] | [0] | [0] | [1] | [1] | [0] | 1885200 | handle_38615
3 | id_3022375 | 1 | [ip14990805] | [1] | [0] | [0] | [1] | [0] | [0] | 1028300 | handle_1871932
4 | id_2154467 | 1 | [ip15247134] | [2] | [0] | [2] | [1] | [0] | [1] | 587390 | handle_490981
Sudalai or Mark, what are your thoughts on whether these points are being used for the hold-out set?
Rob
OK, I will check why it came up empty for me.
I got us one spot by creeping up a tiny, tiny bit using device:device IP matching for the id_10 set we had.
Mark
Awesome, and good job! Let me just upload the file to Git to save you some time... If I remember correctly, I stole some of Sudalai's code and modified it to unwrap the goofy nested tuples within the bag.
I've got the IPs; I think I aliased the wrong field originally. 596 cookies; 42 devices that share drawbridge IDs; 2,058 matches in the IP table, leading to 26,540 devices. So for training, we would have 42 positives and 26,498 negatives for a 0.15% match rate, which is about 1/20th the typical match rate. All things considered, it would seem reasonable to ignore a cookie when anonymous_7 shows up as 0.
library(data.table)
## Cookies with anonymous_7 == 0 (the lower-left cluster)
a7 <- cookies[anonymous_7 == 0, ]
setnames(a7, c("drawbridge_handle", "device_or_cookie_id", colnames(a7)[3:ncol(a7)]))
## IPs seen for those cookies, and devices sharing their drawbridge handle
badIps <- merge(a7, ips, by = "device_or_cookie_id")
badDevices <- merge(a7[drawbridge_handle != "-1", ], devices, by = "drawbridge_handle", allow.cartesian = TRUE)
## Devices that share an IP with any of those cookies
badIpDevices <- merge(badIps, deviceIps, by = "ip")
badIpDevices[, length(unique(device_id))]  ## 26540
Thoughts? Seems like the training is fine to me.
If nothing else, it seems like these might teach us which IPs to avoid. Whatever IPs are causing the IP:device explosion might be nice to ignore entirely.
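A sketch of one way to flag those explosive IPs, assuming id_all_ip.csv is in long format with an id, an ip, and a device/cookie indicator (the 0 = device encoding is a guess):

```python
import pandas as pd

ip_pairs = pd.read_csv("id_all_ip.csv")                      # column names assumed
device_ips = ip_pairs[ip_pairs["dev_or_cook_indic"] == 0]    # 0 taken to mean "device" (assumption)

# Distinct devices seen per IP; very large fan-out suggests a shared/cellular IP
fanout = device_ips.groupby("ip")["device_or_cookie_id"].nunique()

# Candidate IPs to drop when generating device:cookie candidates (threshold is arbitrary)
bad_ips = set(fanout[fanout > 1000].index)
print(len(bad_ips))
```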
That sounds reasonable. Squeak out another 1/100 of a point
I put in something simple that was related to what I had done previously. Slightly worse. It expanded the number of handles used, so it probably decreased the precision, but possibly increased recall, though not enough. It was a thin improvement, the first time, so not too reliable anyway.
What would be better would be to actually try to train it. I didn't have time this round, but I have the scripts, so I'll try and see how well that works, if I can. I'll also try to think about how I can get the data onto a real server. AWS didn't work on the first attempt, so I ditched it. But it would be nice to have bigger processing power. I hold the keys, but I can't figure out how to use them!
We are so close, we surely don't need a big idea. And we might even benefit from any small leaderboard shuffling, since we're at the bottom of the cluster.
Nonetheless, I'm always trying to find something I can understand rather than modeling harder. I will be spending as much time on this as I can manage and will post any ideas. Probably not worth following me on these last ones: sure to be a "worst of Mark's thoughts" type thing as I grasp at straws. But 4th place just looks so close!
Weird ideas I will chase to some degree:
- Is there any structure to the ordering of the drawbridge handles and/or device IDs and/or cookie IDs?
- What is up with those 0-204 columns? What is the relationship between the anon-6 and anon-7 columns? With one counting up and the other down, it just seems too tempting.
- Try to find unnatural pairs of things in the known positives; similar to where Rob is headed, I just want to take that one more step, look at the matches between cookie and device, and see if I spot anything our model might have trouble finding.

2 submissions left. Luckily, as of now, I think we have our "best" modeling in place at all parts of the process, so we don't really have to balance anything.
Mark
I'll try to continue hammering away at different ways to look at the data. I'm going to play with the train file to see if there is another natural grouping.
I can also help get us spun up on an AWS EC2 instance. I also have near-gigabit speed at home.
I'm wondering if we can dump all of these bottom clusters in as one large device id:cookie id group?
Rob
Thanks Rob and Mark!
I will also try to hammer the data and see if there are ways to improve our score further.
Good news from me. I headed in to the office and figured out my problem: I needed sftp instead of ftp. So we'll have a nice machine to use for tonight. I'll try to run things on 200/30 when the data and scripts are all available.
Then I'll let that work and will start looking for mysterious crop patterns in our data.
That is awesome news, Mark :) If we could get that done, I am sure we will improve by a few more places. Let the machine learn, and we shall try to find some patterns :)
I hope so. Here is the order in which I'm going to process the files. I'll grab the ensemble code and run it, but this will get me to the point I am currently at if everything goes well. If you get a chance to check this, that would be great, but I'm sure I'll get it working.
cd SRK
date
python splitDevValData.py
date
python get_DV.py
cd ../sub12
date
python getIDV.py - Code to create the initial set of variables.
date
python getIDV_IP.py - This is to produce intermediate files to form IP related variables
date
python getIDV_IP2.py - This is to create variables related to IP for dev, val and test
date
python doSampleAndMerge.py - This is to sample the dev file and reduce its size and then merge the variable csv files. It also merges the variable csv files of val and test sample.
date
python buildModel_xgb.py
date
python getMetric.py
cd ../SRK/SecondLevelModel
date
python dataPrep_secondModel.py - This will create the data for the second level model using predictions and drawbridge handle
date
## I think this is where I need to manually break up the dev/val to match my files
python secondModel.py - This will build an xgb model on the new data after doing a dev val split
date
python getMetric_secondModel.py - This will get the error metrics.
## "New val sample" F0.5 score is ~0.8508 using the above method
Thanks Mark. This is perfect. One minor thing: when running buildModel_xgb.py, use the following params
params = {}
params["objective"] = "binary:logistic"
params["eta"] = 0.15
params["min_child_weight"] = 30
params["subsample"] = 0.7
params["colsample_bytree"] = 0.6
params["scale_pos_weight"] = 0.8
params["silent"] = 1
params["max_depth"] = 5
params["max_delta_step"]=2
params["seed"] = 0
params['eval_metric'] = "auc"
num_rounds = 550
This is slightly better than the original params. Thank you.
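For reference, a minimal sketch of how these params and num_rounds would plug into the training call in buildModel_xgb.py; the input files and the target column name are assumptions:

```python
import pandas as pd
import xgboost as xgb

# params and num_rounds as defined in the message above
train = pd.read_csv("dev_features_merged.csv")   # output of doSampleAndMerge.py (name assumed)
val = pd.read_csv("val_features_merged.csv")     # name assumed

dtrain = xgb.DMatrix(train.drop("target", axis=1).values, label=train["target"].values)
dval = xgb.DMatrix(val.drop("target", axis=1).values, label=val["target"].values)

model = xgb.train(params, dtrain, num_rounds, evals=[(dtrain, "train"), (dval, "val")])
val_preds = model.predict(dval)
```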
OK, I will. And the part that computes the metric the first of the two times is redundant, but I'd like to see how it does as a sanity check.
Yes.. As you said, it won't hurt, and it won't take much time either :)
Interesting. So I observed these changes:
Does that seem right?
Yes perfect.. learn slow but for long :)
Sounds good. That part will fly, as this machine has something like 32 cores. I'll post anything interesting in here, but hopefully it's just normal stuff for a while.
Yeah, sure. Thank you!
Mark,
We might need to tweak the second level model in the new run, I think. In secondModel.py, we made a dev-val split based on the rows of the current val sample. Since we use a different cut-off in the new run, the number of rows in val will increase, so we have to change the dev-val split values in the secondModel.py file to get a better model.
Ah, OK. To make sure I understand, you're saying we should replace 1182988 with a number that matches the new size a little better: something like 80% of the new data size.
yeah you are right. Sorry for not being clear.
Alternatively, we can get the row number that corresponds to the device id at which we did the split this time (id_4388631), and use that number for splitting in the new run.
This way we could use the same files: val_develop_DV.csv and val_validate_DV.csv.
Please add these files to the cloud machine as well.
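A sketch of that second option: locate the row at which id_4388631 first appears in the new val file and split there, so the same develop/validate files line up (file and column names assumed):

```python
import pandas as pd

val = pd.read_csv("val_DV.csv")   # the new, larger val sample (name assumed)

# First row at which the split device appears; everything before it goes to develop
split_row = int(val.index[val["device_id"] == "id_4388631"][0])   # column name assumed

val.iloc[:split_row].to_csv("val_develop_DV.csv", index=False)
val.iloc[split_row:].to_csv("val_validate_DV.csv", index=False)
```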
Ok. I might try both and see how the various F0.5 works out. If close, we'll opt for the one that minimizes posting.
We still have a while for anything to be output. It's on the first part of the second block still. Single-threaded on the data prep, so not much faster than our i7s.
Yes. I think it will finish late since it is just a single thread :(
One more learning for me: start coding with multi-threading options, at least in competitions with bigger datasets.
In this case, coding in parallel is tough, since it means an already large memory footprint will likely get even larger unless you can parallelize at a fine-grained level. I think we'll still have time. The delays will be when it stops and I am asleep. I need to go and make sure this thing has XGBoost installed while it's running.
Our parallel processing is just in development right now ;-) We can use the data we have, and just hope that it scales to whatever this server leaves us with when it's ready.
Speaking of... Rob, are you able to test your ideas all the way through to see if they impact the overall score?
I warned you about some crazy ideas, right? First one: cookie_anonymous_5 - device_anonymous_5, also noting that for some reason the anonymous_5 value of 52 should be treated as its own subset. The algorithm will already do that, so the subtraction is probably all that is necessary. Seems to have a nice descending trend to me. I'd test it locally, but I blew up all the files I need to run models. The server is still cranking away. Making progress, but it's a long road.
Yea. So can we include them in the model build as a new column?
Hmm.. let us wait and see :) Hopefully the code will finish by your morning.
Could you please post the val_predictions.csv and test_predictions.csv files once our first level of model building is done? It will be helpful. Thank you.
Yes, I will post those.
Yeah, it wouldn't hurt to try it out as a new column, run XGBoost, and see if we get any improvement. I think it can be calculated from the raw prediction input files. The point is that if that subtraction really is useful, it will take a tree model (or a linear one) forever to do all the interactions necessary to capture it, since there are 200 values on each side. Maybe this feature will make XGB more efficient and allow it to capture some sparser combinations.
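A sketch of the feature as a new column, assuming the candidate-pair file already carries both sides' anonymous_5 values under these (assumed) column names:

```python
import pandas as pd

pairs = pd.read_csv("dev_features_merged.csv")   # device:cookie candidate pairs (name assumed)

# Mark's proposed feature: signed difference of the two anonymous_5 values
pairs["anon5_diff"] = pairs["cookie_anonymous_5"] - pairs["device_anonymous_5"]

# Optional flag for the odd anonymous_5 == 52 group he mentions (which side it refers to is unclear)
pairs["anon5_is_52"] = (pairs["device_anonymous_5"] == 52).astype(int)
```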
Yes, very true. It is very tough for XGB or any other tree-based model to capture this effect. Nice work on identifying it :)
Only nice if it helps in validation! I'm exporting what I can test and will run it in H2O and see what it thinks with and without it.
Let us wait and see then!
Importance: 0.00000%
Oooops!! So GBM is good at capturing these effects as well?! That is interesting.
Either that or it's just correlated with something we already have.
hmmm yes. second one seems more plausible.
Mark and Sudalai,
No, I didn't test anything to check the overall score. I was a bit hesitant to run anything beyond finding the clusters... I wasn't sure what you guys were thinking in terms of how to best utilize this information (e.g., modify training/testing, build a new model for it alone, etc.).
Rob, just for fun, can you change those 596 anonymous_7 values to 140 and see what happens on the plot?
I realized I do have a reasonable "level 2" set to work with. I added a few more features and have been testing it through H2O. I'm getting 0.9986 pretty consistently, and it prefers the new feature I created, though it likes the previous calculation second. It's interesting to notice that the actual prediction value itself is way down the list:
variable | percentage |
---|---|
sumDevHandle | 0.9058 |
diff_from_max_prediction_handle | 0.0786 |
maxDevHandle | 0.0063 |
mean_prediction_handle | 0.003 |
maxHandle | 0.0024 |
sumHandle | 0.0012 |
cHandle | 0.0007 |
count_handle | 0.0007 |
prediction | 0.0003 |
The first feature is the sum for the handle within that device, and similar for the third. The 5th and 6th are the rates for the handle across all devices. Prediction is the raw prediction from the first-level learners.
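A sketch of how those aggregates might be built from the first-level predictions; the column names and the exact definitions (especially diff_from_max_prediction_handle) are guesses:

```python
import pandas as pd

preds = pd.read_csv("val_predictions.csv")   # one row per device:cookie candidate (layout assumed)

# Per handle within a device: sumDevHandle, maxDevHandle
by_dev_handle = preds.groupby(["device_id", "drawbridge_handle"])["prediction"]
preds["sumDevHandle"] = by_dev_handle.transform("sum")
preds["maxDevHandle"] = by_dev_handle.transform("max")

# Per handle across all devices: sumHandle, maxHandle, count_handle
by_handle = preds.groupby("drawbridge_handle")["prediction"]
preds["sumHandle"] = by_handle.transform("sum")
preds["maxHandle"] = by_handle.transform("max")
preds["count_handle"] = by_handle.transform("count")

# Guessed definition: how far this pair's prediction sits below the handle's best within the device
preds["diff_from_max_prediction_handle"] = preds["maxDevHandle"] - preds["prediction"]
```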
New thread for our final efforts.