mlandry22 / icdm-2015

ICDM 2015 Kaggle Competition

New Idea List #4

Closed mlandry22 closed 9 years ago

mlandry22 commented 9 years ago

Thread to capture ideas of how we can improve our score. Sudalai has been doing great work implementing some great features. How should we focus our future work to get our score into the top few?

Ideally, this will be like a brainstorm. One post per idea, and maybe some discussion or something. But in its best form, it would become a list to constantly scan over and see what ideas we might want to try.

mlandry22 commented 9 years ago

Idea: Ensembling Impact: Moderate to Low (0.01 - 0.05?)

I think it's best to keep moving ahead with one model. But we all know that whenever we get stuck, most people win competitions with ensembles. So far XGBoost looks good. But perhaps try some highly tuned H2O models, deep learning, whatever. The nice thing about classification where thresholds matter is that it lends itself to voting or consensus techniques. If we had five different models, we could require that three of them vote YES before using a cookie, or whatever decision rule best optimized our F0.5 score.
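Something like this minimal sketch, assuming five probability vectors over the same device:cookie rows (numbers made up; the 0.5 vote threshold and the 3-of-5 rule would be tuned against F0.5 on validation):

    import numpy as np

    # Hypothetical probabilities from five models for the same three rows.
    preds = np.array([
        [0.91, 0.12, 0.55],
        [0.88, 0.30, 0.61],
        [0.95, 0.08, 0.48],
        [0.70, 0.22, 0.66],
        [0.85, 0.15, 0.40],
    ])

    votes = (preds > 0.5).sum(axis=0)  # YES votes per row
    keep = votes >= 3                  # require 3-of-5 consensus
    print(keep)                        # [ True False  True]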

mlandry22 commented 9 years ago

Idea: Latent Property Matching Impact: Potentially High (0.1)

Our first pass at similarity of websites did not seem to offer much. That first pass could be wrong. But if not, it would seem we need something better than simply matching properties together.

Latent similarity means figuring out which pairs of sites are often visited together and treating those like a single site. It's a huge matrix to run SVD or something similar on, but that's how it would be done. The lsa package in R, together with the Matrix package, would allow a fairly efficient sparse-matrix implementation.
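For reference, an equivalent of the R lsa/Matrix route, sketched in Python with scipy (toy matrix; in practice the rows would be devices/cookies and the columns the full property vocabulary):

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.linalg import svds

    # Toy visitor x site visit counts as a sparse matrix.
    rows = np.array([0, 0, 1, 1, 2, 2, 3])
    cols = np.array([0, 1, 0, 1, 2, 3, 3])
    vals = np.array([3.0, 1.0, 2.0, 2.0, 5.0, 1.0, 4.0])
    m = csr_matrix((vals, (rows, cols)), shape=(4, 4))

    # Truncated SVD: sites visited by the same users end up close
    # together in the k-dimensional latent space.
    u, s, vt = svds(m, k=2)
    site_factors = vt.T * s      # one latent vector per site
    print(site_factors.shape)    # (4, 2)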

mlandry22 commented 9 years ago

Idea: Property Category Similarity Impact: Potentially High (0.1)

Again, the first pass at website similarity was not great. We are provided a 0:N mapping of properties to categories. Like latent methods, this may help combine websites into real-world categories, just as the file is designed to do. A big advantage is that we can convert things into a dense form that makes it easier to compute metrics later. Rather than performing similarity on demand by looping through all possible device:cookie combinations, a single pass over the properties file (or better, a pared-down version that discards cookies/devices we don't care about) merged with categories can create a matrix that can be reused often and simply indexed into on demand. That is far more efficient.

Now that I write this out....I want to do this, so I'll take it as my next course of action. Rough implementation (a code sketch follows the list):

  • Get main matrix
    • Index id_property by device_or_cookie_id
    • Get the list of unique devices and cookies that appear in our dev, validation, or test files
    • Create static columns for each of the 443 categories
    • Populate those columns with the frequencies in the id_properties file
    • Save results as CSV
  • Calculate similarity
    • For dev/val/test files, loop through device:cookie pairs
    • Index into the main dense matrix to compute a simple cosine distance or similarly efficient matrix calculation
    • If efficient to do this, try a version that uses 1/0 and one that uses frequency (or logged frequency)
  • Hope for the best
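A rough sketch of the main-matrix and similarity steps in pandas (the long-format layout and the column names here are assumptions about how id_property joins to categories):

    import numpy as np
    import pandas as pd

    # Hypothetical long-format table: one row per (id, category) with a frequency.
    props = pd.DataFrame({
        "device_or_cookie_id": ["id_A", "id_A", "id_B", "id_B"],
        "category": ["cat_1", "cat_2", "cat_2", "cat_3"],
        "freq": [5, 1, 2, 7],
    })

    # Pivot to the dense id x category matrix that gets saved and reused.
    mat = props.pivot_table(index="device_or_cookie_id", columns="category",
                            values="freq", fill_value=0)

    def cosine(a, b):
        # NA when either side has no category data at all.
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return np.nan if denom == 0 else np.dot(a, b) / denom

    print(cosine(mat.loc["id_A"].values, mat.loc["id_B"].values))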

CarbonCycles commented 9 years ago

Mark,

Okay, I do see the website similarity...did you have a chance to see if this played out...multiple race paths.

Rob


mlandry22 commented 9 years ago

The website one certainly did not. There are a bunch of tables to that effect in the other thread.

As for trying it on categories, I haven't yet had a chance to plug it into our data sets. But that's the easy part; I just need some time to get that in.

SudalaiRajkumar commented 9 years ago

Mark, please let us know once you plug in the category information and run the models. Eagerly looking forward to seeing the improvement in results :)

SudalaiRajkumar commented 9 years ago

I came across this forum post, and people are suggesting something very similar to what we did. https://www.kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/forums/t/14950/prediction-as-binary-classification

As of now, we are using just the IP-based rule for selecting cookies. Should we also think about some other hand-made rules to reject some of these selected cookies? That would help reduce the test set size further, and thereby we may have a chance of increasing our precision. Just a thought to ponder.

Also this post: https://www.kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/forums/t/15877/legal-or-not/89117#post89117 It seems this guy (and maybe other leaders?) found a way to reverse engineer the data creation part. May or may not be true. Just thought of letting you guys know as well.

mlandry22 commented 9 years ago

Working on it now. I'm using my "Val" set and everything seems to be going in fine. Very early results didn't look great. Of the first devices I had, one's correct answer had an NA in the category data (hopefully because it had no entries), and another's correct answer was almost the furthest away. Others looked in the middle. But that was just the first couple hundred records, so we'll see what a larger sample looks like shortly. I have 1.43M device:cookie combinations in my val sample.

What would help is if somebody can manually check that these results are plausible. It will be cumbersome to crosswalk IPs and categories, but if we can be sure the similarity method I'm using is correct, that would help us decide the next course of action if it doesn't turn out helpful.

So here are the device:cookie combinations that would be helpful to double-check:

Here are the results of my function:

device_id cookie_id target similarity
id_1000068 id_1826837 0 0.0420082828742723
id_1000068 id_154107 0 0.0498923156036742
id_1000068 id_2876982 1 NA
id_1000068 id_2804227 0 NA
id_1000068 id_2832355 1 0.606453019416486
id_1000068 id_847788 0 0.0404198919031752
id_1000068 id_1716499 0 NA
id_100034 id_769531 0 0.143977017973322
id_100034 id_2091368 1 0.209220241517009

This time the scores are distances, so if these scores are right, we should see the first two cookies as very similar to the device (low distance); the third and fourth unavailable; the fifth not very similar; the sixth again very similar; and so forth. Note that two devices are given, so the comparisons for the last two records are against a different device than the first seven.

If that is too much, validating that the NAs are correct would be useful as well. You can see from these numbers why I might be a bit worried. An NA and the value with the worst distance are the correct answers for the first device, and the farther of the two is correct for the second. The next few are fairly similar. Hopefully these first few are not indicative of the rest.

CarbonCycles commented 9 years ago

I can help tonight.

I'll touch base after dinner.

Rob


mlandry22 commented 9 years ago

Cool. I wrote an inefficient loop to get my similarities out and nearly stopped it, but decided to let it run until I could do something better, and now it's been running all day. Surely I have enough similarity calculations out by now to see whether this thing is useful.

mlandry22 commented 9 years ago

Similarity just ranked last on my random forest importance list. WTF? Need some help, I guess?

In case I'm not screwing up the calculations, what ideas do we still have to improve? With the amount of energy put into this, I can't believe the top 10% would be so hard to crack. Again, Sudalai has done great things to get the python features in order. I've now taken a stab at two ways of doing websites, directly and indirectly.

Rob, if you can validate what I'm doing, that would be great. Not the code yet (which might be wrong, I don't know), just the simple human observations for those test devices. It's not as easy as it looks: you have to get all the IPs, then all the categories. But with just 9 records, hopefully it's easy enough.

mlandry22 commented 9 years ago

I will also validate, and I'll go ahead and try to traverse all known data for those devices. So Rob, if you can hunt down the properties and categories, that will be great for confirming both my code results and also my analysis methods.

mlandry22 commented 9 years ago

I'm bad about chaining multiple comments together. So this is the fourth comment in a row, in case you guys see the emails. I wanted to note that I used a quarter million device:cookie combinations, so the chance that this is a bad sample is fairly low. Bad code? Maybe. Believable results? Yes.

CarbonCycles commented 9 years ago

Ok just got in. Went out since the wife is leaving this weekend w the kids. I'll jump on in a bit.


CarbonCycles commented 9 years ago

Mark,

Okay, I've got things up. Just to make sure I understand what you want...you want the IP addresses joined with the categories.

To do this, I was going to use the id_all_ip.csv and do a merge with id_all_property.csv.

That should at least associate the ip with category via common device/cookie id...did you have something else in mind?

Rob


CarbonCycles commented 9 years ago

Mark,

Sorry I didn't respond back at lunch....couldn't make it home. I did something a bit hokey, but it looked interesting (as I was also trying to get your information).

I got the following relationship; I see both of these cookies listed above. It wasn't too hard following these two down. I will zip up and send the three files I used to get these definitive two. Proceed as follows:

File "id_1000068.csv" gives you the device id. File "df_dev_train_device_id" gives you the drawbridge handle by referencing device id File "cookie.csv" then finally gives you the cookie_id.

The files I have zipped up include all data from all columns...nothing has been removed...you might be able to see some more information there. I'm going to get some dinner but will be working on this throughout...will look at the other cookie ids you found.

id_dict_key drawbridge_handle cookie_id
id_1000068  handle_1644226    id_2832355
id_1000068  handle_1644226    id_2876982


mlandry22 commented 9 years ago

Rob, do the files you are planning to send include the property and/or category overlap? That's the specific thing I'm interested in testing. Just asking because I don't see a reference to it.

In the end, I'm mostly interested in a human opinion that, given the property/category overlap, the similarity scores in my post above seem reasonable. If they are reasonable, then the concept of comparing properties and categories is essentially dead--unless I did something wrong. That's what I'm asking.

I can make it easier by posting more midstream processing, such as the 430 columns of categories for each of the devices/cookies. That breaks the task in two: (1) are these numbers correct, given the data you all have; (2) given these numbers, is a similarity score of S reasonable?

SudalaiRajkumar commented 9 years ago

Sorry for breaking the flow here.

One way we could improve our score by a small margin (max ~0.02, I guess) is as follows: there is a variable "val_test_shared_ip_cutoff" in the file "getIDV.py" of sub12, currently set to 30. If the number of cookies on a given common IP is more than 30, then we do not consider those cookies for our score computation. This hurts recall, since the cutoff filters out some of the potential cookies. We can see that we have about 703 device_ids in our best submission (sub15.csv) that have no cookies matched (cookie_id="id_10"). So by increasing this cutoff, we could potentially find cookies for some of these devices.

So if we raise this cutoff to a higher value (say 200 or more) and re-run the models, we might end up scoring 0.02 higher than our current score, which might move us up 3-4 places again.

I couldn't run this, as it will need more RAM and disk space and I am running out of both :( Would it be possible for one of you to run this version after changing the cutoff? We will need to use the new "getMetric_new.py" from the SRK/sub15 folder for final scoring. Apologies for the inconvenience.
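My reading of the mechanism, as a sketch (the real logic lives in getIDV.py; the function and dict names here are made up):

    val_test_shared_ip_cutoff = 200  # was 30 in sub12

    def candidate_cookies(device_ips, ip_to_cookies):
        """Collect candidate cookies from a device's IPs, skipping very
        busy IPs whose cookie count exceeds the cutoff."""
        candidates = set()
        for ip in device_ips:
            cookies = ip_to_cookies.get(ip, [])
            if len(cookies) > val_test_shared_ip_cutoff:
                continue  # raising the cutoff keeps more of these busy IPs
            candidates.update(cookies)
        return candidates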

mlandry22 commented 9 years ago

Running it at 200 right now. I changed both flags that were 30: shared_ip_cutoff and val_test_shared_ip_cutoff. Does that sound OK?

I can analyze which of the 703 devices this improves and post a CSV to this GitHub with the direct answers for those devices, so you both can use what I'm doing in the future.

CarbonCycles commented 9 years ago

Yes, that would be useful. My naive clustering was wandering, but providing it some hints may help.

I'm definitely holding my breath to see what happens in today's submissions.

This is definitely fun stuff!


SudalaiRajkumar commented 9 years ago

Thanks Mark. Yes that sounds ok.

I thought that since we already have a huge training set, it is enough to change just val_test_shared_ip_cutoff. I think this will create a huge dev_all_rows file and an even bigger intermediate file when we run getIDV_IP.py.

If it creates a huge dev file that takes too long to process, please feel free to do the sampling first and then create the IP variables, or alternatively change the cutoff to a smaller value :)

mlandry22 commented 9 years ago

OK, the prep just got done on the smaller 30/200 split. And running the next step in line maxed out the small SSD on this system, so I'll clean up some data and rerun.

mlandry22 commented 9 years ago

In cleaning up, this seems odd:
val_ip_vars_inter.csv: 34.4GB
dev_ip_vars_inter.csv: 14.2GB

Dev might have cut off early, but usually it gets processed first (haven't looked). Maybe that's the impact of having dev at 30 and val/test at 200? I'll have to move this up to a bigger server tonight, I think, like I did with the category processing. More memory, more storage (~90GB usable on this SSD), and more processors.

For today's submissions--since I usually wind up too late to hit deadlines--I'm going to write a method in R that takes the existing predictions I have and does the 95/5 method I mentioned. That will be one submission, simply to see if it's better or worse than before. Then, if I have enough time, I'll try a simple IP check on those 700+ missing devices. No cookies guarantees 0 unless id_10 is correct; using all cookies matching those IPs is likely greater than 0 for some, so it's guaranteed to be the same or better. And if there's more time than that, perhaps find a quick and dirty way to pick a few from that list.

mlandry22 commented 9 years ago

I'm going to take a step back and look around a bit. My next thoughts were based on thinking the model wasn't handling devices with many cookies very well, but it really handles those just as well or slightly better than those with 5 or 6 cookies, per the table and graph we looked at last night. If I'm going to recompute some heavy stuff on the server, I'll see if I can add some more things while I'm there. Something I'd really like, to help with slicing afterwards, is the cookie count per device, so I'll see if I can get some python written to do that.
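That count is a one-liner over the candidate-pair rows, something like this sketch (file and column names assumed):

    import pandas as pd

    pairs = pd.read_csv("val_candidate_pairs.csv")  # hypothetical file name

    # One value per row: how many candidate cookies its device has,
    # so results can be sliced by candidate-set size afterwards.
    pairs["cookies_per_device"] = (
        pairs.groupby("device_id")["cookie_id"].transform("nunique")
    )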

On one hand, it seems like we've put a ton of energy into this, and 23rd/344 seems fairly low.

On the other, the teams in 12th and 16th have some great Kagglers and aren't that far ahead of us. The guy in 2nd was in 2nd for a while in the rain competition, and is drawn to these non-straightforward problems like I am. So maybe there isn't some big secret we're missing, and grinding it out improvement by improvement is the way to do it. Here is how the distribution looks from just below us to the top. To me, other than the odd cluster at 85*, it suggests you simply move up a few decimal places at a time.

F0.5 Number MinRank Notes
70* 1 28
71* 0
72* 2 27
73* 0
74* 2 25
75* 3 22 We are here
76* 2 20
77* 1 19
78* 2 17
79* 3 14 gilberto/lars here
80* 0
81* 2 12 team with #22, #31,#36,#49 ranked Kagglers here
82* 0
83* 1 11
84* 1 10
85* 6 4 seems strangely popular, no?
86* 0
87* 2 2
88* 0
89* 1 1
mlandry22 commented 9 years ago

First thought: we have some nice aggregate statistics about IPs by device and cookie. But I think we don't have the C1-C5 values of the IP data compared directly between the device and the cookie for the IP they share.

I will try a few thousand and see if it is worth pursuing. The idea is that the value of these columns would be different for the same device, based on which cookie is being compared. And ideally, we're looking for matches, but anything might help.

CarbonCycles commented 9 years ago

Mark,

I can help write some code for the cookie count per device...do you have a data file I can iterate over?


SudalaiRajkumar commented 9 years ago

Nice thoughts Mark.

"We have some nice aggregate statistics about IPs by device and cookie. But I think we don't have the C1-C5 values of the IP data compared directly between the device and cookie for the IP that they share. " - This is definitely worth trying.

In the meantime, after seeing the graphs from both of you, I tried building a separate model only for those devices with cookie counts in the 2 to 10 range. But it did not add any improvement over our base model. Our base model itself is doing the best for all cookie ranges. So I think we have to find new variables that can discriminate those cookie-device pairs.

mlandry22 commented 9 years ago

Rob, it's the results of the IP match. So all three of Sudalai's output files will have it, and you can use that as a basis. I.e., it can be treated as a post-processing of the original file. Or you can look at the code he has toward the end and add something. There is a cookie count, but I see it listed as 1 for the device I've been using over and over, where the number I'd like to see is 7.

Sudalai, I'm not too surprised. Performance is a little worse in the middle range, but it's still fairly stable and never really bad. I love segmented models, but GBM is often too smart to need multiple models (I'm finding Caterpillar to be similar)!

By the way, the IP analysis is really fast with an indexed data.table, so I should be able to use this fairly quickly, I hope. It is interesting how much the values vary for the same device across different IPs. I haven't seen anything useful yet, but I shouldn't have any trouble getting these features working tonight.

SudalaiRajkumar commented 9 years ago

Thanks Mark. Hoping to see some improvement based on these new variables.

Yes. GBM is often too smart to need multiple models. I have found this phenomenon in tree-based models in general.

I started looking at the property and category files to do some clustering and get some useful features out of them, but I'm kind of stuck thinking about which features would be useful. I will try my best to find a way.

mlandry22 commented 9 years ago

I was wrong--the common IP variables are already captured. They're averages, but that really amounts to the same thing in most cases, since most pairs share an IP only once. So the range of features common_ip_device_c1_sum : common_ip_cookie_c5_sum does what I was mainly wanting.

Next idea, I guess. As always, nice feature Sudalai!

mlandry22 commented 9 years ago

A second-layer model might make some sense here. The reason is that we can give the second-layer model something the first doesn't know: the range of predictions per device. The first model would assess gross probabilities, the way we do now. The second one would take those probabilities plus the original feature matrix, and add some new features based on the range of probabilities from the first. So if a particular country or OS was more likely to have multiple cookies, we could use that information to our advantage. That signal is already slightly in the first model, but a second model would have much more power to find it if given a lot of relative-probability features per device.

So in my favorite test case (id_1000068), a second model would understand that for that device, we had four high predictions, three of them really high and close to each other. We'd have the raw probability, the value relative to the max, and perhaps the ranking as well. Then perhaps it can use that information to interact with some of the original features.

I won't try such a model right now, but it's perfectly self-contained, so anybody could pick it up and try it on the existing validation set. We'd probably want more validation data, but we could see if there's any advantage with the validation set we have now, which is fairly large.
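A self-contained sketch of those per-device relative-probability features (column names are placeholders; prob would be the first-layer model's output):

    import pandas as pd

    df = pd.DataFrame({
        "device_id": ["d1", "d1", "d1", "d2", "d2"],
        "cookie_id": ["c1", "c2", "c3", "c4", "c5"],
        "prob": [0.95, 0.93, 0.10, 0.40, 0.35],
    })

    g = df.groupby("device_id")["prob"]
    df["prob_max"] = g.transform("max")
    df["prob_rel_max"] = df["prob"] / df["prob_max"]           # relative to device max
    df["prob_rank"] = g.rank(ascending=False, method="first")  # rank within device

    # These columns, plus the original feature matrix, feed the second model.
    print(df)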

SudalaiRajkumar commented 9 years ago

I tried a second-stage model after creating some variables based on the predicted probabilities: number of cookies associated with the device, min prob score, max prob score, mean score, and difference from the max score.

I then added our old variables and ran an xgb model. The new val-sample AUC improved from 0.9946 to 0.9952. But when I computed the F0.5 score, it was still the same :( I have placed the code in the folder SRK/SecondLevelModel. Please have a look and let me know if this is what you meant.

mlandry22 commented 9 years ago

Thanks for trying and posting. That sounds exactly like what I thought might help. But it didn't. The method might still benefit from an initial level of thresholding--even if the ranks didn't change much, we got more resolution/precision for deciding whether to use particular cookies. But with such a small AUC bump and no F0.5 gain, it's probably not worth a whole lot in the end no matter what. Not the coveted 0.1.

I experimented with a few new features, and the best was to divide each record's cookie_c3_common_ip_by_all_ip by the device's max for that value. Most results are 1, but some are not, so it scales them up. My H2O model liked it more, but it's still a step behind XGBoost. So I'll try dumping out my data set and using the XGB model to create a submission to see if it helps. I'll only submit if it shows promise in F0.5 for validation, since this is a feature that is quite likely to help only AUC or overall F0.5, but not the per-device F0.5.
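The transform itself is a one-liner with a grouped max, something like this sketch (the guard against a zero max is my addition):

    import pandas as pd

    df = pd.DataFrame({
        "device_id": ["d1", "d1", "d2", "d2"],
        "cookie_c3_common_ip_by_all_ip": [0.5, 1.0, 0.2, 0.2],
    })

    # Divide each record by its device's max; most results are 1 and the
    # rest get scaled up toward it. replace(0, 1) avoids divide-by-zero.
    dev_max = df.groupby("device_id")["cookie_c3_common_ip_by_all_ip"].transform("max")
    df["c3_by_device_max"] = df["cookie_c3_common_ip_by_all_ip"] / dev_max.replace(0, 1)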

mlandry22 commented 9 years ago

Yes, that second-level model looks just like what I was thinking. Good features (in R!) and a new splitting of that validation set. With enough of these ideas being thrown out, we're bound to find something, right? :-)

SudalaiRajkumar commented 9 years ago

Yes. I started learning R from you guys :)

Yup. Now that we have rejected most of the plausible ideas, we should be finding the right one soon!

I am just wondering what other guys at the top are doing.. A score above 0.89 means we need to have very high precision across all cookie ranges.. Just thinking out loud.. We have tried out a variety of features.. We have tried out various models.. We have tried out second level modeling.. Are we missing something simple here? Are they doing something different to get the initial set of cookies instead of IP match? Are they using some excellent variables which discriminate the cookies better?

P.S: We will move up by one place in the final ranking.. This guy (https://www.kaggle.com/rangohu) is a fake account. Created 6 days ago and the best submission on the same day. Quite impossible.. Someone whose score is above 0.78 created this account I guess !!

CarbonCycles commented 9 years ago

Okay guys...this is really bugging me. Maybe I'm just really slow (i.e. stupid), but help me understand something here.

Goal of this kaggle is to link the device id to the cookie id...okay.

We are given some interesting data: device id, cookies, ip, categories, property, anonymous fields, drawbridge id, etc, which are mostly categorical in nature.

Now, there seems to be a jump in logic where you have to take a combination of these fields, generate features from them, and then establish the link between device id and cookie.

This is obviously not a regression problem, but it's being solved from a classification POV. That kind of bothers me, because I also see this as a clustering/commonality problem in addition to a classification one. For example, for some of the device id/cookie pairs we can literally trace end-to-end via the drawbridge id. However, for others there is no direct way. Instead, we have to create "threads". A thread is composed of a set of features such as IP, property, anonymous feature, OS, etc.; matching common threads is the similarity (classification?) part, whereas finding the threads themselves is clustering.

Once you cluster the common threads, you should be able to use the training/testing sets to validate, because the assumption is that at least one of the threads will have a drawbridge ID within it that you can train/test off of.

This approach is a bit different from what we've been attempting with XGBoost. Which, btw--have you guys seen this blog? http://auduno.com/post/96084011658/some-nice-ml-libraries

I've been attempting the first part by defining a "thread" and then clustering by thread...but I'm chasing my tail. Wtf..am I just being overly dense?!?

Mark, I'll work on the code for the counts as you suggested last night..I put my head down and passed out before I knew it.


mlandry22 commented 9 years ago

Sudalai, I am thinking the same thing. I am amazed at how large the difference is between the top (0.89) and the field far below us, and yet how fairly linear the distribution is.

We remember the rain problem, where most teams at the top were clustered together, and there wasn't a huge difference between the next tier and the masses. In fact, I think my model just a few days in was good for 10th or 15th, and I got very little improvement after that.

We saw the same thing in the big CTR competitions. In Avazu, the top few were way beyond the rest of the field, and the top tier was a small but consistent amount better than almost everybody else, separated by the 6th decimal place.

Here, there really seems to be a huge difference in what the top teams are doing compared to us.

So I also go back to the same constructs. The properties and categories really seem like a big missing piece. I feel confident that I didn't do anything grossly wrong. However, I only tried two ideas.

Rob, will answer yours in a little bit.

CarbonCycles commented 9 years ago

Let me bounce an idea off you guys...I just re-remembered an old trick I used to do when I had to work with a large number of unwieldy data.

It's similar to the string hash trick you talked about with Vowpal Wabbit, Mark. I would concat all the categories into a large string array. You can either directly consume the string array by doing sorts, slices, and groupbys on it, or you can create a byte array or hex encoding. Once you do the encoding, you can run your clustering/similarity on it.

Did that make sense?
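In code, I'm picturing something like this sketch (using sklearn's FeatureHasher in place of hand-rolled byte/hex encoding; the category tokens are invented):

    from sklearn.feature_extraction import FeatureHasher

    # Each device/cookie becomes the bag of category tokens it was seen with;
    # hashing fixes the dimensionality regardless of vocabulary size.
    docs = [
        ["cat_12", "cat_7", "cat_7", "cat_301"],  # e.g. a device
        ["cat_7", "cat_44"],                      # e.g. a cookie
    ]
    h = FeatureHasher(n_features=64, input_type="string")
    X = h.transform(docs)  # sparse 2 x 64 matrix, ready for clustering/similarity
    print(X.shape)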


mlandry22 commented 9 years ago

> We are given some interesting data: device id, cookies, ip, categories, property, anonymous fields, drawbridge id, etc, which are mostly categorical in nature. Now, there seems to be this jump in logic where you have to take a combination of these fields, generate features from it and then establish the link between device id and cookie. This is obviously not a regression problem, but it's being solved in a classification POV. This kind of bothers me because I also see this as a clustering/commonality problem in addition to the classification.

Two big facts

I think where perhaps we can do more is understand why "cookie_c3_common_ip_by_all_ip" works as well as it does. A solution with that alone as a feature would be a really good model.

Rob, I've thought about doing that in chunks, like taking all the anonymous features as a set, to avoid making the GBM find the combinations itself (which it kinda does, kinda doesn't). I stopped when those features started to look like counts to me rather than factors. But I agree that any such thing might be worth trying, especially for things that look like categoricals. Most of ours are not--they're calculations. But I'm a fan of binning numerics and then treating the bins like a categorical. I'm not a huge fan of clustering, but hey, if you feel that can provide value, try it out.
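The binning idea in a couple of lines (pandas qcut; the bin count here is arbitrary):

    import pandas as pd

    s = pd.Series([0.01, 0.2, 0.35, 0.5, 0.77, 0.93])

    # Quantile-bin a calculated numeric and treat the bin id as a categorical.
    binned = pd.qcut(s, q=3, labels=False)
    print(binned.tolist())  # [0, 0, 1, 1, 2, 2]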

Disclaimer: Sudalai, Rob asked many questions toward the end of the rain competition. I tended to explain away many of them without trying them. One such question was "should we be throwing out outliers?" Had I entertained that in code, I suspect I would have improved our ranking by one or more places. But I answered (reasonably) that it wasn't needed because of how the data was being processed. Kinda true, but not enough, as it turned out.

Known gaps

mlandry22 commented 9 years ago

Reposting the direct correlation table below, but in a better format so it's easier to see the important ones. I was first curious what makes cookie_c3_common_ip_by_all_ip such a good feature. It seems similar to the others, and this correlation table shows that it's about the same as the other features of its type. So it probably isn't much different; it just gets at a concept that is interesting in a good way.

Two sides of that, right? One: assume the feature is doing its job. Two: is there a better way to drag out the information in there? I'm posting a snippet of the code that builds these features, just to refresh our memory of how they are created.

Correlation Field Root
0.509 cookie_c3_common_ip_by_all_ip cookie
0.499 cookie_freq_common_ip_by_all_ip cookie
0.496 cookie_c5_common_ip_by_all_ip cookie
0.48 cookie_c2_common_ip_by_all_ip cookie
0.458 cookie_c4_common_ip_by_all_ip cookie
0.447 cookie_c1_common_ip_by_all_ip cookie
0.441 ratio_cookie_count_by_num_ip_cookie cookie
0.437 common_ip_cookie_c3_avg cookie
0.408 device_freq_common_ip_by_all_ip device
0.396 device_c3_common_ip_by_all_ip device
0.385 common_ip_cookie_c4_avg cookie
0.383 common_ip_cookie_c5_avg cookie
0.358 common_ip_device_c3_avg device
0.347 device_c1_common_ip_by_all_ip device
0.345 device_c2_common_ip_by_all_ip device
0.344 device_c5_common_ip_by_all_ip device
-0.336 cell_ip_avg ip
0.325 device_c4_common_ip_by_all_ip device
0.319 common_ip_device_c4_avg device
0.317 common_ip_device_c5_avg device
-0.308 max_cookie_in_device_ip cookie
-0.303 min_cookie_in_ip cookie
-0.302 mean_cookie_in_ip cookie
-0.296 max_cookie_in_ip cookie
0.233 ratio_cookie_count_by_num_ip_device cookie
-0.232 mean_cookie_in_device_ip cookie
0.21 common_ip_device_c5_sum device
0.202 common_ip_device_c3_sum device
0.191 common_ip_cookie_freq_sum cookie
0.186 common_ip_cookie_freq_avg device
0.178 common_ip_device_c4_sum device
0.176 common_ip_cookie_c5_sum cookie
0.163 common_ip_cookie_c2_sum cookie
0.163 common_ip_cookie_c2_avg cookie
0.162 common_ip_cookie_c1_sum cookie
-0.16 mean_cookie_in_cookie_ip cookie
0.159 common_ip_cookie_c1_avg cookie
0.156 common_ip_cookie_c3_sum cookie
0.142 common_ip_device_freq_sum device
-0.14 max_cookie_in_cookie_ip cookie
0.139 common_ip_cookie_c4_sum cookie
-0.135 ip_c1_avg ip
0.133 common_ip_device_freq_avg device
-0.125 cookie_anonymous_5 cookie
0.125 common_ip_device_c1_sum device
0.121 ip_freq_avg ip
0.118 common_ip_device_c2_sum device
-0.116 device_anonymous_5 device
0.114 common_ip_device_c1_avg device
0.112 common_ip_device_c2_avg device
-0.108 ip_c2_avg ip
-0.104 num_ip_with_cookie cookie
0.101 ip_freq_sum ip
-0.094 device_anonymous_7 device
-0.089 cookie_anonymous_7 cookie
-0.084 cell_ip_rate_device ip
-0.083 num_ip_with_device device
-0.08 cell_ip_sum ip
0.076 cookie_country cookie
-0.064 ip_c0_avg ip
0.062 cookie_anonymous_c2 cookie
-0.059 cell_ip_rate_cookie ip
0.052 cookie_count cookie
0.051 min_cookie_in_device_ip cookie
0.047 cookie_anonymous_c1 cookie
0.04 cookie_anonymous_6 cookie
-0.039 ratio_num_ip_cookie_by_num_ip_device cookie
-0.036 min_cookie_in_cookie_ip cookie
0.03 device_type device
0.027 cookie_computer_browser_version cookie
0.025 cookie_computer_os_type cookie
-0.024 ip_c1_sum ip
-0.021 cookie_anonymous_c0 cookie
-0.021 ip_c2_sum ip
0.017 device_anonymous_6 device
-0.012 device_anonymous_c0 device
0.007 device_anonymous_c1 device
-0.005 device_os device
-0.004 device_country device
0.004 device_anonymous_c2 device
-0.002 ip_c0_sum ip

Code to create the best correlated features:

    import csv

    # reader: a csv.DictReader over the device:cookie candidate-pair file
    for row in reader:
        device_id = row["device_id"]
        cookie_id = row["cookie_id"]
        out_row = [device_id, cookie_id]
        device_ip_list = eval(row["device_ip_list"])  # [[ip, freq, c1..c5], ...]
        cookie_ip_list = eval(row["cookie_ip_list"])

        device_ips = set([ip_list[0] for ip_list in device_ip_list])
        cookie_ips = set([ip_list[0] for ip_list in cookie_ip_list])
        common_ips = list(device_ips.intersection(cookie_ips))

        device_ip_dict = {}
        for ip_list in device_ip_list:
            device_ip_dict[ip_list[0]] = ip_list[1:]

        cookie_ip_dict = {}
        for ip_list in cookie_ip_list:
            cookie_ip_dict[ip_list[0]] = ip_list[1:]
        . . .
        # sum the counter columns over the shared IPs only
        for common_ip in common_ips:
        . . .
            cookie_c4_sum += int(cookie_ip_dict[common_ip][4])
        . . .
        # sum the same columns over all of the cookie's IPs
        for cookie_ip in cookie_ips:
            cookie_freq_all_ip_sum += int(cookie_ip_dict[cookie_ip][0])
            cookie_c1_all_ip_sum += int(cookie_ip_dict[cookie_ip][1])
            cookie_c2_all_ip_sum += int(cookie_ip_dict[cookie_ip][2])
            cookie_c3_all_ip_sum += int(cookie_ip_dict[cookie_ip][3])
            cookie_c4_all_ip_sum += int(cookie_ip_dict[cookie_ip][4])
            cookie_c5_all_ip_sum += int(cookie_ip_dict[cookie_ip][5])
            if ip_agg_dict.has_key(cookie_ip):  # Python 2 idiom
                cookie_cell_ip_sum += int(ip_agg_dict[cookie_ip][0])

        # ratio of shared-IP activity to the cookie's total activity
        cookie_freq_common_ip_by_all_ip = round(cookie_freq_sum / max(cookie_freq_all_ip_sum, 1), 5)
        cookie_c1_common_ip_by_all_ip = round(cookie_c1_sum / max(cookie_c1_all_ip_sum, 1), 5)
        cookie_c2_common_ip_by_all_ip = round(cookie_c2_sum / max(cookie_c2_all_ip_sum, 1), 5)
        cookie_c3_common_ip_by_all_ip = round(cookie_c3_sum / max(cookie_c3_all_ip_sum, 1), 5)
        cookie_c4_common_ip_by_all_ip = round(cookie_c4_sum / max(cookie_c4_all_ip_sum, 1), 5)
        cookie_c5_common_ip_by_all_ip = round(cookie_c5_sum / max(cookie_c5_all_ip_sum, 1), 5)
mlandry22 commented 9 years ago

Thought I was onto something good. But again, our model has me beat.

I started to think that something we aren't doing is seeing how our own predictions for COOKIES overlap across devices. If we are really confident on device A and barely confident on device B, maybe that information is useful for driving down device B, or at least for taking a harder look at whether device A and device B might be the same (hmmm....there's an idea).

However, we are doing quite well in multiple-cookie situations. We rarely see our models be highly confident about more than 1 device per cookie, and when we do, it's often right.

We have a table below counting validation cookies across two dimensions: number of predictions >0.8, and number of actual valid matches. You can read it best one column at a time. Take the "2" column: when we have 2 predictions over 0.8, 364 times there really are 2 matches out there; 23 times there is only 1; 160 times there are none; and 5 times there are 3 matches.

        over80
DV_count      0      1      2      3
       0 504080   8084    160      2
       1   9152  23751     23      0
       2    222    159    364      1
       3      8      4      5      9
       4      0      0      0      1

I created a table and went looking for all sorts of things that might benefit from knowing these probabilities. Yet, they didn't help at all.
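For reference, a table like the one above falls out of a crosstab over per-cookie aggregates, something like this sketch (column names assumed):

    import pandas as pd

    # One row per device:cookie pair with prediction and actual label.
    preds = pd.DataFrame({
        "cookie_id": ["c1", "c1", "c2", "c2", "c3"],
        "prob": [0.90, 0.85, 0.40, 0.95, 0.20],
        "target": [1, 1, 0, 1, 0],
    })

    g = preds.groupby("cookie_id")
    over80 = g["prob"].apply(lambda p: (p > 0.8).sum())
    dv_count = g["target"].sum()
    print(pd.crosstab(dv_count, over80, rownames=["DV_count"], colnames=["over80"]))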

However...... Perhaps when we have a device whose predicted cookie matches a device in the training set, we should look harder at whether the two devices match. We have a known answer in those cases, and it certainly happens that cookies belong to more than 1 device. Perhaps we can build some algorithm that compares whether two devices are really the same. Somewhere in all these comments I posted the overlap rate, and it seemed quite high; I don't know if it's high enough to warrant a brand new model, but it might be interesting. The advantage of having an opinion on whether a test-set device is the same as a training-set device is that we might remove some cookies that are not associated with the training drawbridge handle (with caution), and add some cookies that were associated with the handle.

Hmmm....might be interesting. Especially since device:device matching is much easier, given that the characteristics are directly comparable: OS can be matched against OS, etc.

mlandry22 commented 9 years ago

Overlap is not as high as I was hoping. Every turn is disappointing ;-/ Here was that comment, and 2.3% is not a whole lot. I'll let you guys decide if it's enough to warrant a device:device model.

> The second submission checked whether or not the cookie we were using was known to exist as part of a device, via the drawbridge ID. Again, worse, though not by much. Why? Well, 2.3% of devices in the train set share a drawbridge ID with another device. This means cookies tied to either are valid. And therefore, just because we know a cookie is affiliated with a device in the train set doesn't mean it is not valid in the test set. My pair of simple submissions is evidence.
CarbonCycles commented 9 years ago

Mark,

All very valid and true. At this time, it seems like we are directly using the shared cookies. Your correlations confirm that decision is correct, but I'm also seeing no definitive breaks in the correlation matrix. Usually there should be a pretty significant break in the magnitude...not seeing it.

We are missing something here. It looks like a single user has multiple devices that share a common IP, which goes back to your device:device model.

Maybe it's this:

Drawbridge ID --> multiple devices --> common IP --> several cookies?

R

mlandry22 commented 9 years ago

Not quite following the correlation break comment?

I'm not even sure that the multiple devices have a common IP. They probably do. I just know that if you look at the device file, you sometimes find the same drawbridge handle associated with multiple devices. Here are the real numbers:

0 device IDs: 61,156 (the test set)
1 device ID: 136,202
2 device IDs: 3,096
3 device IDs: 110
4 device IDs: 9
5 device IDs: 2

If I do the same thing for cookies, I see the following:

1: 1,488,005
2: 55,617
3: 8,305
4: 2,212
5: 795
6: 347
7: 196
8: 98
9: 52
10: 59 (interesting?)
11: 28
12: 15
....
28: 1
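Those distributions are just handle sizes counted twice, as in this sketch (file names are the competition's basic files as I recall them):

    import pandas as pd

    devices = pd.read_csv("dev_train_basic.csv")
    cookies = pd.read_csv("cookie_all_basic.csv")

    # Devices per drawbridge handle, then the distribution of those counts;
    # the same groupby on cookies gives the second list.
    print(devices.groupby("drawbridge_handle").size().value_counts().sort_index())
    print(cookies.groupby("drawbridge_handle").size().value_counts().sort_index())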

However....there is no room for a pair of devices to share a pair of cookies, right? A handle can belong to more than one device. A handle can belong to more than one cookie. But a cookie cannot belong to more than one handle, and a device cannot belong to more than one handle. Yet these things surely do really happen. (edit: maybe they don't; I guess you can put all of them through the same handle and everybody is happy)

I'm taking off for the moment, but thinking through whether there's some way we can test a few of these out; it seems like there might be something to exploit here.

CarbonCycles commented 9 years ago

I've found in the past that correlations for parameters will show a natural break. For example, the top 5 will have a high correlation, and then you'll see a drop/step in correlation for the next set, and so on. This correlation matrix shows only a gradual drop.

I don't think it's possible for a pair of devices to share a common cookie..however, a quick google search shows the following: "This IP address is provided by your ISP and is assigned to the device that your modem is connected to, which is typically your router. Therefore, all devices connected to the router (whether wired or wirelessly) will share the same external IP address."

CarbonCycles commented 9 years ago

Sorry meant IP address (not cookie)...are we dealing with a public ip address here?

CarbonCycles commented 9 years ago

Check this article:

http://adexchanger.com/data-exchanges/a-marketers-guide-to-cross-device-identity/

Notice that the cookie gets reset when the browser closes..this could be why we are seeing a crap load of cookies!

CarbonCycles commented 9 years ago

For grins, I ran a quick Pearson correlation on the training set. I saw a nice correlation of close to 0.7 between anonymous_5 and anonymous_7.

df_dev_train_basic.ix[8:11].corr(method='pearson')
Out[12]:
              anonymous_c0  anonymous_5  anonymous_6  anonymous_7
anonymous_c0      1.000000     0.355279     0.333333    -0.297080
anonymous_5       0.355279     1.000000     0.231703     0.718382
anonymous_6       0.333333     0.231703     1.000000     0.351095
anonymous_7     -0.297080     0.718382     0.351095     1.000000

There is an interaction term at play.

CarbonCycles commented 9 years ago

Okay, this is pretty weird. I wanted to count how many unique values exist for each of the different fields (remember we have 1,555,795 unique drawbridge handles) in cookie_basic.csv.

We have 206 countries with 251 different OS and 1669 different browsers...ah?

len(df_cookie_all_basic.computer_os_type.unique()) Out[22]: 251

len(df_cookie_all_basic.computer_browser_version.unique()) Out[23]: 1669

len(df_cookie_all_basic.country.unique()) Out[24]: 206

len(df_cookie_all_basic.anonymous_c0.unique()) Out[25]: 3

len(df_cookie_all_basic.anonymous_c1.unique()) Out[26]: 1422

len(df_cookie_all_basic.anonymous_c2.unique()) Out[27]: 31685

len(df_cookie_all_basic.anonymous_5.unique()) Out[28]: 151

len(df_cookie_all_basic.anonymous_6.unique()) Out[29]: 139

len(df_cookie_all_basic.anonymous_7.unique()) Out[30]: 71

mlandry22 commented 9 years ago

I've got it!!!!

We've forgotten the drawbridge ID all along. Ok, let's walk through an example, using our best submission file.

Line 3 of our current submission file: id_1000035 [id_2748391 id_265577]

So we look up the drawbridge handle for the first cookie, id_2748391, and see handle_590888. Now we look up that handle and find that there are three cookies. We got two right, but there is a third.

So we should feel really confident (1) that we got those two right; (2) that we need to add the third cookie.

This will be huge!
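A sketch of the post-processing this implies (column names assumed; the third cookie here is hypothetical, and a real version would need to respect the "-1" unknown-handle marker):

    import pandas as pd

    # cookie -> handle lookup built from the cookie basic file.
    cookies = pd.DataFrame({
        "cookie_id": ["id_2748391", "id_265577", "id_third"],  # id_third: hypothetical
        "drawbridge_handle": ["handle_590888"] * 3,
    })
    handle_of = dict(zip(cookies["cookie_id"], cookies["drawbridge_handle"]))
    cookies_of = cookies.groupby("drawbridge_handle")["cookie_id"].apply(set).to_dict()

    def expand(predicted):
        """Add every cookie that shares a drawbridge handle with a predicted one."""
        out = set(predicted)
        for c in predicted:
            h = handle_of.get(c)
            if h and h != "-1":  # skip cookies with no known handle
                out |= cookies_of[h]
        return out

    print(expand(["id_2748391", "id_265577"]))  # picks up id_third too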