Very interesting article Rob!
Great work, Mark! So this is the missing piece then? :D
I think so. Almost definitely worth several spots, because about 75% of all cookies have known drawbridge handles. We don't always know to what device, but we often know what cookies should be included as a set. I will be so tempted to try it out that I might make a mistake. But here is how I envision it playing out:
I've been going back and forth between two laptops and don't have the one with the probabilities on it right now. But I'll parse our best submission thus far and try to quantify the situations described above. Unless I get too tired, I will be greatly excited to submit the first new file, unless my quantification shows that the case I pulled out is in fact a rare one where we are already right.
Great finding, Mark! I'm super excited as well, after all the hard work we've put into it!
Really eager to see the results. Please go ahead and make one (or if needed both) submission(s) using the new logic and see the results. :)
Thanks. I certainly will.
Once we see how much one path can get us, we certainly have some other options of how to use this to our advantage. And it is perfectly valid to use these methods in our validation file. We can test methodologies perhaps, such as the decisions I listed out above.
Here is how the first 10 rows look, when just obtaining the drawbridge IDs (not yet finding other cookies). I have this data for the entire first submission. I just need to store the results of the crosswalks and I'll be close to having a revised submission.
```
   device_id   cookie_id                         cookie1     cookie2     cookie3     handle1         handle2         handle3
1  id_1        id_1016089                        id_1016089  <NA>        <NA>        handle_221899
2  id_100002   id_3864592                        id_3864592  <NA>        <NA>        handle_1611691
3  id_1000035  id_2748391 id_265577              id_2748391  id_265577   <NA>        handle_590888   handle_590888
4  id_1000099  id_856312 id_595924               id_856312   id_595924   <NA>        handle_678019   handle_678019
5  id_1000294  id_1934707 id_3757139             id_1934707  id_3757139  <NA>        handle_1420206  -1
6  id_1000305  id_614216                         id_614216   <NA>        <NA>        handle_1083187
7  id_1000310  id_2357857                        id_2357857  <NA>        <NA>        handle_1336744
8  id_1000414  id_1176794                        id_1176794  <NA>        <NA>        handle_1265289
9  id_1000497  id_1127125 id_2640273 id_1869275  id_1127125  id_2640273  id_1869275  handle_2069288  handle_304997   handle_1305316
10 id_1000594  id_4588309 id_3767904 id_3168726  id_4588309  id_3767904  id_3168726  -1              handle_1059510  handle_113791
```
So line 4 looks like line 3, though in this case the handle is connected to both of our guesses, so we should feel quite confident. Line 5 is interesting: it has a known drawbridge handle and a -1. For today, we probably leave it as is. Line 7 crosswalks to an additional cookie we hadn't picked up. Line 9 is where we can really improve, if we can get it right. These have three differing drawbridge handles. Ideally, we'd go back to the probabilities and ensure that they are all good. If one was at 99% and the others were at 80% and 65%, we might now have incentive to cut out the other two, knowing that 2 of the 3 are guaranteed to be wrong.
Validation can tell us that. So I would expect future work to focus on a second-level algorithm, but not one like we had before. This time, something that takes advantage of what we now know, after getting better clues from the drawbridge handles, and tries to make all-in/all-out type decisions, particularly when we have multiple different cookies. Line 9 is capped at 33% precision, since 2 are guaranteed to be wrong. But if we throw out the wrong one, we drop to 0% precision and 0% recall. So we don't want to just guess in these scenarios. Hopefully we can find a decent statistical model that works on a different dimension and utilizes those first-level probabilities.
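To make the crosswalk step concrete, here is a minimal Python sketch of the expansion described above. The lookup tables are illustrative (the ids echo the sample table, but the mapping itself is made up), not the real competition data:

```python
from collections import defaultdict

# Hypothetical cookie -> handle lookup ("-1" means the handle is unknown).
cookie_handle = {
    "id_1016089": "handle_221899",
    "id_2748391": "handle_590888",
    "id_265577":  "handle_590888",
    "id_1934707": "handle_1420206",
    "id_3757139": "-1",
}

# Reverse index: every cookie known to share a drawbridge handle.
handle_cookies = defaultdict(set)
for cookie, handle in cookie_handle.items():
    if handle != "-1":  # unknown handles give us no crosswalk
        handle_cookies[handle].add(cookie)

def expand_prediction(cookies):
    """Add every cookie that shares a known handle with a predicted cookie."""
    expanded = set(cookies)
    for c in cookies:
        expanded |= handle_cookies.get(cookie_handle.get(c, "-1"), set())
    return expanded

# A line-3-style case: predicting one cookie pulls in its handle-mate.
print(sorted(expand_prediction({"id_2748391"})))  # ['id_265577', 'id_2748391']
```

A -1 cookie adds nothing under this scheme, which is exactly the open question about whether to keep or drop those.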
I made a small ugly script to check our hypothesis (SRK/getMetric_new_v2.py). I basically added a new function in our existing script, to get the cookies using the drawbridge handle. I tried implementing some of the suggestions given by Mark. This is just a basic function and we could modify this a lot to get a better one I think. Also the code is messy since I just added to the existing script. Sorry about that. If this works well, let me re-write the code so that it will be easy for us to play around with the parameters.
We got about 0.78 val sample F0.5 score when I included the cookies with "drawbridge_handle = -1". When I didn't include them we got about 0.835. This is because in val sample, no cookie from "drawbridge=-1" is present, since we prepared the DV itself by using the walk over from device to cookie using drawbridge_handle :)
If those cookies with "drawbridge_handle=-1" are truly unknown in most cases in the test set as well, then using the same function should give us around a 0.83 LB score (my hunch is that this is what is happening, and so top people are safely removing cookies with drawbridge=-1 in their models). If not, we need to include those cookies with unknown handles in our predictions as well.
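For reference, the competition metric is F0.5, the F-beta score with beta = 0.5, which weights precision over recall. A minimal per-device version:

```python
def f_beta(predicted, actual, beta=0.5):
    """F-beta score for one device; beta < 1 weights precision over recall."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Adding one wrong cookie alongside the right one costs a lot under F0.5:
print(f_beta({"id_a"}, {"id_a"}))          # 1.0
print(f_beta({"id_a", "id_b"}, {"id_a"}))  # 0.555...
```

This is why including doubtful -1 cookies is so costly: every extra wrong guess hits precision, and precision dominates the score.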
Sudalai,
I think you're right about the missing or null drawbridge handles. I was curious how that was being handled.
I still think we are missing some vital information based on some of the research I pulled last night. Going back to what Mark stated and looking at his latest table:
- A handle can belong to more than one device. --> True
- A handle can belong to more than one cookie. --> True
- But a cookie cannot belong to more than one handle. --> True
- And a device cannot belong to more than one handle. --> Is this true? I couldn't find a clear representation, yet these things surely do happen.
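Those cardinality rules are easy to sanity-check on the data. A toy sketch (the links below are made up; only the rules being asserted come from the discussion):

```python
# Hypothetical (cookie, handle) and (device, handle) link lists.
cookie_links = [("c1", "h1"), ("c2", "h1"), ("c3", "h2")]
device_links = [("d1", "h1"), ("d2", "h1")]

def handles_per_key(links):
    """How many distinct handles each cookie/device maps to."""
    seen = {}
    for key, handle in links:
        seen.setdefault(key, set()).add(handle)
    return {k: len(v) for k, v in seen.items()}

# A cookie or device should map to exactly one handle...
assert max(handles_per_key(cookie_links).values()) == 1
assert max(handles_per_key(device_links).values()) == 1
# ...while one handle (h1 here) can cover several cookies and devices.
assert len({c for c, h in cookie_links if h == "h1"}) == 2
```

Running the same check on the real cookie and device tables would settle the "is this true?" question directly.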
Based off what we are seeing, I think the biggest bang will come from Sudalai's removal of the -1s.
I'm arguing (maybe completely off base) the following:
1. We are seeing public-facing IP addresses, which is why we have so many devices using common IPs via router, modem, etc. There are probably a few cases where we get the actual device ID with IP (i.e., low counts).
2. Based off the linked article, for mobile devices the cookie id is reset after the session is closed, thereby giving a huge number of transient cookie ids.
3. The article hints heavily at the use of other anonymous factors such as the OS type, browser version, etc.
Long story short, I'm really impressed with the F score using data that is both transient and difficult to correlate...very impressed.
I personally think that if someone were to submit with the -1s removed, the score would go up a few places. In addition, I think the cookie basic table needs a closer inspection. Are you guys aware of the use of dummy variables to help expand categorical columns?
Crap, late for work...got to leave for now.
Rob
Guys,
One more thought on the use of logic gates as I was driving in as it relates to Mark's comments on Drawbridge Id.
There are a few use case scenarios we could use to leverage how to proceed through the gates:
Case 1: All desktops...
Case 2: Desktop to mobile (is the IP cellular or wifi?)
Case 3: Mobile to mobile (is the IP cellular or wifi?)
Case 4: Mobile to Desktop
I'm assuming we don't care about Case 1... not interesting and not relevant to what I'm guessing Drawbridge is trying to do: identify mobile users via cookie and serve ads during the session (i.e., near-real-time sessionization of ads). Not sure if we even care about Case 4, or if Case 4 is just Case 3; it may not be possible to determine the direction since we aren't given a timestamp.
We could add a very quick logic gate that says: if ALL your cookies appear to have come from a desktop, skip (i.e., do not waste your time processing the information).
Gate 2 would be to carry along additional columns: Check for IP cellular, check for OS and browser information. Add that to the current cookie and ip fields.
Again guys, sorry for really overthinking this thing, but I feel like we are leaving a lot of easy money on the table right now... kind of like playing roulette and always betting black.
Rob
Sudalai, you've beat me to a submission, and I feel more confident you'll get it right, so perhaps we try both methods and see what happens?
The organizers tell us that the -1's are either cookies they really do not know about or cookies they have purposely removed the ID from. At 25%, it's hard to know what to do about that, but we can try today's submissions on both sides, collect the gain/loss, and then make an educated decision about it throughout the rest of the day. With some people at 0.89, I would try the "no -1's" version first (unless we come up with a better guess).
I think a middle ground would be to attach a negative weight on any -1 such that we can still use it if we are really confident, but not otherwise.
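One way to read that middle ground in code: demote, rather than drop, candidates whose handle is unknown. The PENALTY and CUTOFF values here are made-up knobs we would tune on the validation sample:

```python
PENALTY = 0.3  # hypothetical demotion for a -1 handle
CUTOFF = 0.5   # hypothetical inclusion threshold

def adjusted_score(score, handle):
    """Subtract a penalty from the model score when the handle is unknown."""
    return score - PENALTY if handle == "-1" else score

# A very confident -1 cookie still clears the cutoff; a marginal one does not.
assert adjusted_score(0.95, "-1") > CUTOFF
assert adjusted_score(0.70, "-1") < CUTOFF
```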
Rob, you know that all cookies come from non-mobile systems and all devices come from mobile systems, right? No overlap, per the organizers. I have been surprised that there isn't more power to the is_cellular tag in the IP. I think Sudalai has features already that cover a lot of what you're mentioning, but perhaps you can show an example where such a feature would help and isn't covered?
With just a few days left, my guess is that experimenting with how to use the drawbridge_id properly is going to be worth a lot of F0.5. Though we already know from Sudalai's investigation that we're capped at 0.83 so far, and if any -1's are valid, we'll be short of that. But still, there are other ways to try and use the data. For example, with all these known drawbridge_handles, we will likely wind up with many cookies that span devices, so that should help us choose one drawbridge_handle over another. If we can get that far.
Every time....chain posts from Mark.
Sudalai, for whatever file we submit, we also might want to use your sub3.csv to fill in some of the id_10's. The small sample I looked at had real-looking cookies for all places we needed some. So it's almost certainly better than using id_10.
I added just those records as "id_10_fixes.csv" in the SRK folder. That should make it easy to use. As a last resort, consult that file to get some cookies to use for those 703 devices where we have no answer right now. It is pulled from sub3.csv, but if we want to improve it, we can just change that CSV to be our best guesses.
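The fallback lookup is simple enough to sketch. The (device_id, cookie_id) layout for id_10_fixes.csv is assumed from the discussion, and the file content below is a stand-in:

```python
import csv
import io

# Stand-in for id_10_fixes.csv (real file layout is an assumption).
fixes_csv = "device_id,cookie_id\nid_1000x,id_55 id_56\n"
fixes = {row["device_id"]: row["cookie_id"]
         for row in csv.DictReader(io.StringIO(fixes_csv))}

def fill_id_10(device_id, prediction):
    """Last resort: swap the id_10 placeholder for the best-guess cookies."""
    if prediction == "id_10":
        return fixes.get(device_id, prediction)
    return prediction

print(fill_id_10("id_1000x", "id_10"))  # id_55 id_56
```

Because the lookup only fires on the id_10 placeholder, it can never make an existing prediction worse.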
Mark,
As you mentioned, let us try the file with no -1's first. It will help us make an educated guess. And we can add the values from sub3.csv for missing ones.
Please go ahead and make a submission and let us know :)
OK, I will do that. I'm quickly going to post an updated version of id_10_fixes.csv that also uses the drawbridge ID to get more based on the handles. So 703 goes to 769, but I'll condense the cookie_id into a space-delimited list. I suppose which format is best depends on where it is used, and it's a tiny file, so I'll post both versions.
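The condensing step is a one-liner in either direction (ids below are from the sample table; the round trip between the two formats is the point):

```python
# Cookie ids for one device, condensed to the space-delimited form and back.
cookies = ["id_4588309", "id_3767904", "id_3168726"]
cookie_id = " ".join(cookies)        # one-column, space-delimited format
assert cookie_id.split() == cookies  # back to the multi-column view
```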
Also, I have uploaded the test_predictions.7z and val_predictions.7z files to the shared Google Drive in case they are needed.
I keep getting a 0-byte sub19.csv file when I run the code. Will look into it, but figured I'd alert in case you know of an obvious thing.
Oh, ha. It's commented out, right?
Hehe, yes. Sorry about that. I was just trying out various params and so commented it out!
Mark,
Yea, I do recall you mentioning that early on...getting way wrapped around the axle here. I went bouncing through the forum looking for some indicators here, and Sudalai, the following regarding Drawbridge Handle being -1 may be of interest to you (assuming you haven't already read this):
Sudalai, quick question...does the feature selection use Anonymous 5 within it?
R
While we're all chiming in... Sudalai, do you know what happens if our only prediction was a -1 handle? Hopefully we keep it?
Mark, just to make sure, are you getting a val F0.5 score of 0.83 when you run? Yes, we are keeping it as such when all are -1. There are about 302 cases like that in the val sample. Hope this helps.
Rob, thanks for the link. Yes, they mentioned some of the handles are removed by them. We are not sure how many. That is probably why I think we need to make two submissions: one completely removing them and one keeping them as well.
YAY!!!!!
We went up 11 spots to 12th place. F0.5 of 0.831771
That is really awesome !!!!!!!!!!!!!
Great finding Mark :)
Rob, yes, we are using anonymous 5 in our features. We are using all the variables except the property- and category-related ones.
Now we can all relax for a bit about what we haven't found. Here is an abbreviated version of where we sit:
- 0.89: 1 team
- 0.87: 2 teams
- 0.85: 7 teams
- 0.83: 2 teams (including us)
So 0.03 improvement will vault us into 4th. We no longer need a huge finding to gain a lot of spots. So we can try and compare smaller ideas now as well. I'll probably try and put some thought into ways to improve on this new finding. But thanks so much Sudalai for getting a working version so quickly. My R code was painfully going step by step.
Thanks Mark. There are certainly improvements we could make to the code. Some initial thoughts:

- We could probably find drawbridge handles for some of the -1 cookies as well.
- Right now we have a hard cutoff of selecting the top 5 cookies with a difference threshold of 0.35; we could probably come up with something better.
- When more than one drawbridge handle covers the same number of cookies in a prediction, we currently use cookies from both handles. We could instead keep only the cookies from the highest-scoring cookie's handle when the score gap between the cookies is large.
- Maybe not part of this script, but better matching for the missing 702 cookies.

Sure. Now that we have crossed the 0.8 hurdle, let us aim for #4 :)

We should probably also try a version where we attach some of our high-scoring cookies (say 0.95 or 0.98 and above) with drawbridge handle "-1" to our existing best prediction file and see the result. From what we see in the val sample, false positives are very rare when the prediction score is high.

If we get a better score in the new submission, we can assume that they really have removed some handles from the cookie file, and we can also include cookies with handle=-1 in further submissions. Or if we get a worse score, then we can safely assume all cookies with handle=-1 are truly unknown, and we can use our val sample score to gauge our performance without needing to submit to the LB often.

Please let me know your suggestions.

Very cool and congrats! I kind of feel like I'm not being very helpful :(
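The high-score-threshold experiment for -1 cookies proposed above could look something like this sketch; the 0.95 cutoff and the (cookie, score, handle) layout are assumptions to be tuned on the val sample:

```python
def keep_cookie(score, handle, cutoff=0.95):
    """Always keep known-handle cookies; -1 cookies only when very confident."""
    return handle != "-1" or score >= cutoff

# (cookie, model score, handle) triples; all values are made up.
preds = [("id_a", 0.99, "-1"), ("id_b", 0.80, "-1"), ("id_c", 0.60, "handle_7")]
kept = [c for c, s, h in preds if keep_cookie(s, h)]
assert kept == ["id_a", "id_c"]
```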
Rob, no worries. You have really brought up some interesting ideas. It is just that this competition was very cruel to us, at least until our last submission.
I really like the collaboration we have in our team. I would love to team up with both of you guys again in another competition if you both are interested :)
I also like the collaboration. I know I've written way more words on here than were read carefully, but it helps to try and articulate ideas to other people--you catch oversights that way and think of new things. And then good ideas can come in small forms as well: Rob listing the count of drawbridge IDs helped speed up my answer to the puzzle.
We have 9 submissions left. That's not many. And right now there are some useful things we can do. However, I realize we have a huge concept we don't know anything about short of submitting: the nature of the -1. I think our CV score being so close to the Kaggle score tells us they are mostly, if not entirely, noise. But with 45 minutes to go, I decided to drop them from the entire submission to see the impact. The closer the score is to our current one, the more aggressive we can be in finding alternatives when a -1 shows up in our list of best cookies. So it's a burned submission, but so far our gains have always had a good rationale and been testable, so I don't think doing anything else with today's second submission would be more insightful.
Wow! No change whatsoever. Same score after removing even the predictions that had only a -1 as the best cookie.
This is insightful. In its simplest form, I think we should retrain our XGBoost models with the -1's removed. It might even be useful to get rid of them everywhere, but I'm not sure. So the first person to add the Drawbridge ID to the main feature creation code will be very helpful, so we know what to do. Removing those will change the calculations too, which is perhaps what we want.
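In code, the simplest form of that retraining filter might be the following; the column name drawbridge_handle comes from the discussion, but the row layout is hypothetical:

```python
# Hypothetical candidate rows; the real pipeline builds these from the
# cookie basic table before XGBoost training.
rows = [
    {"cookie": "id_a", "drawbridge_handle": "handle_1", "label": 1},
    {"cookie": "id_b", "drawbridge_handle": "-1",       "label": 0},
]

# Drop -1-handle cookies so the model never trains on them.
train_rows = [r for r in rows if r["drawbridge_handle"] != "-1"]
assert [r["cookie"] for r in train_rows] == ["id_a"]
```

Applying the same filter everywhere would also change any count-based features computed over the cookie table, which may itself be part of the win.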
This thread is huge, so I'm going to close it and start a new one for the last few days. Expect it to be filled with ways we can exploit drawbridge_ids!
Thread to capture ideas of how we can improve our score. Sudalai has been doing great work implementing some great features. How should we focus our future work to get our score into the top few?
Ideally, this will be like a brainstorm. One post per idea, and maybe some discussion or something. But in its best form, it would become a list to constantly scan over and see what ideas we might want to try.