Copy-pasting a part of Mark's email here:
I think the simplest thing for us to get ready is counts of anything you can think of. Here is a good list of things that will help our paper be interesting even if we did the same stuff as other presenters. We won't use it all, but it would be nice to have these on hand:
Intro: count of cookies, count of devices, count of cookies with drawbridge handle, count of unique drawbridge handles, count of devices with drawbridge handle, count of unique drawbridge handles in devices set, count of devices with no handle.
First pass: How large do the dictionaries get? What are some specifics on devices, cookies, and the distribution of cookies per device in that first file?
Final model stuff: at layer 1, how many cookie:device pairs did we use? What were the counts we used in the various dev/val splits? What were the counts used for the final test? Counts would be counts of devices and counts of device:cookies (i.e. total records). At layer 1, pick some bins of probabilities and count up. Perhaps 5% bins, and see the frequency in counts.
At layer 2, show raw counts in the same dev/val/test splits, which should be slightly different than layer 1 since we added cookies sharing handles. At layer 2, do the same. We reused val and split it, so what were those counts? At layer 2, pick some bins of probabilities and count up as with layer 1. Compare layer 2 vs. layer 1 movement: change in probabilities. Even better, show the accuracy of those changes. Pick a reasonable metric and show how those that have increased probabilities are more accurate.
Final predictions: use our best file and count up total cookies we provided. Show distribution of cookies-per-device in our submission. Also show the number of unique handles provided per device (where we will certainly be wrong about one).
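As one example of the counts in that list, here is a minimal R sketch for the total cookies and the cookies-per-device distribution in a submission file. The data frame `submission` and its columns `device_id` and `cookie_ids` (space-separated cookie IDs) are assumptions for illustration, not the actual file layout we used.

```r
# Assumed layout: one row per device, cookie IDs separated by spaces
cookies_per_device <- sapply(strsplit(submission$cookie_ids, " "), length)

sum(cookies_per_device)     # total cookies provided
table(cookies_per_device)   # distribution of cookies per device
```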
Variable | Count |
---|---|
Count of cookies | 2,175,520 |
Count of cookies with drawbridge handle not equal to "-1" | 1,643,821 |
Count of cookies with drawbridge handle equal to "-1" | 531,699 |
Count of unique drawbridge handles after excluding "-1" | 1,555,795 |
Count of devices in training set | 142,770 |
Count of unique drawbridge handles associated with devices in training set | 139,419 |
Count of devices in training set with drawbridge handle equal to "-1" or empty | 0 |
Count of unique drawbridge handles associated with devices in test set | 61,156 |
Development and validation data split:
Variable | Value |
---|---|
Dev-Val Sample Percentage | 80%-20% |
Count of devices in Dev sample | 114,234 |
Count of devices in Val sample | 28,536 |
Type of sampling | Random |
Number of cookies associated with Dev sample devices: 143,970
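For reference, a minimal sketch of that random 80/20 device split. The data frame `train_devices` (one row per training device) and the seed are illustrative assumptions, not our actual code.

```r
set.seed(2015)                              # illustrative seed
n <- nrow(train_devices)
dev_idx <- sample(n, size = round(0.8 * n))

dev_devices <- train_devices[dev_idx, ]     # ~114K devices (Dev)
val_devices <- train_devices[-dev_idx, ]    # ~28.5K devices (Val)
```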
Cutoff (max cookies per IP) | No. of cookies captured | % of cookies captured | Total no. of device-cookie pairs to check |
---|---|---|---|
10 | 126,196 | 87.65% | 946,987 |
20 | 135,457 | 94.09% | 2,647,903 |
30 | 137,815 | 95.72% | 5,967,693 |
50 | 139,579 | 96.95% | 9,268,530 |
100 | 140,529 | 97.61% | 18,952,539 |
200 | 141,238 | 98.10% | 60,909,729 |
Infinity | 141,324 | 98.16% | 82,591,207 |
We used a cutoff of 30 (i.e., if the number of cookies associated with an IP address is more than this cutoff, we treat it as a public IP and do not include its cookies in the device-cookie pairs). The numbers in each of our samples are as follows (a rough sketch of this filtering appears after the table):
Sample | No. of device-cookie pairs |
---|---|
Dev sample | 5,967,693 |
Val sample | 1,482,988 |
Test sample | 3,114,717 |
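A rough sketch of the IP-cutoff filtering described above, assuming long-format data frames `cookie_ip` (cookie_id, ip) and `device_ip` (device_id, ip); the names and layout are assumptions, not our actual code.

```r
# Count cookies per IP and drop IPs above the cutoff (treated as public IPs)
cookies_per_ip <- table(cookie_ip$ip)
kept_ips <- names(cookies_per_ip)[cookies_per_ip <= 30]
cookie_ip_kept <- cookie_ip[cookie_ip$ip %in% kept_ips, ]

# Candidate device-cookie pairs: device and cookie seen on the same retained IP
candidates <- unique(merge(device_ip, cookie_ip_kept, by = "ip")[, c("device_id", "cookie_id")])
```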
We sub-sampled the Dev sample to run the models. We kept all the 1's as such and sub-sampled only the 0's (we chose roughly 1 out of every 6 0's; the term is negative sub-sampling, I think). So the total number of rows in the Dev sample for model building is 1,110,196.
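A minimal sketch of that negative sub-sampling, assuming a data frame `dev_pairs` with a 0/1 `label` column; the object names, seed, and exact rate are illustrative.

```r
set.seed(2015)                                                # illustrative seed
pos <- dev_pairs[dev_pairs$label == 1, ]                      # keep every positive pair
neg <- dev_pairs[dev_pairs$label == 0, ]
neg_kept <- neg[sample(nrow(neg), round(nrow(neg) / 6)), ]    # keep ~1 in 6 negatives
dev_model <- rbind(pos, neg_kept)
```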
We got a validation sample AUC of 0.947, and the counts of predictions in 5% probability bins are:
```
> as.data.frame(table(cut(val$prediction, breaks=seq(0,1, by=0.05))))
          Var1    Freq
1     (0,0.05] 1376887
2   (0.05,0.1]   13888
3   (0.1,0.15]    8841
4   (0.15,0.2]    6293
5   (0.2,0.25]    5136
6   (0.25,0.3]    4403
7   (0.3,0.35]    3793
8   (0.35,0.4]    3406
9   (0.4,0.45]    3212
10  (0.45,0.5]    3064
11  (0.5,0.55]    2978
12  (0.55,0.6]    3058
13  (0.6,0.65]    3086
14  (0.65,0.7]    3472
15  (0.7,0.75]    3861
16  (0.75,0.8]    4605
17  (0.8,0.85]    5506
18  (0.85,0.9]    6748
19  (0.9,0.95]    7218
20    (0.95,1]   13533
```
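For completeness, a minimal sketch of how the validation AUC could be computed, using the pROC package and assuming `val` has `label` and `prediction` columns; this is an illustration, not necessarily the package or code we actually used.

```r
library(pROC)                          # assumed package

# AUC of the layer-1 validation predictions (reported above as 0.947)
auc(roc(val$label, val$prediction))
```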
Wow, that's excellent. That is a lot of good data to be able to use. For now, I would say I need to catch up a little bit on getting the words filled out and these accurate figures included. Then, I'll turn over the document to all three of us. Rob, that's probably the best place for you to get involved: see what corrections you think should be made, or add descriptors.
Or, Rob, a nice diagram showing how this stage-wise thing moves forward would be nice. I tried something that seemed to make sense in my head until I got it down on paper and realized most of it is useless. Nonetheless, here is an example; it isn't terribly useful and is incomplete because it just wasn't making much sense.
Tossing up notes on reorganization
Use Rob's here
I added the latest tex version to the main branch of the Github.
I haven't yet added the image, but you can take that from the email I sent. Will repost a better version at some point.
Info regarding final edits:
I have uploaded a .tex file with edits and comments in it. I am still in the process of reviewing, but I decided to throw that up now so you can take a look at it. I'm almost done, and hope to have at least one sweep completed soon.
All done here. In the next few days, I'll probably retire this repo. Retiring can mean making it public or deleting it. So please let me know your thoughts on either decision and how that would impact you. I believe that does not apply to Issues. So I think either way, you'll want anything out of the issues if you care (I like to save them). Code would then become public, if you both agreed, or will be gone forever otherwise. Then we can use this repo for the Springleaf one.
Mark
I am fine with making it public, Mark. We shall get Rob's opinion and decide based on that.
Thanks, Sudalai
I created a new issue thread for paper preparation. I will start with the easiest thing to fill in first, the credits section :)
1. Name, as you want it to appear: Sudalai Rajkumar S
2. Affiliation: Tiger Analytics
3. Address: Guindy, Chennai, India
4. Correspondence email: sudalai@tigeranalytics.com
5. Correspondence phone: +91 995 233 2232