mlandry22 / icdm-2015

ICDM 2015 Kaggle Competition

ICDM Paper #7

SudalaiRajkumar closed this issue 9 years ago

SudalaiRajkumar commented 9 years ago

I created a new issue thread for paper preparation. I will start with the easiest part to fill in first, the credits section :)

(1) name, as you want it to appear - Sudalai Rajkumar S (2) affiliation - Tiger Analytics (3) address - Guindy, Chennai, India (4) correspondence email - sudalai@tigeranalytics.com (5) correspondence phone - +91 995 233 2232

SudalaiRajkumar commented 9 years ago

Copy pasting a part of Mark's email here

I think the simplest thing for us to get ready are counts of anything you can think of. Here is a good list of things that will help our paper be interesting even if we did the same stuff as other presenters. We won't use it all, but it would be nice to have these on hand:

Intro: count of cookies, count of devices, count of cookies with drawbridge handle, count of unique drawbridge handles, count of devices with drawbridge handle, count of unique drawbridge handles in devices set, count of devices with no handle.

First pass: How large do the dictionaries get? What are some specifics as far as devices and cookies and distribution of cookies/device in that first file.

Final model stuff: at layer 1, how many cookie:device pairs did we use? What were the counts we used in the various dev/val splits? What were the counts used for the final test? Counts would be counts of devices and counts of device:cookie pairs (i.e. total records). At layer 1, pick some bins of probabilities and count up. Perhaps 5% bins, and see the frequency in counts.

At layer 2, show raw counts in the same dev/val/test splits, which should be slightly different than layer 1 since we added cookies sharing handles. At layer 2, the same. We reused val, and split it, so what were those counts? At layer 2, pick some bins of probabilities and count up as with layer 1. Compare layer 2 vs layer 1 movement: change in probabilities. Even better, show the accuracy of those changes. Pick a reasonable metric and show how those that have increased probabilities are more accurate.
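
One way the layer 2 vs layer 1 movement check could look, as a minimal Python sketch. It assumes each validation pair carries a layer 1 probability, a layer 2 probability, and a 0/1 label; the function name `movement_accuracy`, the row layout, and the toy values are all hypothetical, and the metric (share of true pairs among those whose probability rose or fell) is just one reasonable choice.

```python
def movement_accuracy(rows):
    """Group pairs by the direction their probability moved between layers.

    rows: iterable of (p_layer1, p_layer2, label), label 1 = true pair.
    Returns {"increased": (n, share_true), "decreased": (n, share_true)}.
    """
    groups = {"increased": [], "decreased": []}
    for p1, p2, label in rows:
        if p2 > p1:
            groups["increased"].append(label)
        elif p2 < p1:
            groups["decreased"].append(label)
        # pairs whose probability did not change are ignored
    return {
        name: (len(labels), sum(labels) / len(labels) if labels else None)
        for name, labels in groups.items()
    }


# Toy data, not competition results.
rows = [
    (0.40, 0.70, 1),
    (0.30, 0.55, 1),
    (0.50, 0.60, 0),
    (0.60, 0.20, 0),
    (0.45, 0.10, 0),
    (0.35, 0.30, 1),
]
summary = movement_accuracy(rows)
print(summary)
```

If layer 2 is helping, the "increased" group should show a clearly higher share of true pairs than the "decreased" group.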

Final predictions: use our best file and count up total cookies we provided. Show distribution of cookies-per-device in our submission. Also show the number of unique handles provided per device (where we will certainly be wrong about one).
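
The cookies-per-device distribution could be counted along these lines. This is only a sketch: it assumes each submission row looks like `device_id,cookie cookie ...` (space-separated cookies after the first comma), which is an assumption about the file layout, and the IDs below are made up.

```python
from collections import Counter


def cookies_per_device(submission_lines):
    """Return {number_of_cookies: number_of_devices} for a submission."""
    counts = Counter()
    for line in submission_lines:
        _device_id, cookies = line.strip().split(",", 1)
        counts[len(cookies.split())] += 1
    return dict(sorted(counts.items()))


# Toy rows with hypothetical IDs, header line already removed.
lines = [
    "id_1,id_10 id_11",
    "id_2,id_12",
    "id_3,id_13",
    "id_4,id_14 id_15 id_16",
]
dist = cookies_per_device(lines)
print(dist)
```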

SudalaiRajkumar commented 9 years ago

| Variable | Count |
| --- | --- |
| Count of cookies | 2,175,520 |
| Count of cookies with drawbridge handle not equal to "-1" | 1,643,821 |
| Count of cookies with drawbridge handle equal to "-1" | 531,699 |
| Count of unique drawbridge handles after excluding "-1" | 1,555,795 |
| Count of devices in training set | 142,770 |
| Count of unique drawbridge handles associated with devices in training set | 139,419 |
| Count of devices in training set with drawbridge handle equal to "-1" or empty | 0 |
| Count of unique drawbridge handles associated with devices in test set | 61,156 |

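
For anyone reproducing these counts, a minimal Python sketch of the cookie-side tallies. It assumes the cookies have already been loaded into a `{cookie_id: handle}` dict; the function name `handle_counts` and the toy IDs are hypothetical, and only the `"-1"` convention comes from the figures above.

```python
def handle_counts(cookie_handles):
    """Summarise drawbridge handles for a {cookie_id: handle} mapping.

    A handle of "-1" is treated as unknown.
    """
    known = [h for h in cookie_handles.values() if h != "-1"]
    return {
        "cookies": len(cookie_handles),
        "with_handle": len(known),
        "without_handle": len(cookie_handles) - len(known),
        "unique_handles": len(set(known)),
    }


# Toy data (hypothetical IDs), not the competition files.
toy = {"c1": "h1", "c2": "h1", "c3": "h2", "c4": "-1"}
toy_counts = handle_counts(toy)
print(toy_counts)

# Consistency check on the reported figures: the two handle groups
# should add up to the total cookie count.
reported_total = 1_643_821 + 531_699
print(reported_total == 2_175_520)
```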
SudalaiRajkumar commented 9 years ago

Table on Development and Validation data split

| Variable | Value |
| --- | --- |
| Dev-Val sample percentage | 80%-20% |
| Count of devices in Dev sample | 114,234 |
| Count of devices in Val sample | 28,536 |
| Type of sampling | Random |

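
A random device-level split of this kind can be sketched as follows. The device counts above work out to roughly 80/20; the seed, the function name `split_devices`, and the toy IDs are assumptions, and this is not necessarily how the actual split was drawn.

```python
import random


def split_devices(device_ids, dev_fraction=0.8, seed=2015):
    """Randomly split device IDs into (dev, val) lists.

    The seed is an arbitrary choice for reproducibility.
    """
    ids = sorted(device_ids)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(ids)
    n_dev = round(len(ids) * dev_fraction)
    return ids[:n_dev], ids[n_dev:]


devices = [f"id_{i}" for i in range(1000)]  # hypothetical device IDs
dev, val = split_devices(devices)
print(len(dev), len(val))
```

The reported counts (114,234 and 28,536 of 142,770) are close to, but not exactly, an 80/20 cut by count, so the real split may have been drawn per device rather than by exact size; that is a guess.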
SudalaiRajkumar commented 9 years ago

Number of cookies associated with dev sample devices: 143,970

| Cutoff (cookies per IP) | No. of cookies captured | % of cookies captured | Total no. of device-cookie pairs to check |
| --- | --- | --- | --- |
| 10 | 126,196 | 87.65% | 946,987 |
| 20 | 135,457 | 94.09% | 2,647,903 |
| 30 | 137,815 | 95.72% | 5,967,693 |
| 50 | 139,579 | 96.95% | 9,268,530 |
| 100 | 140,529 | 97.61% | 18,952,539 |
| 200 | 141,238 | 98.10% | 60,909,729 |
| Infinity | 141,324 | 98.16% | 82,591,207 |

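
The percentage column can be rechecked from the raw counts, and the same numbers show the trade-off between coverage and workload. A short sketch using only the figures above; the "marginal cost" measure (extra pairs to score per additional cookie captured) is an illustrative metric, not something taken from the thread.

```python
total_cookies = 143_970  # cookies associated with dev sample devices

# (cutoff, cookies captured, device-cookie pairs to check), from the table
rows = [
    (10, 126_196, 946_987),
    (20, 135_457, 2_647_903),
    (30, 137_815, 5_967_693),
    (50, 139_579, 9_268_530),
    (100, 140_529, 18_952_539),
    (200, 141_238, 60_909_729),
    (float("inf"), 141_324, 82_591_207),
]

pct_captured = [f"{100 * captured / total_cookies:.2f}%" for _, captured, _ in rows]

# Extra pairs that must be scored for each additional cookie captured
# when moving from one cutoff to the next.
marginal_cost = [
    (nxt[2] - cur[2]) / (nxt[1] - cur[1])
    for cur, nxt in zip(rows, rows[1:])
]

for (cutoff, _, _), pct in zip(rows, pct_captured):
    print(cutoff, pct)
print([round(m) for m in marginal_cost])
```

The marginal cost rises at every step: about 1,400 extra pairs per additional cookie going from a cutoff of 20 to 30, and about 1,900 going from 30 to 50.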
SudalaiRajkumar commented 9 years ago

We used a cutoff of 30: if more than 30 cookies are associated with an IP address, we treat it as a public IP and exclude its cookies from the device-cookie pairs. The number of pairs in each sample is as follows:

| Sample | No. of device-cookie pairs |
| --- | --- |
| Dev sample | 5,967,693 |
| Val sample | 1,482,988 |
| Test sample | 3,114,717 |

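
The cutoff rule can be sketched like this. It assumes the raw files have already been reduced to two lookups, device to IPs and IP to cookies; those structures, the function name `candidate_pairs`, and the toy data are hypothetical, and this is not the repo's actual code.

```python
def candidate_pairs(device_ips, ip_cookies, cutoff=30):
    """Build the set of (device, cookie) candidate pairs via shared IPs.

    IPs seen with more than `cutoff` cookies are skipped as likely public.
    """
    pairs = set()
    for device, ips in device_ips.items():
        for ip in ips:
            cookies = ip_cookies.get(ip, ())
            if len(cookies) > cutoff:
                continue  # treated as a public IP
            for cookie in cookies:
                pairs.add((device, cookie))
    return pairs


# Toy data with a tiny cutoff so the effect is visible.
device_ips = {"d1": ["ip_home", "ip_cafe"], "d2": ["ip_cafe"]}
ip_cookies = {"ip_home": {"c1", "c2"}, "ip_cafe": {"c3", "c4", "c5"}}

strict = candidate_pairs(device_ips, ip_cookies, cutoff=2)
loose = candidate_pairs(device_ips, ip_cookies, cutoff=float("inf"))
print(len(strict), len(loose))
```

With the strict cutoff only the home IP contributes pairs; removing the cutoff adds every cookie seen at the shared IP for both devices.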
SudalaiRajkumar commented 9 years ago

We sub-sampled the dev sample to run the models. We kept all the 1's as they are and sub-sampled only the 0's (we kept approximately 1 out of every 6 0's; I think the term is negative sub-sampling). So the total number of rows in the dev sample for model building is 1,110,196.
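
A minimal sketch of this kind of negative sub-sampling, assuming rows of `(device_id, cookie_id, label)`. The 1-in-6 keep rate comes from the description above; the seed, the function name `subsample_negatives`, and the toy rows are assumptions.

```python
import random


def subsample_negatives(rows, keep_rate=1 / 6, seed=0):
    """Keep every positive row and a random share of the negatives.

    rows: iterable of (device_id, cookie_id, label), label 1 = true pair.
    """
    rng = random.Random(seed)
    return [row for row in rows if row[2] == 1 or rng.random() < keep_rate]


# Toy data (hypothetical IDs): 100 positives and 6,000 negatives.
rows = [("d%d" % i, "c%d" % i, 1) for i in range(100)]
rows += [("d%d" % i, "x%d" % i, 0) for i in range(6000)]
sample = subsample_negatives(rows)
n_pos = sum(label for _, _, label in sample)
n_neg = len(sample) - n_pos
print(n_pos, n_neg)
```

Because each negative is kept independently with probability 1/6, the number kept is only approximately one sixth of the original.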

We got a validation sample AUC of 0.947, and the counts of validation scores in 5% bins are:

> as.data.frame(table(cut(val$prediction, breaks=seq(0,1, by=0.05))))
         Var1    Freq
1    (0,0.05] 1376887
2  (0.05,0.1]   13888
3  (0.1,0.15]    8841
4  (0.15,0.2]    6293
5  (0.2,0.25]    5136
6  (0.25,0.3]    4403
7  (0.3,0.35]    3793
8  (0.35,0.4]    3406
9  (0.4,0.45]    3212
10 (0.45,0.5]    3064
11 (0.5,0.55]    2978
12 (0.55,0.6]    3058
13 (0.6,0.65]    3086
14 (0.65,0.7]    3472
15 (0.7,0.75]    3861
16 (0.75,0.8]    4605
17 (0.8,0.85]    5506
18 (0.85,0.9]    6748
19 (0.9,0.95]    7218
20   (0.95,1]   13533
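
For reference, the same binning can be done without R. The sketch below mimics `cut(..., breaks=seq(0, 1, by=0.05))` with right-closed intervals using only the Python standard library; the function name `bin_counts` and the toy scores are hypothetical. Like `cut`, it leaves out a score of exactly 0.

```python
from bisect import bisect_left

EDGES = [i / 20 for i in range(21)]  # 0, 0.05, ..., 1


def bin_counts(predictions):
    """Count predictions in right-closed 5% bins: (0,0.05], ..., (0.95,1]."""
    counts = [0] * 20
    for p in predictions:
        idx = bisect_left(EDGES, p)  # first edge that is >= p
        if 1 <= idx <= 20:  # skips p == 0 and p > 1
            counts[idx - 1] += 1
    return [(f"({EDGES[i]:g},{EDGES[i + 1]:g}]", counts[i]) for i in range(20)]


table = bin_counts([0.01, 0.05, 0.051, 0.5, 0.97, 1.0, 0.0])
for label, freq in table:
    if freq:
        print(label, freq)

# Freq column reported above; it should add up to the Val sample pair count.
reported_freq = [
    1376887, 13888, 8841, 6293, 5136, 4403, 3793, 3406, 3212, 3064,
    2978, 3058, 3086, 3472, 3861, 4605, 5506, 6748, 7218, 13533,
]
print(sum(reported_freq))
```

The reported frequencies sum to 1,482,988, which matches the number of device-cookie pairs in the Val sample.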
mlandry22 commented 9 years ago

Wow, that's excellent. That is a lot of good data to be able to use. For now, I would say I need to catch up a little bit on getting the words filled out and these accurate figures included. Then, I'll turn over the document to all three of us. Rob, that's probably the best place for you to get involved: see what corrections you think should be made, or add descriptors.

Or, Rob, a nice diagram showing how this stage-wise thing moves forward would be nice. I tried something that seemed to make sense in my head until I got it down on paper and realized most of it is useless. Nonetheless, here is an example of something that isn't terribly useful and incomplete because it just wasn't making much sense.

[Image: screen shot 2015-09-02 at 11 06 58 am]

mlandry22 commented 9 years ago

Tossing up notes on reorganization

Abstract

Intro

Use Rob's here

Methods

Calculations

Results

Conclusions

mlandry22 commented 9 years ago

I added the latest tex version to the main branch of the GitHub repo.

I haven't yet added the image, but you can take that from the email I sent. Will repost a better version at some point.

Info regarding final edits:

CarbonCycles commented 9 years ago

I have uploaded a .tex file with edits and comments in it. I am still in the process of reviewing, but I decided to throw that up now so you can take a look at it. I'm almost done, and hope to have at least one sweep completed soon.

mlandry22 commented 9 years ago

All done here. In the next few days, I'll probably retire this repo. Retiring can mean making it public or deleting it. So please let me know your thoughts on either decision and how that would impact you. I believe that does not apply to Issues. So I think either way, you'll want anything out of the issues if you care (I like to save them). Code would then become public, if you both agreed, or will be gone forever otherwise. Then we can use this repo for the Springleaf one.

Mark

SudalaiRajkumar commented 9 years ago

I am fine with making it public, Mark. We shall get Rob's opinion and decide based on it.

Thanks, Sudalai
