Copy-pasting a part of Mark's email here:
I think the simplest thing for us to get ready is counts of anything you can think of. Here is a good list of things that will help our paper be interesting even if we did the same stuff as other presenters. We won't use it all, but it would be nice to have these on hand:
Intro: count of cookies, count of devices, count of cookies with drawbridge handle, count of unique drawbridge handles, count of devices with drawbridge handle, count of unique drawbridge handles in devices set, count of devices with no handle.
First pass: How large do the dictionaries get? What are some specifics on devices, cookies, and the distribution of cookies per device in that first file?
Final model stuff: at layer 1, how many cookie:device pairs did we use? What were the counts we used in the various dev/val splits? What were the counts used for the final test? Counts would be counts of devices and counts of device:cookies (i.e. total records). At layer 1, pick some bins of probabilities and count up. Perhaps 5% bins, and see the frequency in counts.
At layer 2, show raw counts in the same dev/val/test splits, which should be slightly different than layer 1 since we added cookies sharing handles. At layer 2, do the same. We reused val and split it, so what were those counts? At layer 2, pick some bins of probabilities and count up as with layer 1. Compare layer 2 vs. layer 1 movement: change in probabilities. Even better, show the accuracy of those changes. Pick a reasonable metric and show how those that have increased probabilities are more accurate.
Final predictions: use our best file and count up total cookies we provided. Show distribution of cookies-per-device in our submission. Also show the number of unique handles provided per device (where we will certainly be wrong about one).
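As one example of the counts in that list, here is a minimal R sketch for the total cookies and the cookies-per-device distribution in a submission file. The data frame `submission` and its columns `device_id` and `cookie_ids` (space-separated cookie IDs) are assumptions for illustration, not the actual file layout we used.

```r
# Assumed layout: one row per device, cookie IDs separated by spaces
cookies_per_device <- sapply(strsplit(submission$cookie_ids, " "), length)

sum(cookies_per_device)     # total cookies provided
table(cookies_per_device)   # distribution of cookies per device
```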
Variable | Count |
---|---|
Count of cookies | 2,175,520 |
Count of cookies with drawbridge handle not equal to "-1" | 1,643,821 |
Count of cookies with drawbridge handle equal to "-1" | 531,699 |
Count of unique drawbridge handles after excluding "-1" | 1,555,795 |
Count of devices in training set | 142,770 |
Count of unique drawbridge handles associated with devices in training set | 139,419 |
Count of devices in training set with drawbridge handle equal to "-1" or empty | 0 |
Count of unique drawbridge handles associated with devices in test set | 61,156 |
Development and validation data split:
Variable | Value |
---|---|
Dev-Val Sample Percentage | 80%-20% |
Count of devices in Dev sample | 114,234 |
Count of devices in Val sample | 28,536 |
Type of sampling | Random |
Number of cookies associated with Dev sample devices: 143,970
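For reference, a minimal sketch of that random 80/20 device split. The data frame `train_devices` (one row per training device) and the seed are illustrative assumptions, not our actual code.

```r
set.seed(2015)                              # illustrative seed
n <- nrow(train_devices)
dev_idx <- sample(n, size = round(0.8 * n))

dev_devices <- train_devices[dev_idx, ]     # ~114K devices (Dev)
val_devices <- train_devices[-dev_idx, ]    # ~28.5K devices (Val)
```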
Cutoff (max cookies per IP) | No. of cookies captured | % of cookies captured | Total no. of device-cookie pairs to check |
---|---|---|---|
10 | 126,196 | 87.65% | 946,987 |
20 | 135,457 | 94.09% | 2,647,903 |
30 | 137,815 | 95.72% | 5,967,693 |
50 | 139,579 | 96.95% | 9,268,530 |
100 | 140,529 | 97.61% | 18,952,539 |
200 | 141,238 | 98.10% | 60,909,729 |
Infinity | 141,324 | 98.16% | 82,591,207 |
We used a cutoff of 30 (i.e., if the number of cookies associated with an IP address is more than this cutoff, we treat it as a public IP and do not include its cookies in the device-cookie pairs). The numbers in each of our samples are as follows (a rough sketch of this filtering appears after the table):
Sample | No. of device-cookie pairs |
---|---|
Dev sample | 5,967,693 |
Val sample | 1,482,988 |
Test sample | 3,114,717 |
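A rough sketch of the IP-cutoff filtering described above, assuming long-format data frames `cookie_ip` (cookie_id, ip) and `device_ip` (device_id, ip); the names and layout are assumptions, not our actual code.

```r
# Count cookies per IP and drop IPs above the cutoff (treated as public IPs)
cookies_per_ip <- table(cookie_ip$ip)
kept_ips <- names(cookies_per_ip)[cookies_per_ip <= 30]
cookie_ip_kept <- cookie_ip[cookie_ip$ip %in% kept_ips, ]

# Candidate device-cookie pairs: device and cookie seen on the same retained IP
candidates <- unique(merge(device_ip, cookie_ip_kept, by = "ip")[, c("device_id", "cookie_id")])
```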
We sub-sampled the Dev sample to run the models. We kept all the 1's as such and sub-sampled only the 0's (we chose roughly 1 out of every 6 0's; the term is negative sub-sampling, I think). So the total number of rows in the Dev sample for model building is 1,110,196.
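A minimal sketch of that negative sub-sampling, assuming a data frame `dev_pairs` with a 0/1 `label` column; the object names, seed, and exact rate are illustrative.

```r
set.seed(2015)                                                # illustrative seed
pos <- dev_pairs[dev_pairs$label == 1, ]                      # keep every positive pair
neg <- dev_pairs[dev_pairs$label == 0, ]
neg_kept <- neg[sample(nrow(neg), round(nrow(neg) / 6)), ]    # keep ~1 in 6 negatives
dev_model <- rbind(pos, neg_kept)
```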
We got a validation sample AUC of 0.947, and the counts of predictions in 5% probability bins are:
```
> as.data.frame(table(cut(val$prediction, breaks=seq(0,1, by=0.05))))
          Var1    Freq
1     (0,0.05] 1376887
2   (0.05,0.1]   13888
3   (0.1,0.15]    8841
4   (0.15,0.2]    6293
5   (0.2,0.25]    5136
6   (0.25,0.3]    4403
7   (0.3,0.35]    3793
8   (0.35,0.4]    3406
9   (0.4,0.45]    3212
10  (0.45,0.5]    3064
11  (0.5,0.55]    2978
12  (0.55,0.6]    3058
13  (0.6,0.65]    3086
14  (0.65,0.7]    3472
15  (0.7,0.75]    3861
16  (0.75,0.8]    4605
17  (0.8,0.85]    5506
18  (0.85,0.9]    6748
19  (0.9,0.95]    7218
20    (0.95,1]   13533
```
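For completeness, a minimal sketch of how the validation AUC could be computed, using the pROC package and assuming `val` has `label` and `prediction` columns; this is an illustration, not necessarily the package or code we actually used.

```r
library(pROC)                          # assumed package

# AUC of the layer-1 validation predictions (reported above as 0.947)
auc(roc(val$label, val$prediction))
```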
Wow, that's excellent. That is a lot of good data to be able to use. For now, I would say I need to catch up a little bit on getting the words filled out and these accurate figures included. Then, I'll turn over the document to all three of us. Rob, that's probably the best place for you to get involved: see what corrections you think should be made, or add descriptors.
Or, Rob, a nice diagram showing how this stage-wise thing moves forward would be nice. I tried something that seemed to make sense in my head until I got it down on paper and realized most of it is useless. Nonetheless, here is an example; it isn't terribly useful and is incomplete because it just wasn't making much sense.
Tossing up notes on reorganization
Use Rob's here
I added the latest tex version to the main branch of the Github.
I haven't yet added the image, but you can take that from the email I sent. Will repost a better version at some point.
Info regarding final edits:
I have uploaded a .tex file with edits and comments in it. I am still in the process of reviewing, but I decided to throw that up now so you can take a look at it. I'm almost done, and hope to have at least one sweep completed soon.
All done here. In the next few days, I'll probably retire this repo. Retiring can mean making it public or deleting it. So please let me know your thoughts on either decision and how that would impact you. I believe that does not apply to Issues. So I think either way, you'll want anything out of the issues if you care (I like to save them). Code would then become public, if you both agreed, or will be gone forever otherwise. Then we can use this repo for the Springleaf one.
Mark
I am fine with making it public, Mark. We shall get Rob's opinion and decide based on that.
Thanks, Sudalai
I created a new issue thread for paper preparation. I will start with the easiest thing to fill in first, the credits section :)
1. Name, as you want it to appear: Sudalai Rajkumar S
2. Affiliation: Tiger Analytics
3. Address: Guindy, Chennai, India
4. Correspondence email: sudalai@tigeranalytics.com
5. Correspondence phone: +91 995 233 2232