mlandry22 / icdm-2015

ICDM 2015 Kaggle Competition

Last Four Days #6

Closed mlandry22 closed 9 years ago

mlandry22 commented 9 years ago

New thread for our final efforts.

mlandry22 commented 9 years ago

If I'm up to some python coding, here are a few things I'm going to try and implement:

  1. Add drawbridge ID to the core data set.
  2. When the handle is in the file, it can be easily tried with/without the -1s (e.g. grep "handle").
  3. In the device loop, count the number of other cookies sharing the current cookie's drawbridge handle.
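A minimal Python sketch of the handle-count idea, not code from the repo. It assumes a hypothetical `cookie_handles` mapping of cookie ID to drawbridge handle, with "-1" marking an unknown handle; the function name is made up for illustration.

```python
from collections import Counter

def count_handle_siblings(cookie_handles, unknown="-1"):
    """Map each cookie to the number of OTHER cookies sharing its handle."""
    # Count cookies per known handle; unknown handles are never grouped.
    handle_counts = Counter(h for h in cookie_handles.values() if h != unknown)
    return {
        cookie: 0 if handle == unknown else handle_counts[handle] - 1
        for cookie, handle in cookie_handles.items()
    }

# Toy data (hypothetical IDs).
cookie_handles = {
    "id_1": "handle_A",
    "id_2": "handle_A",
    "id_3": "handle_B",
    "id_4": "-1",
    "id_5": "-1",
}
siblings = count_handle_siblings(cookie_handles)
```

Treating "-1" as "no siblings" keeps all the unknown cookies from looking like one giant group.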

mlandry22 commented 9 years ago

I hacked a way to get the second idea in, and it bumped our local CV up. On the submission I made this afternoon: 0.8367 local and in Kaggle 0.831771

I just removed all the -1 cookies directly from the dev/val/test predictions files: 0.84129 local. If we got the same basis improvement (0.00459), it would suggest a score of 0.836. That's worth one spot. But only one. To me, not worth 1 of 8 submissions just yet, but it does hint that it's the right thing to do.
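The projection above is just the local gain added to the last leaderboard score. A quick check in Python, using only the three numbers quoted in this comment (variable names are illustrative):

```python
local_before = 0.8367          # local CV of the afternoon submission
local_after = 0.84129          # local CV after removing the -1 cookies
leaderboard_before = 0.831771  # Kaggle score of the afternoon submission

# Assume the leaderboard moves by the same absolute amount as local CV.
local_gain = local_after - local_before
projected = leaderboard_before + local_gain

print(round(local_gain, 5), round(projected, 3))
```

This assumes local and leaderboard scores shift by the same amount, which is only a rough guide.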

Hack way of doing it quickly:

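library(data.table)
cookies<-fread("cookies.csv")
# keep only the first two columns (cookie_id and drawbridge_handle)
c<-cookies[,1:2,with=FALSE]
s<-"test"  ## repeat with "dev" and "val"
a<-fread(paste0(s,"_predictions.csv"))
a<-merge(a,c,by="cookie_id",all.x=TRUE)
# drop rows whose cookie has the unknown handle "-1", then overwrite the predictions file
write.csv(a[drawbridge_handle!="-1",.(DV,cookie_id,device_id,prediction)],paste0(s,"_predictions.csv"),row.names=FALSE)

SudalaiRajkumar commented 9 years ago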

Thanks Mark. We will also see if we could do something which helps us gain those extra spots.

mlandry22 commented 9 years ago

Writing notes, mostly for my own tracking.

Simple dual model. Split dev into two parts. Get predictions for dev1, dev2, val, and test. Then create drawbridge-based features onto dev2, val, and test. Retrain. Check val to see if it's useful. Feature types: number of predicted drawbridge handles; total probability by drawbridge handle
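A rough Python sketch of the two feature types named above, assuming first-level predictions arrive as hypothetical `(device_id, cookie_id, probability)` tuples and that `cookie_handles` maps cookie ID to drawbridge handle ("-1" for unknown). The feature names are invented for illustration.

```python
from collections import defaultdict

def drawbridge_features(predictions, cookie_handles, unknown="-1"):
    """Add per-device handle count and per-handle probability sum to each row."""
    handle_prob = defaultdict(float)   # (device, handle) -> summed probability
    device_handles = defaultdict(set)  # device -> distinct known handles
    for device, cookie, prob in predictions:
        handle = cookie_handles.get(cookie, unknown)
        if handle == unknown:
            continue
        handle_prob[(device, handle)] += prob
        device_handles[device].add(handle)

    features = []
    for device, cookie, prob in predictions:
        handle = cookie_handles.get(cookie, unknown)
        features.append({
            "device_id": device,
            "cookie_id": cookie,
            "prediction": prob,
            # number of predicted drawbridge handles for this device
            "n_handles": len(device_handles[device]),
            # total probability of this row's handle (0.0 if handle unknown)
            "handle_prob_sum": handle_prob.get((device, handle), 0.0),
        })
    return features

# Toy data (hypothetical IDs).
predictions = [("d1", "c1", 0.9), ("d1", "c2", 0.5), ("d1", "c3", 0.2), ("d2", "c4", 0.7)]
cookie_handles = {"c1": "hA", "c2": "hA", "c3": "hB", "c4": "-1"}
features = drawbridge_features(predictions, cookie_handles)
```

These rows would be computed for dev2, val, and test from the first model's predictions and then fed to the retrained model.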

Have a new "deeper" model to work harder on just those where we have no guess. That number has increased. For the original set, use the big server to do the deeper dive, just for those devices where we have nothing. For the others, perhaps see if those devices look like other devices somehow, and if so, use those cookies.

If we have scenarios where we have 1+ drawbridge handle found naturally, and then some others at a < X% (90?), remove the latter ones. Assume the drawbridge handle for the pair is correct. This would likely be unnecessary if we had the second-level model that added drawbridge features. But it's easier to implement.
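One possible reading of this rule, sketched in Python. The assumptions are mine: the handle of the highest-scoring cookie with a known handle is trusted, and any cookie with a different (or unknown) handle is dropped unless its probability reaches the threshold X. Names and the 0.9 default are illustrative.

```python
def prune_by_handle(candidates, cookie_handles, threshold=0.9, unknown="-1"):
    """candidates: list of (cookie_id, probability) for a single device."""
    known = [(c, p) for c, p in candidates
             if cookie_handles.get(c, unknown) != unknown]
    if not known:
        # No handle found naturally, so there is nothing to prune against.
        return list(candidates)
    # Trust the handle of the best-scoring cookie that has one.
    best_cookie = max(known, key=lambda cp: cp[1])[0]
    trusted = cookie_handles[best_cookie]
    return [
        (c, p) for c, p in candidates
        if cookie_handles.get(c, unknown) == trusted or p >= threshold
    ]

# Toy data (hypothetical IDs).
cookie_handles = {"c1": "hA", "c2": "hA", "c3": "hB", "c4": "-1"}
candidates = [("c1", 0.95), ("c2", 0.6), ("c3", 0.4), ("c4", 0.92)]
kept = prune_by_handle(candidates, cookie_handles)
```

Here c3 is removed because it has a different handle and scores below the threshold, while c4 survives on probability alone.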

Use sub3.csv to patch the 703 holes in the latest submissions. This is fairly easy to do as well, but I suspect it isn't worth a submission unless we were doing something else at the same time. This can't be cross-validated. Better to have a deeper model.
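The patching step could look roughly like this in Python, assuming both submissions are loaded as hypothetical dicts of device ID to a space-separated cookie string, and that a hole shows up as an empty string (pass other placeholder values through `hole_values` if the files mark holes differently).

```python
def patch_submission(latest, fallback, hole_values=("",)):
    """Fill devices with no prediction in `latest` from `fallback`."""
    patched = {}
    for device, cookies in latest.items():
        if cookies.strip() in hole_values and device in fallback:
            patched[device] = fallback[device]
        else:
            patched[device] = cookies
    return patched

# Toy data (hypothetical IDs).
latest = {"id_100": "id_7 id_8", "id_101": "", "id_102": ""}
fallback = {"id_100": "id_9", "id_101": "id_55", "id_103": "id_1"}
patched = patch_submission(latest, fallback)
```

Devices that already have a prediction are never overwritten, so the patch cannot change anything that was cross-validated.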

Learn what common devices look like and have a model for them. The answers can be found easily in the training set. Perhaps we use the features and probabilities of our current first model to try and find common devices.

mlandry22 commented 9 years ago

Gotta post two at a time, of course. We have never tried looking for devices sharing the same IP. That might be worth investigating as a really basic device:device model. If that happens on an IP where the main cookies are anyway, it probably isn't useful. But if it happens on an IP that isn't shared between any cookies, that could be useful.
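A basic Python sketch of that device:device idea, assuming hypothetical dicts `device_ips` and `cookie_ips` that map each ID to the set of IPs it was seen on. It only considers IPs that no cookie appears on, and counts how many such IPs each device pair shares.

```python
from collections import defaultdict
from itertools import combinations

def device_pairs_on_cookie_free_ips(device_ips, cookie_ips):
    """Count shared cookie-free IPs for every pair of devices."""
    cookie_ip_set = {ip for ips in cookie_ips.values() for ip in ips}
    devices_by_ip = defaultdict(set)
    for device, ips in device_ips.items():
        for ip in ips:
            if ip not in cookie_ip_set:
                devices_by_ip[ip].add(device)

    pair_counts = defaultdict(int)
    for devices in devices_by_ip.values():
        for a, b in combinations(sorted(devices), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)

# Toy data (hypothetical IDs).
device_ips = {"d1": {"ip1", "ip2"}, "d2": {"ip2", "ip3"}, "d3": {"ip3"}}
cookie_ips = {"c1": {"ip3"}}
pairs = device_pairs_on_cookie_free_ips(device_ips, cookie_ips)
```

Very busy IPs would produce a lot of pairs, so in practice a cap on devices per IP is probably needed.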

SudalaiRajkumar commented 9 years ago

Okay. As usual I have created an ugly script again.!

I tried building a second-level model using some of the information we are talking about here. I have updated the "SecondLevelModel" folder in github. I used a new dev and val split, and on the new val split it is getting around 0.8508.

CarbonCycles commented 9 years ago

Wow, you guys move fast. I had to check out and switch modes for an evening..anywhere I can help?

SudalaiRajkumar commented 9 years ago

Rob, we need to work on the ones Mark has mentioned above. He can probably guide us on what exactly we can work on to help him.

I am going to travel for the next 14 hours, so I won't be making any useful contributions during this time. Apologies!

mlandry22 commented 9 years ago

0.8508, that is great. That is certainly worth posting, so I'll go ahead and see if I can recreate it. Your "ugly scripts" are fantastic, and great examples of why I need to learn python more. Not only is scikit nice (not a fan of pandas yet), but it can be really fast (compared to R) for basic procedural work often done in the data munging phase. Safe travels. I hope your family is doing well. Certainly no need to apologize! Rob and I will have it covered :-) Hopefully we're in the top 10 by the time you next get a chance to check in.

SudalaiRajkumar commented 9 years ago

Thanks Mark. Yup. They are doing good and am traveling to meet them.

I agree with you that python is comparatively faster in data munging, especially if the data is somewhat big and the preprocessing is complex (involving 'for' loops). I am not a big fan of pandas either. I think when it comes to dataframe transformations, R does a better job than pandas. I may be wrong as well.

I did not include those missing cookies and so it might add some extra value as well.

This score is on a new val sample which is smaller than our previous val sample and so am not sure how it will turn out in LB. Hopefully we will be in top ten when I start coding again tomorrow morning my time :)

CarbonCycles commented 9 years ago

Man I feel like the bastard step child running pandas. lol.

When I get home I'll jump in. Really wish I could do Kaggle here. Mark has the best job.

Sudalai I wish the best for you and your family. Many happy smiles to you!

Rob

CarbonCycles commented 9 years ago

Mark

Since I'm working in Python almost exclusively let me know what you need done.

Rob

mlandry22 commented 9 years ago

I think a nicely partitionable task is to have a brand new model to hunt down our best choices for the devices that have no cookie. An example is the CSV I posted: id_10_fixes.csv in the SRK folder. There are 703 cookies where our latest model has no cookie prediction. That file itself does have cookies populated--I reached back to older code to get them. But if you want to try and find a better method of choosing those cookies, that should help.

Option 2 would be to play around with device:device matching. About 2% of the known device handles overlap devices, so maybe you take a shot at seeing whether that is something we can learn, and if it would help. That is simpler, yet probably harder to figure out what to do with in these last 3+ days.

Unfortunately, I made a mess out of my local environment, so I'm having to restate many of the preprocessed data sets. I tried to remove R quotes and overwrote the real files with blank files and have no way to go back. So I'm trying to get on a server and have it processed there. Having the data there could make it possible to run some cutoff:200 models, perhaps.

mlandry22 commented 9 years ago

Rob, most of the world loves pandas. But I'm just not used to fighting the system to figure out how to log an integer variable that pandas treated as a series. And that is just the start of a long war between me and python, I expect. At H2O, python is front and center right now. It's mainly an R tool, and now that it has a python API, we are getting people who are very familiar with python suggesting/complaining about all sorts of things. And we realize we don't have a lot of pythonistas in house. So we're trying to get up to speed quickly. But many R people keep looking at oddities here and there and wondering what all the pandas rage is about! To me, I can take or leave scikit. It's good, no doubt. But I can replicate nearly all of that myself in R if I'm motivated. It's the very code Sudalai keeps outputting at a very fast rate that makes python the most compelling. Low-level speed if you need it. If R can match it, I've never seen people try. It always tends to be the python code that gets popular in these larger data type competitions. Never R.

CarbonCycles commented 9 years ago

Mark

I was kidding. Just being silly. It's Friday and been one hell of a week. Would love to grab a beer right now

I totally get where you are coming from w R. If I could use it here I would. Check out the packages patsy and statsmodels. They have an R-like look and feel. It's helped me make the transition a bit more palatable.

:)

SudalaiRajkumar commented 9 years ago

Thank you Rob.

We can make use of the val_predictions.csv and test_prediction.csv files present in the shared gdrive if needed.

Thanks, Sudalai

mlandry22 commented 9 years ago

Guys, I'm sorry but my laptop seems to have let us down today.

  1. First, I tried to run the new stuff on my ubuntu 16GB. Failed.
  2. Tried to recreate the files. Blew up out of memory.
  3. Tried to post the files to a 256GB server; permissions changed, couldn't FTP it there.
  4. Tried to recreate the files I destroyed. After 3 hours was still on the first step. Restarted.
  5. Tried just Sudalai's code against the old predictions snapshots. One hour in, it hasn't finished the first step.

I don't know what's wrong, but I just can't get this data processed. It's a problem. I'll take both laptops home this weekend and try everything on both until I can get things in.

CarbonCycles commented 9 years ago

Guys

Sorry I haven't started. Finally leaving work. Will jump in after dinner. Been a crazy day

Mark no worries. You've done well

Rob

mlandry22 commented 9 years ago

It just finished that last part. Needed to have switched over to this method sooner. I'm going to go through with it all and submit. We have 6 left, 2 between now and 23.7 hours from now, but still I want to lock in the gains we expect here before moving ahead. We don't have fantastic ideas waiting anyway.

mlandry22 commented 9 years ago

And....in case this happens again, I will try to have [our latest best methodology] + [our best way of resolving the 703 missing] always on standby so I can submit that. Easy way to independently test that latter step in case we get in a bind. Today, I saw the deadline approaching and had no alternatives available.

mlandry22 commented 9 years ago

Sudalai, can I ask a question about the final step. I've run dataPrep_secondModel.py, and also secondModel.py. Those ran fine. Now I'm running getMetric_secondModel.py

In the list of files it tries to get is val_develop_DV.csv. I don't see where this file comes from. I've looked throughout my directories, and also in github to see references and can't find it. Is there another step I'm missing?

The files created after the first step appear to be:
test_second_level.csv
val_second_level.csv

Those files are read in by the second step, which then produces:
dev_predictions_m2.csv
val_predictions_m2.csv
test_predictions_m2.csv

Am I missing something? It wouldn't seem like either of the first step files should be used since the val set isn't split yet.

mlandry22 commented 9 years ago

It seems if I just get cookie and device into a file of the same order as what XGBoost fit in the prior step, that would make the *** Actual values *** section work.

CarbonCycles commented 9 years ago

Mark,

Is that file on the google drive?

Rob

SudalaiRajkumar commented 9 years ago

Am sorry guys. We just need to split the val_DV.csv file into two. I thought of putting those two files into drive but in a hurry I forgot. My apologies.

SudalaiRajkumar commented 9 years ago

Okay. Here is the way to split.

  1. Check the last device ID present in dev_predictions_m2.csv.
  2. Find the row (say, 22347) corresponding to that device ID in val_DV.csv.
  3. Use the head command to create the new file val_develop_DV.csv ( head -22347 val_DV.csv > val_develop_DV.csv ).
  4. Use the tail command to create the second file, val_validate_DV.csv.

Hope this helps.
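For anyone who prefers Python to head/tail, the same split can be sketched as below. It assumes the file has already been read into a list of lines and that the device ID appears as one of the comma-separated fields; the function name is hypothetical.

```python
def split_at_device(lines, last_device_id):
    """Split lines after the last row that mentions last_device_id."""
    cut = None
    for i, line in enumerate(lines):
        if last_device_id in line.rstrip("\n").split(","):
            cut = i + 1  # keep this row in the first part, like `head -cut`
    if cut is None:
        raise ValueError("device id not found: " + last_device_id)
    return lines[:cut], lines[cut:]

# Toy data standing in for val_DV.csv (hypothetical layout).
lines = ["device_id,cookie_id\n", "d1,c1\n", "d2,c2\n", "d2,c3\n", "d3,c4\n"]
develop, validate = split_at_device(lines, "d2")
```

Like the head command, the first part keeps everything up to and including the matching row, so any header line stays only in the first file.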

SudalaiRajkumar commented 9 years ago

It is my turn now for the series of posts !

No worries Mark and Rob.:) Sorry from my side as well.

Yup Mark. We will have our standby submission ready from now on. I am still on the way :( I will put those two files in shared drive once I reach..

mlandry22 commented 9 years ago

No need for the files, for me at least, I follow what we're doing. Will do the Linux commands as you suggested. Will report a few variations of doing this and submit the best.

SudalaiRajkumar commented 9 years ago

Thanks Mark. Please let me know what shall I start working on next.

mlandry22 commented 9 years ago

To be honest, I think the best thing right now might be to make a submission, if that's possible, using whatever scored so well for you half a day ago.

I did everything and it seemed to work, but I only got a local validation of 0.808, and 0.819 on the training. So it seems I might not have all the info in there correctly. I'll inspect, but if you have a test version ready to send, go ahead and do that.

SudalaiRajkumar commented 9 years ago

Oh okay. I will check if I have the test version ready or else I will create one.

A quick check. When you ran the second model, did you get a test auc of 0.9983??

Thanks, Sudalai

mlandry22 commented 9 years ago

Yes, fairly close to that: 0.99812 and 0.997763.

mlandry22 commented 9 years ago

Some line counts, since the final manual step seemed quite unbalanced to me:

1182989 dev_predictions_m2.csv
2809824 test_predictions_m2.csv
2809824 test_second_level.csv
14047692 val_predictions_m2.csv
15230680 val_second_level.csv

 cd ../../../Data/

2382 val_develop_DV.csv
25976 val_validate_DV.csv
61157 sampleSubmission.csv
28537 val_DV.csv

mlandry22 commented 9 years ago

It looks like we're using 10x the size to validate the model as to train it, is that right? 1,182,989: dev_predictions_m2.csv vs 14,047,692: val_predictions_m2.csv

...Though that wouldn't seem to be a problem since the AUC is so good.

mlandry22 commented 9 years ago

I'm trying a second variation. I had used the predictions with the -1's already removed. This time I'm using the exact files I had downloaded on Google Drive. Thank goodness for CV, so we didn't burn a submission on that.

Like last time, it'll probably take a while before I get the first prep stage done. Will post when I get an answer out. But if you have anything that scores in the range you had earlier, please do post. If not, don't worry, I'll work through this and ensure I can get the same thing.

SudalaiRajkumar commented 9 years ago

oh okays. I will try and make a submission then.

BTW the fourth place guy shared some trick here, https://www.kaggle.com/c/icdm-2015-drawbridge-cross-device-connections/forums/t/15877/legal-or-not/90041#post90041

I am yet to read it fully though. Thought of sharing with you guys first.

SudalaiRajkumar commented 9 years ago

Okay here are my counts...

  1182989 dev_predictions_m2.csv
   300001 val_predictions_m2.csv
  1482989 val_second_level.csv
  3114717 test_predictions_m2.csv
  3114717 test_second_level.csv
  1482989 val_predictions.csv
  1482989 val_second_level.csv

I think this is the reason for the difference. I think I have a smaller validation set... Did you use a higher value (30 or 200) for cookie cutoff?

SudalaiRajkumar commented 9 years ago

We already have my val_predictions.7z and dev_predictions.7z in the shared google drive.

Now I have added a new folder in google drive named "Raj_SecondLevelModel". It has

  1. val_develop_DV.7z
  2. val_Validate_DV.7z
  3. sub21.7z - new submission file

I have also modified "getMetric_secondModel.py" file. Now am seeing a val sample F0.5 of 0.8514. Can you please check sub21.csv with sub20.csv (top lines) and confirm whether both are fairly similar.? If so we can go ahead and post this submission and see the LB score. I know I can download from Kaggle but am sorry that my internet connection is bad here and so it might take some time to download.

EDIT: I have also added dev_predictions_m2.7z and val_predictions_m2.7z to the same folder. You guys can run the 'getMetric_secondModel.py' script directly using these two files now :) I am prone to mistakes, especially at the last moment. Since we have very few submissions left, I just want to be doubly sure before submitting it, and so am bothering you people. Sorry guys!

EDIT 2: I have compared the top few lines of the new submission (sub21.csv) with our previous best (sub19.csv). I think some of the cookies are removed in the new submission and I guess that is the reason for the improvement in score.

mlandry22 commented 9 years ago

I got it figured out. Looking pretty good now.

Getting evaluation in dev sample..
Mean of  22648.0  is :  0.856927211263
Getting evaluation in val sample..
Mean of  5888.0  is :  0.850832564421

SudalaiRajkumar commented 9 years ago

Great Mark. If you think this makes sense, please go ahead, submit and let us know the results. Maybe we could add some placeholders from sub3.csv for the missing cookies in this submission as well.

mlandry22 commented 9 years ago

For some reason our test files are a little different. On mine for the tenth line, I get:
id_1000497,id_1127125
And the one you posted to Google Drive has:
id_1000497,id_1127125 id_1869275 id_2204003

I looked those three cookies up and they all have handles, which is the only change my files had made...I think. But with the differences I've had, I'll try yours as is.

I will try and add the 703, preferably the version with the handles crosswalked.

SudalaiRajkumar commented 9 years ago

I made minor changes to the "getMetric_secondModel.py" code and that may be the reason. My latest code is now updated in the github. Fingers crossed. Thank you.

mlandry22 commented 9 years ago

We got some pretty good improvement. However, we're in a pocket where that doesn't get us many places. But now we are really really close to that entire 0.85 pack:

[image: leaderboard screenshot]

SudalaiRajkumar commented 9 years ago

This is really great Mark :) We are also in 0.85 range now. Thank you.

If we reach 0.86 now, we will be in 4th place. Let us try for that. Some initial ideas:

  1. Increasing the cutoff from 30 to some higher value and rerunning the models. This will help find out some more missing cookies
  2. A good device-device model

mlandry22 commented 9 years ago

Reading the post by the 4th place person, a lot of it is known to us.

1: he agrees that all the property and category stuff has little value. Some, perhaps. But I don't think that we'll come up with much to be able to use that. He suggests a TF/IDF style weighting of rarely visited sites in common. Makes sense. But you need visibility into the entire data set to calculate IDF. Difficult with my issues trying to get this data onto a server today.
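To make the IDF point concrete, here is a small Python sketch of weighting shared sites by rarity. It is only my guess at what "TF/IDF style" means here: `visits` is a hypothetical mapping from every device or cookie ID to the set of sites it touched, and the document frequency has to be counted over the whole data set.

```python
import math
from collections import Counter

def idf_weights(visits):
    """visits: id -> set of sites. Returns site -> log(N / document frequency)."""
    n = len(visits)
    doc_freq = Counter(site for sites in visits.values() for site in sites)
    return {site: math.log(n / count) for site, count in doc_freq.items()}

def overlap_score(sites_a, sites_b, idf):
    """Sum the rarity weights of the sites two IDs have in common."""
    return sum(idf.get(site, 0.0) for site in sites_a & sites_b)

# Toy data (hypothetical IDs and sites).
visits = {
    "d1": {"s1", "s2"},
    "c1": {"s1", "s2"},
    "c2": {"s1"},
    "c3": {"s1", "s3"},
}
idf = idf_weights(visits)
score_c1 = overlap_score(visits["d1"], visits["c1"], idf)
score_c2 = overlap_score(visits["d1"], visits["c2"], idf)
```

A site everyone visits (s1) gets weight zero, so d1 and c2 score nothing despite the overlap, while the rarer s2 makes d1 and c1 stand out.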

2:

What I think might be interesting is this:

You can divide the devices into many categories, I have divided them into 6 categories, 
which I can say how many devices I have in each category and what is my prediction 
accuracy in that category. try to categorize them and work on the devices with low 
accuracy rate.

He's not giving anything away, but he is saying that some fairly basic properties of the devices are useful. Unless he thinks of "cookies per device" as a basic feature, perhaps we are overlooking something. I haven't seen any features look too strong about the basic info. Though our strongest feature is related to anonymous C3, that's just the one that best captures the notion of rare overlap, I believe. Perhaps to find it, we can work backwards. Calculate our F0.5 score for everything and then see how it changes across the elements of the devices. That's how he is saying he figures out what to work on. Similar style as when I was putting out some tables about how well we do when we have 1 match, 2 matches...100 matches, etc.

I have a feeling he treats fairly complex things like simple things. This is probably good for us, in case people try to implement simple things, not realizing what he's talking about. Hopefully we don't find one of those superstar teams passing us out because of an idea in that post. I don't think it gives away too much, but you never know.

Sudalai, I agree. It would have been nice to have gotten the data onto an H2O server. As is, I will be in the office tomorrow, so maybe I can try again. Or rent a cheap AWS server. The device:device model seems interesting (more fun to me). Possibly we will stumble on what the 4th place person was doing in the process.

Also, we can try to analyze where we are missing out. With our validation F-0.5 up from 0.75 to 0.85 since the last time I checked, the errors may be more obvious now, and we can find them that way.
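A Python sketch of that error breakdown, assuming the score is the mean per-device F0.5 and that each device carries some hypothetical category label (device type, number of candidate cookies, and so on). Function names and the toy rows are illustrative.

```python
from collections import defaultdict

def f_beta(predicted, actual, beta=0.5):
    """F-beta between predicted and actual cookie sets for one device."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def mean_score_by_category(rows):
    """rows: (category, predicted_cookies, actual_cookies) per device.

    Returns category -> (mean F0.5, number of devices)."""
    totals = defaultdict(lambda: [0.0, 0])
    for category, predicted, actual in rows:
        totals[category][0] += f_beta(predicted, actual)
        totals[category][1] += 1
    return {cat: (total / n, n) for cat, (total, n) in totals.items()}

# Toy data (hypothetical categories and IDs).
rows = [
    ("phone", {"c1", "c2"}, {"c1"}),
    ("phone", {"c3"}, {"c3"}),
    ("tablet", {"c4"}, {"c5"}),
]
by_category = mean_score_by_category(rows)
```

Sorting the result by mean score, weighted by device count, points at the categories where extra work should pay off most.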

mlandry22 commented 9 years ago

I'll post this in my snippets section, but here is some R code for anybody to merge a patch file, like the 703 id_10s, merged with a full submission file. Right now I don't think using this is helping much, but it shouldn't hurt, and I intend to try to look into these and see if I can find any alternatives. So at that point, using basic merging code will ensure anybody can submit our best. Use python if you want, but you should be able to copy this, use correct file names, and it will work.

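library(data.table)
a<-fread("id_10_fixes.csv")  ## patch file for the devices with no cookie prediction
b<-fread("../../../Data/cookie_all_basic.csv")
## attach each patch cookie's drawbridge handle (first two columns of b)
a<-merge(a,b[,1:2,with=FALSE],by="cookie_id",all.x=TRUE)
a$drawbridge_handle[a$drawbridge_handle=="-1"]<-"-999"  ## to avoid matching all the -1s in cookies
## expand to every cookie that shares the handle
a<-merge(a[,c(2,3),with=FALSE],b[,1:2,with=FALSE],by="drawbridge_handle",allow.cartesian=TRUE)
## one row per device, cookies space-separated
a <- a[, .(cookie_id=paste(cookie_id, collapse=' ')), device_id]
srk<-fread("SRKsub21.csv")
s2<-merge(srk,a,by="device_id",all.x=TRUE)
## use the patch where one exists, otherwise keep the original prediction
s2[,cookie_id:=ifelse(is.na(cookie_id.y),cookie_id.x,cookie_id.y)]
write.csv(s2[,.(device_id,cookie_id)],"sub21b.csv",row.names=FALSE,quote=FALSE)

SudalaiRajkumar commented 9 years ago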

Agreed with your points and thanks for the code snippet.!

Probably if memory and space are issues we could try a version where the cut off is higher than 30 but not as high as 200, say 75 or 100 and then we could give a try.

SudalaiRajkumar commented 9 years ago

Here are my updates. Not much improvement though!

  1. I tried bagging of 4 xgboost base models compared to our single base xgboost model. It improved the val sample AUC from 0.9945 to 0.9947.
  2. Then built the second level model which improved the new val sample AUC from 0.9983 to 0.9984.
  3. Then finally "getMetric_secondModel.py" code improved the score by a tiny bit to 0.8527 which is not significant though.
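For reference, the bagging step can be sketched in plain Python as below. This is a generic illustration, not the code in SRK/Bagging_22Aug: it assumes each base model is trained on a bootstrap sample of the rows and that the per-model scores are combined by a simple average. The training itself is left out, and all names are hypothetical.

```python
import random

def bootstrap_sample(rows, seed):
    """Sample len(rows) rows with replacement, one sample per base model."""
    rng = random.Random(seed)
    return rng.choices(rows, k=len(rows))

def average_predictions(per_model_predictions):
    """Average row-aligned prediction lists from several models."""
    n_models = len(per_model_predictions)
    return [sum(scores) / n_models for scores in zip(*per_model_predictions)]

# Toy data: scores from 4 hypothetical base models for the same 2 rows.
per_model = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.85, 0.15],
    [0.95, 0.05],
]
bagged = average_predictions(per_model)
sample = bootstrap_sample(list(range(10)), seed=0)
```

Averaging mostly reduces variance, which fits the small AUC gains reported above.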

I have placed the codes in github (SRK/Bagging_22Aug/) and the output csv files in google drive (SRK_22Aug_Bagging). In case if we plan to use the new ones, we don't have to run the first two steps mentioned above. We could just get the output csv files from the google drive and just run the third code. We will get a minor improvement.

Though we did not get much of an improvement, a small improvement might also help towards the end and so placed it in shared folder and github.

mlandry22 commented 9 years ago

Sounds good. If nothing else, it will help to have a second submission on the same level, for our final choices. I think this time our submission choices will be easy: just choose the best two. So having more diversity will be good. If we don't have anything else by the deadline, I'll get that submitted. Thanks!

SudalaiRajkumar commented 9 years ago

Thank you Mark.

CarbonCycles commented 9 years ago

Hey gents,

Saw the flurry of activity, and I wanted to give a little update. Been trying to use different clustering approaches to see if we can get any additional hints...nothing too definitive. I've spent the morning reworking the data so that I can try PCA on it...Mark, I know you aren't a fan. I'll let you guys know if anything shakes loose.

Rob
