mlandry22 / rain-part2

How Much Did it Rain Pt 2 - Kaggle Competition

Our New Team! #2

Open mlandry22 opened 8 years ago

mlandry22 commented 8 years ago

Welcome John and Thakur! Let's hope the three of us can do some fun stuff with this one. I wouldn't be surprised if some small ideas plus a lot of blending go a long way. Let's hope between the three of us we can find those small ideas and hopefully somebody can be working on this while the others are busy with other things (that's how these things go, usually).

Years ago, I told Thakur I really wanted to get him his master's status. We were so, so close in the credit one, just losing out in the last couple of days of what turned into a sprint to the finish. Maybe this will be a top 10 finish for us.

There's no right or wrong way to organize, but I figure we can use this thread for any team-related stuff.

For me, the main purpose here is to get robust methods that make it easy to do competitions in a scalable way. Having finished fifth in Rain Part 1, I'd like to do well here. But they've simplified the problem so that the parts I did best on aren't as big a deal this time. We'll see. Again, hopefully we will all provide a little value in our own ways.

mlandry22 commented 8 years ago

Oh, how about a team name? Simplest is the three parts: H2O.ai + JohnM + DataGeek or H2O.ai, Ctrl+Alt+Del & DataGeek

It's complicated for me, but I usually want H2O.ai in there, particularly since I think we can do well. Now you might be thinking "Mark, you're using R for this, not H2O". Well, three answers to that: one, I was still using H2O because it was faster to run things; two, I am trying to find a transformation that lets me use MSE for this (which is why I can't use H2O directly); three, Arno is looking into MAE for H2O, though it surely won't be ready before H2O World (~2 weeks). More than you wanted to read about a silly marketing thing. But I think it's defensible here.
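To illustrate why the MSE vs. MAE distinction matters, here is a minimal Python sketch. It assumes made-up, rainfall-like targets with one extreme value; the numbers and helper names are hypothetical and not from the competition data. The constant that minimizes MSE is the mean, while the constant that minimizes MAE is the median, so a model trained on MSE gets pulled toward outliers. The `log1p` step is just one common transformation and only an assumption here; the thread does not say which transformation was being tried.

```python
from math import expm1, log1p
from statistics import mean, median


def mse(values, c):
    """Mean squared error of predicting the constant c for every value."""
    return sum((v - c) ** 2 for v in values) / len(values)


def mae(values, c):
    """Mean absolute error of predicting the constant c for every value."""
    return sum(abs(v - c) for v in values) / len(values)


# Hypothetical targets: mostly small readings plus one extreme outlier.
targets = [0.3, 0.5, 1.0, 1.3, 2.0, 2.5, 300.0]

best_for_mse = mean(targets)    # the mean minimizes MSE
best_for_mae = median(targets)  # the median minimizes MAE

# Assumed transformation: fit the mean in log1p space, then map it back.
# This is what an MSE-trained model would aim for on transformed targets.
back_transformed = expm1(mean(log1p(v) for v in targets))

for name, c in [("mean", best_for_mse),
                ("median", best_for_mae),
                ("log1p mean, back-transformed", back_transformed)]:
    print(f"{name}: prediction={c:.3f}  MAE={mae(targets, c):.3f}  MSE={mse(targets, c):.3f}")
```

On data like this, the back-transformed prediction sits much closer to the median than the raw mean does, which is the general idea behind transforming the target before training on MSE.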

JohnM-TX commented 8 years ago

Hi Thakur! I'm OK with H2O.ai, Ctrl+Alt+Del & DataGeek though it sounds like some sort of geeked out law firm... :bowtie:

mlandry22 commented 8 years ago

Cool. That's what we have for now, and I'll change again if we want.

[Screenshot: 2015-10-28, 12:13:49 PM]

Now, let's bump that rank up!

mlandry22 commented 8 years ago

H2O World is over, so I'll be back into working on this. In fact, I started work on the AutoML piece of H2O today, and I'm using the three Kaggle sets as my initial test data, so hopefully that will help out soon.

JohnM-TX commented 8 years ago

Sounds good, Mark. I took a break as well and am ready to reengage.

(In reply to Mark Landry's comment of Fri, Nov 13, 2015 at 6:32 AM.)

JohnM-TX commented 8 years ago

Hi guys - anything new? I'm still trying to find those outliers. Even when I reliably find some in the validation set, it's not translating to the test set.

ThakurRajAnand commented 8 years ago

Hi John ... I was a bit busy moving from Gurgaon to Hyderabad. I will start tomorrow after office hours. Your approach looks like the right way to improve the score. I will try it, and I also suggest the following to find outliers.

Run t-SNE on the train + test features, use something like k-means to group the reduced data into clusters, then investigate each cluster to see which ones capture outliers cleanly. I think running t-SNE jointly will help us find groups of test rows that we can call outliers with a lot of confidence. The binary model might be producing a mix of correct and wrong classifications, which is basically keeping us from improving on the LB [ just a thought ... no guarantee :) ]
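The "investigate each cluster" step could be sketched roughly as below. This is only an illustration in plain Python: it assumes the joint t-SNE embedding and the k-means cluster labels were already computed elsewhere (for example in R), and the function names, thresholds, and toy data are all hypothetical.

```python
from collections import defaultdict


def train_outlier_stats(cluster_ids, is_train, is_outlier):
    """Per cluster, count labeled (train) rows and how many are outliers."""
    stats = defaultdict(lambda: [0, 0])  # cluster id -> [n_train, n_outliers]
    for cid, train, outlier in zip(cluster_ids, is_train, is_outlier):
        if train:  # test rows have no label, so they are skipped here
            stats[cid][0] += 1
            stats[cid][1] += int(bool(outlier))
    return {cid: tuple(counts) for cid, counts in stats.items()}


def confident_test_outliers(cluster_ids, is_train, is_outlier,
                            min_rate=0.9, min_train_rows=2):
    """Indices of test rows in clusters whose train rows are almost all outliers."""
    stats = train_outlier_stats(cluster_ids, is_train, is_outlier)
    clean = {cid for cid, (n, n_out) in stats.items()
             if n >= min_train_rows and n_out / n >= min_rate}
    return [i for i, (cid, train) in enumerate(zip(cluster_ids, is_train))
            if not train and cid in clean]


# Toy example: 12 rows from train + test, clustered jointly into 3 groups.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2, 0, 1, 2]
is_train = [True, True, True, True, True, True, False,
            True, False, False, False, True]
# Outlier labels are only known for train rows; None marks test rows.
is_outlier = [False, False, True, True, True, True, None,
              False, None, None, None, False]

stats = train_outlier_stats(cluster_ids, is_train, is_outlier)
flagged = confident_test_outliers(cluster_ids, is_train, is_outlier)
print(stats)
print(flagged)
```

Only cluster 1 has train rows that are all outliers, so only the test rows that landed in that cluster get flagged. Clusters with mixed labels are left alone, which is the point of looking for clusters that capture outliers cleanly.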

JohnM-TX commented 8 years ago

Thanks - I will try this. I haven't used t-SNE before, and from what I've read about it, it sounds great. Definitely seems better than manually sorting through dot plots.

ThakurRajAnand commented 8 years ago

Please use the Rtsne package in R instead of the tsne package. Rtsne is much faster than tsne. Install Rtsne from GitHub instead of from CRAN.

JohnM-TX commented 8 years ago

A general question - do you usually install R packages from GitHub instead of CRAN, or does it depend on the package?

ThakurRajAnand commented 8 years ago

Mostly I install from CRAN, but some packages add really good functionality from time to time, e.g. slidify, rCharts, data.table, H2O. I install such packages from GitHub or from the package source if I want to use a recently added function.