mlandry22 / rain-part2

How Much Did it Rain Pt 2 - Kaggle Competition

Logistic Regression #9

Open mlandry22 opened 8 years ago

mlandry22 commented 8 years ago

I've been spending a lot of time over the last couple of days trying to get the right transformation function. And it occurs to me that what I really want is bins, and then the method I want is multinomial classification. I might have pointed this out before, but this is the inspiration: http://fastml.com/regression-as-classification/

That is still one of my favorite competitions, and it also used absolute error, so it might be interesting. On that data set there were a lot of duplicate values, which could be an important reason this might not work here. But looking at how jagged our output values are (i.e., specific values occur often), this data might be favorable to the method, and that should help keep rigid bins reasonable. Additionally, I realize this makes it reasonable to solve with H2O, so I'm going to give it a try.


mlandry22 commented 8 years ago

A few ways I was playing around with to cut up Expected based on its most prevalent values:

library(data.table)

train<-fread("input/train.csv")

# Roll up to one row per Id: mean Expected, number of readings, and count of missing Ref
rollup<-train[,.(Expected=mean(Expected),.N,naCount=sum(ifelse(is.na(Ref),1,0))),Id]

# Set aside Ids whose readings are all missing Ref; keep the rest
remove<-rollup[N==naCount,]
keep<-rollup[N>naCount,]

# Frequency of Expected at three rounding levels: exact, 0.1mm, 1mm
common0<-keep[,.N,Expected]
common1<-keep[,.N,.(Expected=round(Expected,1))]
common2<-keep[,.N,.(Expected=round(Expected))]

# Top 50 most common values at each rounding level, sorted by value
vals0<-common0[order(-N),][1:50,][order(Expected),]
vals1<-common1[order(-N),][1:50,][order(Expected),]
vals2<-common2[order(-N),][1:50,][order(Expected),]

# Bin Expected using those common values as break points
cuts0<-cut(keep[,Expected],breaks=c(0,vals0[,Expected],Inf),labels=round(c(0,vals0[,Expected]),2))
cuts1<-cut(keep[,Expected],breaks=c(-1,vals1[,Expected],Inf),labels=c(-1,vals1[,Expected]))
cuts2<-cut(keep[,Expected],breaks=c(-1,vals2[,Expected],Inf),labels=c(-1,vals2[,Expected]))

# Coarser version: top 20 values at 0.1mm rounding, labels prefixed with "x"
vals3<-common1[order(-N),][1:20,][order(Expected),]
cuts3<-cut(keep[,Expected],breaks=c(-1,vals3[,Expected],Inf),labels=paste0("x",c(-1,vals3[,Expected])))
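
As a quick sanity check before modeling, the bucket populations can be tabulated; a minimal sketch, assuming `keep` and `cuts3` from the code above:

keep[,bucket:=cuts3]                      # attach the coarse 20-bucket label to the per-Id table
print(keep[,.N,bucket][order(-N)])        # bucket sizes, largest first
print(keep[,.N,bucket][,max(N)/sum(N)])   # share of rows sitting in the modal bucket
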
JohnM-TX commented 8 years ago

This sounds interesting. If I follow (which may not be the case), then it is similar to what I've been reading about precip estimates. Depending on the level of precipitation, the best way to estimate can be quite different: for light precip you might use Ref, for medium precip you might use a different function with Kdp, and so on (realizing this is an example and may not be the real functions). Anyway, one of the things I haven't tried yet is breaking the data into two or more sets and modeling each separately, but I think it's promising.

mlandry22 commented 8 years ago

You're right, technically they would get different models. The first version literally had different models to estimate the chance of Light, Medium, Heavy, and all the 1mm values between them, and they definitely took different variables into account. This is a similar idea. But I was about to set it up slightly differently, and I'm glad you made that comment, because I should try both ways before seeing this through. The difference is that in Rain 1, each mm level had a binary classification model to estimate its probability independently, and that was directly associated with the loss metric, so you used each of those probabilities directly. This one I am setting up as multinomial classification rather than binary, so the model is going to try to learn probabilities for each specific bucket compared against the others. Launching the first one... now. Will try to get quick feedback for us so we know whether it will be worth adding to the overall last-week strategy or not.
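
A minimal sketch of what this multinomial setup might look like in H2O's R API; `model_data` and the `bucket` column are placeholders for the per-Id feature table with the cuts3 label attached, not the actual pipeline:

library(h2o)
h2o.init()

# model_data: per-Id features plus the cuts3 label as a column named "bucket" (placeholder names)
train_h2o<-as.h2o(model_data)
train_h2o$bucket<-as.factor(train_h2o$bucket)
features<-setdiff(names(train_h2o),c("Id","Expected","bucket"))

# Multinomial GBM: one set of class probabilities per bucket
gbm_fit<-h2o.gbm(x=features,y="bucket",training_frame=train_h2o,
                 distribution="multinomial",ntrees=200,learn_rate=0.05)

probs<-h2o.predict(gbm_fit,train_h2o)   # predicted class plus per-bucket probabilities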

mlandry22 commented 8 years ago

It's early, but it is interesting to see how the GBM is solving the problem. First, it seems my way of setting up buckets missed the mark, as some of them are not populated enough. However, I'm not too concerned yet, as 20 is a bit high for the first pass.

Aside from x14, this is in sorted order. You can read each bucket label as the minimum of that bucket, so x0 contains the readings between 0.0mm and 0.1mm, and x4.3 contains the readings between 4.3mm and 14mm.

So what it appears to be doing is guessing the mode (x0.1) as the default and then finding ways to guess the second most popular bucket, which is x4.3. Error in all other buckets is 99% or so. Truthfully, I'm far more interested in the probabilities, in hopes that a mini-stacking step can figure out the best absolute-error guess, given the suite of 20 probabilities.

This was after only 33 trees, and error is still well within the steep-descent part of the curve, so there is plenty of room to go. Validation error is still nearly identical to training error (shown).

image
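
One way the suite of bucket probabilities could be turned into an absolute-error guess is to take the weighted median of the bucket representative values, since the median of the predictive distribution minimizes expected absolute error. A rough sketch, with `probs` and `bucket_values` as placeholders for the model's probability columns and each bucket's representative rainfall value:

# probs: one row per Id, one probability column per bucket
# bucket_values: numeric value each bucket stands for (e.g. its minimum or midpoint)
weighted_median<-function(p,values){
  ord<-order(values)
  cum<-cumsum(p[ord])
  values[ord][which(cum>=0.5)[1]]   # first value where cumulative probability reaches 0.5
}
preds<-apply(as.matrix(probs),1,weighted_median,values=bucket_values)
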

JohnM-TX commented 8 years ago

Don't know if domain knowledge is useful for you, but I found these articles informative:
http://www.nwas.org/jom/articles/2013/2013-JOM19/2013-JOM19.pdf
http://www.nwas.org/jom/articles/2013/2013-JOM20/2013-JOM20.pdf
http://www.nwas.org/jom/articles/2013/2013-JOM21/2013-JOM21.pdf

There is a chart in the first article showing viable ranges of variables: image