stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala
Other
391 stars 50 forks source link

replace Evaluator with Error #18

Closed avibryant closed 9 years ago

avibryant commented 9 years ago

This address #17 by removing Evaluator and replacing it with a minimization of the training error. The default is now to minimize brier score rather than maximizing chi^2.

This required adapting BrierScore to work when the actual distribution was more than just an indicator. @danielhfrank, you might want to review that part?

@snoble you should be able to play min-weight games by having the error go to infinity if the actual distribution's weight is too low, so I don't think we lose any expressivity that we care about.

There's no longer an obvious hook for CHAID-style split-merging, but we can cross that bridge when we come to it (having that be part of Evaluator was always suspicious anyway).

Also cc @brahn who this might be directly relevant to right now.

snoble commented 9 years ago

Your CHAID-splitting argument intrigues. StrawMan: have a SplitOptimizer component (which has an API like evaluator), and MinimizeError is the default SplitOptimizer.

avibryant commented 9 years ago

@snoble I've added an explicit trainingError method to Error which returns a Try[E] instead of just E, so that it can be overridden to return Failure[E] if there's an "unacceptable error" condition, ie weight < min. Splits with any failures in the trainingError will not be considered.

snoble commented 9 years ago

Seems appropriate that an Error class would have error handling.

snoble commented 9 years ago

lgtm from my perspective

avibryant commented 9 years ago

(Before I merge this I'd like to do a benchmark on some fairly complex training set)

avibryant commented 9 years ago

Benchmarking result: inconclusive. I'd feel better overall if we could implement chi^2 as an Error so that merging this didn't have to mean changing the default behavior.

snoble commented 9 years ago

It's not totally obvious to me chi^2, as it was previously implemented, can be implemented this way. If you use the E from the leaf instead of the parent you'll just end up with always 0 error. The leaf component of LLR will work though.