stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala
Other
391 stars 50 forks source link

Introduce a Double splitter based on TDigest's cumulative density estimates #68

Closed NathanHowell closed 8 years ago

NathanHowell commented 9 years ago

This pull request adds a splitter that relies on TDigest to provide target distribution estimates at fixed quantile intervals. It currently has no tests, I just wanted to get this up to get a review started.

avibryant commented 9 years ago

Hm so S should just be Map[L,TDigest] (for a Splitter[Double, Map[L,Long]]).

What's going on here is that a T is meant to be a distribution of the label L (in this case, by simply counting how many instances there are of each value of L). And the splitter's S needs to be a joint distribution of Double and L - so in principle you want to be counting co-occurences of a given Double value and a given L value. But the TDigest is already a distribution of Double, so what you want is the Map[L,TDigest].

What you have probably works because the inputs are Map[L,Long] where the Long is always 1L (because for any given single instance, that's what the label "distribution" looks like). And so that just becomes your somewhat unwieldy representation of a given L value. But it wouldn't work if you provided training data with different weights for some instances, and the semantics are all wonky.

NathanHowell commented 9 years ago

Took a stab at defining type S = Map[L, TDigest], which also seems to work for some simple tests. Is this going in the right direction?

NathanHowell commented 9 years ago

Fixed up the type variable names to be L instead of T and removed some other dumb stuff. Think this probably covers your feedback now.

avibryant commented 8 years ago

Sorry, missed the further commits until just now. LGTM - thanks!