stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala

Add an Encoder to the pipeline #75

Open avibryant opened 8 years ago

avibryant commented 8 years ago

It would be helpful to have a concept of an Encoder[K,U,V] which can transform Map[K,U] to the Map[K,V] used by a given set of trees, and which can be serialized along with the trees.

We also want (in some cases) to be able to "train" these encoders from a training set - for example to figure out which numeric features are continuous vs. discrete, or even to do dimensionality reduction etc.
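As a rough sketch of that shape (hypothetical names and signatures, not an existing brushfire API):

```scala
// Hypothetical sketch of the proposed Encoder concept.
// K = feature name, U = raw feature value, V = the value type the trees use.
trait Encoder[K, U, V] extends Serializable {
  // Transform a raw instance into the representation the trees were built against.
  def encode(raw: Map[K, U]): Map[K, V]
}

// A trainer that builds an Encoder from a sample of the training set,
// e.g. deciding which numeric features to treat as continuous vs. discrete.
trait EncoderTrainer[K, U, V] {
  def train(sample: Iterable[Map[K, U]]): Encoder[K, U, V]
}
```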

avibryant commented 8 years ago

Here's a much more radical design. @tixxit, @NathanHowell, curious what you think.

Assume that Instance just has a single type for its feature vector, V. An Encoder[V] produces a bitset (or Vector[Boolean]) from each V. We assume that Encoder is generally composed of many other Encoders. Each bit represents something that would appear as a predicate in the tree, like age > 20.

All the games with QTree and so on for determining good split points happen in the stage that trains the Encoder, which in many cases can probably be done once at the root. Then the expansion becomes much simpler - we don't have fancy Splitters or anything like that; we just accumulate a separate (T,T) for each bit in the encoding, and pick the best one based on the error. In fact, because we have the parent T and T is almost always a Group, we can just accumulate the T for the instances where the bit is set and derive the other half by subtracting from the parent.
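Concretely, the per-bit accumulation could look something like this (a sketch only, assuming T has an Algebird Group instance; bestSplit and error are made-up names):

```scala
import com.twitter.algebird.Group

// Sketch of the simplified expansion: one (left, right) pair of Ts per bit.
// Each instance carries its encoded bitset and its target summary T.
def bestSplit[T](
    instances: Iterable[(Vector[Boolean], T)],
    parent: T,
    numBits: Int,
    error: (T, T) => Double
)(implicit group: Group[T]): Int = {
  // For each bit, accumulate the combined T of the instances that have that bit set.
  val withBit: Vector[T] =
    instances.foldLeft(Vector.fill(numBits)(group.zero)) { case (acc, (bits, t)) =>
      acc.zipWithIndex.map { case (sum, i) => if (bits(i)) group.plus(sum, t) else sum }
    }
  // The other half of each candidate split is parent minus withBit(i);
  // pick the bit whose (left, right) pair has the lowest error.
  (0 until numBits).minBy { i => error(withBit(i), group.minus(parent, withBit(i))) }
}
```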

(In a world with metadata, we also want (M, M) for each bit. This is less likely to be a Group, although it seems pretty plausible even then that with one side's M and the parent's M you could make a reasonable determination as to whether this was an allowable split.)

If it's unsatisfying to determine all of the possible splits up front at the root, you can always add in a re-encoding phase where, given an established tree, you train encoders for each of the leaves, and then union the resulting encodings to get a superset of all of them.
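A sketch of what that re-encoding phase might amount to (hypothetical names; the "encoding" here is just the set of candidate predicates a trained encoder can emit):

```scala
// Hypothetical: train an encoder per leaf on that leaf's instances, then union the
// candidate predicates each one produces into a single superset encoding.
def reencode[Inst, P](
    instancesByLeaf: Map[Int, Iterable[Inst]],
    trainPredicates: Iterable[Inst] => Set[P]
): Set[P] =
  instancesByLeaf.values.foldLeft(Set.empty[P])(_ ++ trainPredicates(_))
```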

Some advantages:

Cons:

Does this all make sense?

avibryant commented 8 years ago

... apologies for thinking out loud, but:

The above - encoding to a bitset - assumes that there's a bounded, finite number of predicates we might want in the tree. But that's more limiting than we want, and more limiting than necessary. As a simple example, imagine a text feature, where we want the predicates to be of the form contains_token("foo"). There's an unbounded number of possible predicates of that form.

The properties we actually want from an encoder are:

I'm not sure yet how that translates to an implementation.
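One way to picture it, though (purely illustrative; this just writes the contains_token predicate from above out as a key):

```scala
// Illustrative only: instead of a fixed-width bitset, an encoder could emit the
// (potentially unbounded) set of predicate keys that hold for a given value.
// For a text feature, every distinct token induces a contains_token(...) predicate.
def encodeText(text: String): Set[String] =
  text.split("\\s+").filter(_.nonEmpty).map(tok => s"contains_token($tok)").toSet

// encodeText("the quick brown fox") ==
//   Set("contains_token(the)", "contains_token(quick)",
//       "contains_token(brown)", "contains_token(fox)")
```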

avibryant commented 8 years ago

Ok, so in concrete terms, maybe something like:
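Something along these lines, perhaps (a hedged sketch of the idea, not a settled signature):

```scala
// Rough sketch: an encoder turns a feature value V into the set of predicate
// keys U that are true for it.
trait Encoder[V, U] extends Serializable {
  def encode(v: V): Set[U]
}
```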

The U just needs to be something with equality/hashing/etc. that you can serialize into your model. I can almost see an argument for forcing it to be String (for comprehensibility of the serialized models) or to be Long (for efficiency of tree representation - anything that doesn't map nicely to a Long you just hash). But that's a pretty unimportant side note.

In this scheme you would probably specify all the available Encoder instances up front; the "training" phase (including re-training with a larger tree later) would just increase the available set of U values for the Encoder (in the cases where that made sense). So for example, for a continuous numeric feature, you'd imagine the encoder internally keeping a List[Double] of split points, which it would use to produce the Set[U] of all split points where the V's value was < the split point; it would choose this List empirically based on a QTree or similar over the training set.
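A minimal sketch of that numeric case (hypothetical names; in practice the split points would come from a QTree or similar over the training data):

```scala
// Hypothetical continuous-numeric encoder: it holds a List[Double] of candidate
// split points (chosen empirically from the training set) and encodes a value as
// the set of predicates "v < splitPoint" that hold for it.
case class NumericEncoder(splitPoints: List[Double]) {
  def encode(v: Double): Set[String] =
    splitPoints.filter(v < _).map(s => s"lt($s)").toSet

  // "Training" (or re-training against a larger tree) just grows the set of
  // available split points; previously emitted predicates remain valid.
  def retrain(moreSplitPoints: List[Double]): NumericEncoder =
    NumericEncoder((splitPoints ++ moreSplitPoints).distinct.sorted)
}

// e.g. NumericEncoder(List(10.0, 20.0, 30.0)).encode(15.0) == Set("lt(20.0)", "lt(30.0)")
```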