stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala

Add an Encoder to the pipeline #75

Open avibryant opened 8 years ago

avibryant commented 8 years ago

It would be helpful to have a concept of an Encoder[K,U,V] which can transform Map[K,U] to the Map[K,V] used by a given set of trees, and which can be serialized along with the trees.

We also want (in some cases) to be able to "train" these encoders from a training set - for example to figure out which numeric features are continuous vs. discrete, or even to do dimensionality reduction etc.
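As a rough sketch of that shape (hypothetical names and signatures, not an existing brushfire API):

```scala
// Hypothetical sketch of the proposed Encoder concept.
// K = feature name, U = raw feature value, V = the value type the trees use.
trait Encoder[K, U, V] extends Serializable {
  // Transform a raw instance into the representation the trees were built against.
  def encode(raw: Map[K, U]): Map[K, V]
}

// A trainer that builds an Encoder from a sample of the training set,
// e.g. deciding which numeric features to treat as continuous vs. discrete.
trait EncoderTrainer[K, U, V] {
  def train(sample: Iterable[Map[K, U]]): Encoder[K, U, V]
}
```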

avibryant commented 8 years ago

Here's a much more radical design. @tixxit, @NathanHowell, curious what you think.

Assume that Instance just has a single type for its feature vector, V. An Encoder[V] produces a bitset (or Vector[Boolean]) from each V. We assume that Encoder is generally composed of many other Encoders. Each bit represents something that would appear as a predicate in the tree, like age > 20.

All the games with QTree and so on for determining good split points happen in the stage that trains the Encoder, which in many cases can probably be done once at the root. Then the expansion becomes much simpler - we don't have fancy Splitters or anything like that; we just accumulate a separate (T,T) for each bit in the encoding, and pick the best one based on the error. In fact, because we have the parent T and T is almost always a Group, we can just accumulate the T for the instances where the bit is set and derive the other half by subtracting from the parent.
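Concretely, the per-bit accumulation could look something like this (a sketch only, assuming T has an Algebird Group instance; bestSplit and error are made-up names):

```scala
import com.twitter.algebird.Group

// Sketch of the simplified expansion: one (left, right) pair of Ts per bit.
// Each instance carries its encoded bitset and its target summary T.
def bestSplit[T](
    instances: Iterable[(Vector[Boolean], T)],
    parent: T,
    numBits: Int,
    error: (T, T) => Double
)(implicit group: Group[T]): Int = {
  // For each bit, accumulate the combined T of the instances that have that bit set.
  val withBit: Vector[T] =
    instances.foldLeft(Vector.fill(numBits)(group.zero)) { case (acc, (bits, t)) =>
      acc.zipWithIndex.map { case (sum, i) => if (bits(i)) group.plus(sum, t) else sum }
    }
  // The other half of each candidate split is parent minus withBit(i);
  // pick the bit whose (left, right) pair has the lowest error.
  (0 until numBits).minBy { i => error(withBit(i), group.minus(parent, withBit(i))) }
}
```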

(In a world with metadata, we also want (M, M) for each bit. This is less likely to be a Group, although it seems pretty plausible even then that with one side's M and the parent's M you could make a reasonable determination as to whether this was an allowable split.)

If it's unsatisfying to determine all of the possible splits up front at the root, you can always add in a re-encoding phase where, given an established tree, you train encoders for each of the leaves, and then union the resulting encodings to get a superset of all of them.
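A sketch of what that re-encoding phase might amount to (hypothetical names; the "encoding" here is just the set of candidate predicates a trained encoder can emit):

```scala
// Hypothetical: train an encoder per leaf on that leaf's instances, then union the
// candidate predicates each one produces into a single superset encoding.
def reencode[Inst, P](
    instancesByLeaf: Map[Int, Iterable[Inst]],
    trainPredicates: Iterable[Inst] => Set[P]
): Set[P] =
  instancesByLeaf.values.foldLeft(Set.empty[P])(_ ++ trainPredicates(_))
```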

Some advantages:

Cons:

Does this all make sense?

avibryant commented 8 years ago

... apologies for thinking out loud, but:

The above - encoding to a bitset - assumes that there's a bounded, finite number of predicates we might want in the tree. But that's more limiting than we want, and more limiting than necessary. As a simple example, imagine a text feature, where we want the predicates to be of the form contains_token("foo"). There's an unbounded number of possible predicates of that form.

The properties we actually want from an encoder are:

I'm not sure yet how that translates to an implementation.
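One way to picture it, though (purely illustrative; this just writes the contains_token predicate from above out as a key):

```scala
// Illustrative only: instead of a fixed-width bitset, an encoder could emit the
// (potentially unbounded) set of predicate keys that hold for a given value.
// For a text feature, every distinct token induces a contains_token(...) predicate.
def encodeText(text: String): Set[String] =
  text.split("\\s+").filter(_.nonEmpty).map(tok => s"contains_token($tok)").toSet

// encodeText("the quick brown fox") ==
//   Set("contains_token(the)", "contains_token(quick)",
//       "contains_token(brown)", "contains_token(fox)")
```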

avibryant commented 8 years ago

Ok, so in concrete terms, maybe something like:
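Something along these lines, perhaps (a hedged sketch of the idea, not a settled signature):

```scala
// Rough sketch: an encoder turns a feature value V into the set of predicate
// keys U that are true for it.
trait Encoder[V, U] extends Serializable {
  def encode(v: V): Set[U]
}
```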

The U just needs to be something with equality/hashing/etc. that you can serialize into your model. I can almost see an argument for forcing it to be String (for comprehensibility of the serialized models) or to be Long (for efficiency of tree representation - anything that doesn't map nicely to a Long you just hash). But that's a pretty unimportant side note.

In this scheme you would probably specify all the available Encoder instances up front; the "training" phase (including re-training with a larger tree later) would just increase the available set of U values for the Encoder (in the cases where that made sense). So for example, for a continuous numeric feature, you'd imagine the encoder internally keeping a List[Double] of split points, which it would use to produce the Set[U] of all split points where the V's value was < the split point; it would choose this List empirically based on a QTree or similar over the training set.
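A minimal sketch of that numeric case (hypothetical names; in practice the split points would come from a QTree or similar over the training data):

```scala
// Hypothetical continuous-numeric encoder: it holds a List[Double] of candidate
// split points (chosen empirically from the training set) and encodes a value as
// the set of predicates "v < splitPoint" that hold for it.
case class NumericEncoder(splitPoints: List[Double]) {
  def encode(v: Double): Set[String] =
    splitPoints.filter(v < _).map(s => s"lt($s)").toSet

  // "Training" (or re-training against a larger tree) just grows the set of
  // available split points; previously emitted predicates remain valid.
  def retrain(moreSplitPoints: List[Double]): NumericEncoder =
    NumericEncoder((splitPoints ++ moreSplitPoints).distinct.sorted)
}

// e.g. NumericEncoder(List(10.0, 20.0, 30.0)).encode(15.0) == Set("lt(20.0)", "lt(30.0)")
```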