stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala

Make Predicate[V] easier to work with. #84

Closed: erik-stripe closed this 8 years ago

erik-stripe commented 8 years ago

In line with the move toward binary splits, we can simplify our predicates to make them easier to work with. Previously we relied a lot on ad hoc encodings (such as Not(LessThan(_)) and OneOf(LessThan(_), EqualTo(_))).

This commit replaces the previous predicate hierarchy with a set of six non-recursive predicate types (sketched below).

Predicates will now display themselves in a friendlier format, which should make them easier to work with. They also come with a negation operator (!p), which should remove the need to create or match Not(_) nodes in user code.

The Ordering[V] requirement is moved from the Predicate[V] constructors into the apply() and run() methods. This reduces the memory used by predicates and also simplifies the serialization code.
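For illustration only, a predicate hierarchy with the shape described above might look roughly like the following. GtEq is the only constructor name confirmed later in this thread; the rest are hypothetical stand-ins, not necessarily the names used in the PR.

```scala
// Illustrative sketch: six non-recursive predicate types, negation
// without Not(_) nodes, and an Ordering required only at evaluation time.
sealed trait Predicate[V] {
  // Each predicate knows its own complement, so !p never builds a Not(_) wrapper.
  def unary_! : Predicate[V] = this match {
    case Predicate.IsEq(v)  => Predicate.NotEq(v)
    case Predicate.NotEq(v) => Predicate.IsEq(v)
    case Predicate.Lt(v)    => Predicate.GtEq(v)
    case Predicate.GtEq(v)  => Predicate.Lt(v)
    case Predicate.Gt(v)    => Predicate.LtEq(v)
    case Predicate.LtEq(v)  => Predicate.Gt(v)
  }

  // The Ordering is needed only when the predicate is evaluated,
  // not when it is constructed or serialized.
  def apply(x: V)(implicit ord: Ordering[V]): Boolean = this match {
    case Predicate.IsEq(v)  => ord.equiv(x, v)
    case Predicate.NotEq(v) => !ord.equiv(x, v)
    case Predicate.Lt(v)    => ord.lt(x, v)
    case Predicate.LtEq(v)  => ord.lteq(x, v)
    case Predicate.Gt(v)    => ord.gt(x, v)
    case Predicate.GtEq(v)  => ord.gteq(x, v)
  }
}

object Predicate {
  final case class IsEq[V](value: V)  extends Predicate[V]
  final case class NotEq[V](value: V) extends Predicate[V]
  final case class Lt[V](value: V)    extends Predicate[V]
  final case class LtEq[V](value: V)  extends Predicate[V]
  final case class Gt[V](value: V)    extends Predicate[V]
  final case class GtEq[V](value: V)  extends Predicate[V]
}
```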

Support for IsPresent(_) has also been removed. If you want missing values to be treated specially (i.e. treated as simply matching or not matching), you'll need to arrange for that before the row value is constructed. Future work might generalize rows from Map[K, V] to (K => V) to make this more efficient.
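As a purely hypothetical example of arranging this before the row is built (encodeAge and the sentinel value are invented for this sketch, not part of the library):

```scala
// Decide how a missing value should behave *before* the row Map is built,
// rather than relying on an IsPresent(_) predicate at traversal time.
def encodeAge(age: Option[Int]): Int =
  age.getOrElse(-1) // map "missing" to a sentinel value that a split can isolate

val row: Map[String, Int] = Map("age" -> encodeAge(None)) // Map("age" -> -1)
```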

?r by @tixxit, @avibryant, and/or @jvns.

erik-stripe commented 8 years ago

The premise behind this PR is that, if in the future we want to change how missing values affect traversal, we would do it via feature encoding. (I think that's where we ended up during our last conversation.)

avibryant commented 8 years ago

The premise behind this PR is that, if in the future we want to change how missing values affect traversal, we would do it via feature encoding

I think that's basically right, although one clarification/question: rather than being something we do during feature encoding, I think it would need to be something we do during training/splitting. For example, for a bag-of-words feature, I'd expect the V to be Set[String], and this would be expanded into a number of "feature regions" during the training phase, which would be aggregated separately. (Incidentally, now that I think of it, this seems like a useful optimization for other feature types as well.) But this means the predicates generated would be things like GtEq(Set("foo")), with the meaning "contains the token 'foo'". That would mean we'd need a special Ordering to achieve those semantics, which we'll need to make available to anyone using these predicates. Does that seem reasonable?
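A sketch of what such an Ordering might look like for the bag-of-words case described above. This is illustrative only, and (as noted further down in the thread) set inclusion is really a partial order, so this is not a lawful total Ordering:

```scala
// Sketch of an Ordering[Set[String]] under which GtEq(Set("foo"))
// means "the row's token set contains 'foo'".
val inclusionOrdering: Ordering[Set[String]] =
  new Ordering[Set[String]] {
    def compare(x: Set[String], y: Set[String]): Int =
      if (x == y) 0
      else if (y.subsetOf(x)) 1   // x contains everything in y
      else if (x.subsetOf(y)) -1
      else -1                     // incomparable sets: arbitrarily report "less"
  }

// With this ordering, gteq(tokens, Set("foo")) is true exactly when
// tokens contains "foo".
inclusionOrdering.gteq(Set("foo", "bar"), Set("foo")) // true
inclusionOrdering.gteq(Set("bar"), Set("foo"))        // false
```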

jvns commented 8 years ago

This looks easier to understand to me!

tixxit commented 8 years ago

@avibryant Yeah, but I think in that case we'd go further and actually want our own wrapper around Set that guarantees the Ordering we expect.

Another option Erik and I were thinking about: during feature encoding, you'd presumably know all of the possible values that can be seen. If we think of a row as just a K => Option[V], then we could easily have a "feature encoder" that provides defaults for all valid values of K. That is, this would work if we wanted a feature-per-word encoding. I do prefer the set style with smarter splits, though.
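A hypothetical sketch of that idea; FeatureEncoder and its members are invented here for illustration and are not part of brushfire:

```scala
// A row is just K => Option[V]; the encoder knows every valid K,
// so it can fill in a default wherever the row has no value.
trait FeatureEncoder[K, V] {
  def validKeys: Set[K]
  def default(key: K): V

  // Turn a sparse row into a total function over the valid keys.
  def complete(row: K => Option[V]): K => V =
    key => row(key).getOrElse(default(key))
}

// e.g. a feature-per-word encoding where absence means "count 0"
val wordCounts: FeatureEncoder[String, Int] =
  new FeatureEncoder[String, Int] {
    val validKeys = Set("foo", "bar", "baz")
    def default(key: String): Int = 0
  }

val sparse: String => Option[Int] = Map("foo" -> 2).get
val dense: String => Int = wordCounts.complete(sparse)
// dense("foo") == 2, dense("bar") == 0
```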

tixxit commented 8 years ago

Nothing to do with this PR, but for things like the Set case, we may, technically, want to use a partial ordering instead of an ordering.
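For example, a sketch using scala.math.PartialOrdering to capture set inclusion; the value name and token sets are invented for illustration:

```scala
import scala.math.PartialOrdering

// Set inclusion as a PartialOrdering, which (unlike Ordering) can
// honestly report that two sets are incomparable.
val inclusion: PartialOrdering[Set[String]] =
  new PartialOrdering[Set[String]] {
    def tryCompare(x: Set[String], y: Set[String]): Option[Int] =
      if (x == y) Some(0)
      else if (x.subsetOf(y)) Some(-1)
      else if (y.subsetOf(x)) Some(1)
      else None // incomparable

    def lteq(x: Set[String], y: Set[String]): Boolean =
      x.subsetOf(y)
  }

inclusion.gteq(Set("foo", "bar"), Set("foo")) // true: contains "foo"
inclusion.tryCompare(Set("foo"), Set("bar"))  // None: incomparable
```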

erik-stripe commented 8 years ago

Yeah, based on @avibryant's comment I think using a partial ordering makes sense. We can do that in another PR.

tixxit commented 8 years ago

:+1: