stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala
Other
391 stars 50 forks source link

[WIP] Prune trees to minimize validation error #36

Closed roban closed 9 years ago

roban commented 9 years ago

Very much WIP at the moment, but it compiles, runs, and seems to to do something reasonable.

I added the digits dataset to have another example, and created example/iris-prune and example/digits-prune scripts (to run after example/iris or example/digits).

Advice welcome, @avibryant.

avibryant commented 9 years ago

This looks correct to me, I had a bunch of style notes.

roban commented 9 years ago

Thanks, I'll clean this up and test some more.

roban commented 9 years ago

I'm working through the style fixes, but in the mean time I wrote some code to visualize the difference between a tree and a pruned version of the tree.

Here's an example:

image

Red nodes are the one's pruned from the tree. Green nodes are the replacement leaves.

roban commented 9 years ago

@avibryant: thanks for all the tips. Really helpful as I get a handle on idiomatic scala. I'm going to do some testing with larger models on hadoop before I request another review.

avibryant commented 9 years ago

This LGTM except that:

avibryant commented 9 years ago

BTW an example of what I mean by hashing properly on the id is in Samplers: https://github.com/stripe/brushfire/blob/master/brushfire-core/src/main/scala/com/stripe/brushfire/Samplers.scala#L12

roban commented 9 years ago

Should I worry about the Travis error?

avibryant commented 9 years ago

As long as it builds, no, I think I broke travis earlier (and since we don't have any tests, it doesn't seem high priority to fix :)

avibryant commented 9 years ago

Apart from my one minor comment, LGTM