stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala
Other
391 stars 50 forks source link

Annotated instances WIP #74

Closed NathanHowell closed 4 years ago

NathanHowell commented 8 years ago

NOTE: this doesn't work right now, wanted to open this now before I actually complete the work.

I'm looking at efficient ways to calculate some statistics from Instance instances and attach them to tree leaf nodes. One idea I had was to replace the timestamp parameter (which I don't always need) with A and use Semigroup.sumOption to populate LeafNode instances as they're created. If the timestamp is needed it may also be an interesting thing to track with a (Min[Long], Max[Long]) tuple (or something fancier) to see time ranges per node.

The real world problem is that I have a lot of Instances and a lot of customers. An Instance is tied to a single customer, but a customer can have multiple Instances. I want to count the number of distinct customers per split (via Monoid[HyperLogLog], which is usually much different from the target distribution. It would be interesting to calculate these statistics per target as well.

tixxit commented 8 years ago

I feel like something like Aggregator may be better suited here to "sum" instance annotations. Namely, the annotation on the instance (call it L) is not necessarily the same type that we want as the final tree node annotation (call it A). So, Trainer would also take some Aggregator[L, _, A]. In your case, your L is the customer, the ignored middle type would be an HLL, and A would be the count (Int?).

NathanHowell commented 8 years ago

Yeah, that would work just fine. Seems like this would also require removing the Semigroup constraint on A (related to that other open pull request). Do you think this approach is worth pursuing?

NathanHowell commented 8 years ago

Related question: would the cached annotation on SplitNode still be necessary for your usage?

avibryant commented 8 years ago

Ah interesting. For a while we had a weight: W attribute on Instance for nearly exactly this use case (multiple charges per merchant, interested in unique merchants per split). What we switched to was just having a more complex T value - like, instead of having T be Map[Boolean,Long] it could be Map[Boolean,(Long,HLL)]. Would that work for you? There's probably some worthwhile work to be done making this kind of thing more convenient...

Incidentally, a strategy we've played with for the problem you may be worrying about here (of overfitting to specific, high-volume customers), is to use the customer for the Instance id, which means trees are sampling whole customers instead of individual instances, which means that no customer can be in every tree, which helps with the overfitting. (This, too, might be worth making more convenient as a pattern).

NathanHowell commented 8 years ago

It's for two different problems: one is preventing high traffic users from dominating the trees (the top 1% are being discarded entirely now), the other is for visualization and pattern/sequence mining.

Sticking a HLL in T would also be acceptable. I'll try this out later today.

NathanHowell commented 8 years ago

Another related idea: splits that cover a wider time period might be preferred over a narrower period, particularly towards the root of a tree.

avibryant commented 8 years ago

Yeah, interesting idea.

It does feel unsatisfying to combine the simple minimization of training error (or information gain or however you want to think about it) with what are basically anti-overfitting heuristics (ranging from a simple minimum instance count to this kind of much more sophisticated thing). It does seem reasonable for the framework to deal with these independently...

I wonder if the right thing here is to have a Instance[M,Map[K,V],T], where:

NathanHowell commented 8 years ago

Okay, now I'm attempting to implement your last suggestion. It's also not functional yet but it does seem a bit nicer. Does this seem like it's going in the right direction?

avibryant commented 8 years ago

Oh great! This definitely feels like the right direction. (Will respond to your more detailed comments later).

avibryant commented 8 years ago

(At this point do we want to make AnnotatedTree be just Tree? I don't think we use it without an A anywhere anymore, right?)

NathanHowell commented 8 years ago

AnnotatedTree isn't required anymore. The latest push removes this, but is still not fully working yet.

NathanHowell commented 8 years ago

It's getting closer, still some failing tests though.

avibryant commented 8 years ago

You asked about Hashable but I can't find that comment here anymore. Anyway, I kinda agree that M => String is probably just as good for now.

NathanHowell commented 8 years ago

Alright, I've backed out the Hashable goo and tried to minimize the diff.

NathanHowell commented 8 years ago

Alright, done with comments for now.

NathanHowell commented 8 years ago

Following up to @tixxit's earlier comment: using Aggregator may be the best route. We're half way there already and I ended up needing something like Aggregator.present in our training pipeline.

avibryant commented 8 years ago

Hm, if I understand @tixxit's comment, though, he's implying that the final (post-present) type is what would live in the actual tree? I guess that's ok as long as you are storing those values for the interior nodes as well, but the fact that you lose the ability to sum the metadata for two subtrees to get the metadata for the parent bothers me a bit...

avibryant commented 8 years ago

(OTOH from a serialization POV I get it: it's often going to be much easier to serialize the final presentation type than the intermediate, summable one. So on reflection that probably does make it worthwhile).

avibryant commented 8 years ago

@NathanHowell what's the status of this PR - should we consider it complete at this point?

NathanHowell commented 8 years ago

I have some fixes from Spark that need to get rolled into the Scalding driver. Otherwise it's working.

On Wed, Dec 23, 2015 at 10:34 AM, Avi Bryant notifications@github.com wrote:

@NathanHowell https://github.com/NathanHowell what's the status of this PR - should we consider it complete at this point?

— Reply to this email directly or view it on GitHub https://github.com/stripe/brushfire/pull/74#issuecomment-166966268.