stripe-archive / brushfire

Distributed decision tree ensemble learning in Scala

[WIP] Feature encoders #80

Open tixxit opened 8 years ago

tixxit commented 8 years ago

This is super early work, but we have a CsvTrainerJob, which can run on ~arbitrary CSVs, with the labels provided by the user. The actual types of the values will be inferred before training.
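To make the "types inferred before training" step concrete, here is a minimal sketch of per-column inference over string values read from a CSV. It is not the code in this PR: `ColumnType`, `IntegerLike`, `RealValued`, `Categorical`, and `InferColumnTypes` are hypothetical names, and the rule (try integer, then real, otherwise categorical) is an assumption about what such a pass might do.

```scala
import scala.util.Try

// Hypothetical column types; these names do not come from brushfire.
sealed trait ColumnType
case object IntegerLike extends ColumnType // every non-empty value parses as a Long
case object RealValued extends ColumnType  // every non-empty value parses as a Double
case object Categorical extends ColumnType // anything else, or no values at all

object InferColumnTypes {
  private def isLong(s: String): Boolean = Try(s.toLong).isSuccess
  private def isDouble(s: String): Boolean = Try(s.toDouble).isSuccess

  // One pass over a single column. Empty cells are treated as missing and ignored.
  // Note: toDouble also accepts strings such as "NaN"; a real implementation
  // would probably want stricter parsing.
  def inferColumn(values: Seq[String]): ColumnType = {
    val present = values.map(_.trim).filter(_.nonEmpty)
    if (present.isEmpty) Categorical
    else if (present.forall(isLong)) IntegerLike
    else if (present.forall(isDouble)) RealValued
    else Categorical
  }

  // Infer a type for every column named in the header.
  // Rows shorter than the header are padded with empty (missing) cells.
  def inferAll(header: Seq[String], rows: Seq[Seq[String]]): Map[String, ColumnType] =
    header.zipWithIndex.map { case (name, i) =>
      name -> inferColumn(rows.map(r => if (i < r.length) r(i) else ""))
    }.toMap
}
```

For example, `InferColumnTypes.inferAll(Seq("age", "city"), Seq(Seq("31", "Oslo"), Seq("42", "Lima")))` would label `age` as `IntegerLike` and `city` as `Categorical` under these assumed rules.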

tixxit commented 8 years ago

@avibryant So, this has some of the stuff I've been doing (still very WIP), but the important bits are:

There is an implementation of FeatureEncoding for the Dispatched type, DispatchedFeatureEncoding, which has a "trainer" that makes a pass over the data and attempts to infer which sub-type of Dispatched to use for each feature.

The main goal of the FeatureParser vs. FeatureEncoder split is to separate three things: the input type, the input-type-agnostic feature encoding bits, and the tree K/V type. That way, we can train off CSV data or Thrift and still write a web service that accepts JSON.
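A rough sketch of how that split could look, to show why it lets one encoder serve several input formats. The trait names `FeatureParser` and `FeatureEncoder` are taken from the comment above, but every signature here is assumed, and `CsvParser`, `KeyValueParser`, `DoubleEncoder`, and `Pipeline` are hypothetical helpers rather than anything in this PR. `KeyValueParser` stands in for a JSON front end that has already decoded a request into a map.

```scala
import scala.util.Try

// Input-specific half: turns one raw record of type In into named raw values.
// (Assumed signature, for illustration only.)
trait FeatureParser[In] {
  def parse(in: In): Map[String, String]
}

// Input-agnostic half: turns named raw values into the tree's value type V.
// (Assumed signature, for illustration only.)
trait FeatureEncoder[V] {
  def encode(raw: Map[String, String]): Map[String, V]
}

// Hypothetical parser for one comma-separated line (no quoting support).
final class CsvParser(header: Vector[String]) extends FeatureParser[String] {
  def parse(line: String): Map[String, String] =
    header.zip(line.split(",", -1).toVector.map(_.trim)).toMap
}

// Hypothetical parser for already-decoded key/value input, e.g. a JSON object.
object KeyValueParser extends FeatureParser[Map[String, Any]] {
  def parse(in: Map[String, Any]): Map[String, String] =
    in.map { case (k, v) => k -> v.toString }
}

// Hypothetical encoder: keeps only the values that parse as Double.
object DoubleEncoder extends FeatureEncoder[Double] {
  def encode(raw: Map[String, String]): Map[String, Double] =
    raw.toSeq.flatMap { case (k, s) =>
      Try(s.toDouble).toOption.map(d => k -> d)
    }.toMap
}

object Pipeline {
  // Compose any parser with any encoder; the encoder never sees the input type.
  def features[In, V](parser: FeatureParser[In], encoder: FeatureEncoder[V])(in: In): Map[String, V] =
    encoder.encode(parser.parse(in))
}
```

With this shape, `Pipeline.features(new CsvParser(Vector("age", "score")), DoubleEncoder)` could be used at training time and `Pipeline.features(KeyValueParser, DoubleEncoder)` in a web service, with both producing the same `Map[String, Double]` for the trees.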