Support TreeEnsembleClassifier op

sonos / tract

Tiny, no-nonsense, self-contained, Tensorflow and ONNX inference

Other

2.18k stars 210 forks source link

Support TreeEnsembleClassifier op #64

Closed aldanor closed 3 years ago

aldanor commented 5 years ago

(I'm aware that it overlaps somewhat with #56, but it's a bit more specific, hence opening it as a separate issue)

Given that it's now officially possible to convert LightGBM (and xgboost) tree ensemble classifiers into ONNX, how realistic would it be to expect tract to support TreeEnsembleClassifier op in the foreseeable future? This would potentially be a huge feature, instantly unlocking whole universe of tree ensemble classifiers (and potentially regressors as well).

// I'd be glad to help if there was some guidance on what to do and where, if needed; not quite sure how much of work it is to implement this since I'm not very familiar with the internals of tract.

Thanks!

kali commented 5 years ago

Hi,

I had not had a look at the -ML operators in ONNX yet. I just skimmed over the TreeEnsembleClassifier, and I don't foresee any structural problems.

I will not implement any of the -ML short term. My focus for now is on some optimization bits and recurring networks: basically, completing ONNX core. But I'd be glad to guide you through it.

From 30 000 feet, you will first need to find some kind of test suite: as far as I can tell, ONNX does not provide any for its -ML operators. Then figure out a way to implement the classifier (as in algorithms and structures).

Implementing an operator in tract is relatively straight forward, and we now have dozens of examples, but it's not documented.

Most implementations are in "core/src/ops" while onnx and tensorflow convert their op definitions to core, or implement them on their own if I decided they were too idiosyncratic to go in core. To be honest, I am not too sure what to do with -ML. We could put them in onnx directly, or have a tract-onnx-ml new crate to hold them and thus make them optional. That would definitely be my favourite option if implementation requires to pull in more dependencies. But for starters, I guess we can just shove them somewhere like onnx/src/mlops.

You can have a look at core/src/ops/nn/lrn.rs for an instance of a straightforward operator. As far as I can tell, the classifier is stateless, so its general structure should look like this one. The weirdest part is the rules part, which must produce as much information about input and outputs as it can for the framework to figure out tensors types and shapes. You can safely start with no rules, don't pull your hair on that right away, just return Ok(()).

Look for instantiation snippet by onnx in onnx/src/ops/nn/mod.rs. The register_all_ops methods are responsible for mapping operation names to op constructors. I assume the constructor for the classifier will be significantly bigger than the semi atomic operators we have so far, so it may be worth considering putting them on its own file...

Well that's about it :)

aldanor commented 5 years ago

Ok, this will take me a bit to digest but I'll try to take a shot at it; need to read through the existing codebase first :) But flipping through it, I think that implementing tree ensemble op should be possible.

Another difference from many of the implemented ops, I guess, is that a tree ensemble can accept tensors of types f32, f64, i32 and i64; and output either i64 or strings. Do we want to support string outputs?.. Can the currently existing traits support float/int inputs?

From 30 000 feet, you will first need to find some kind of test suite: as far as I can tell, ONNX does not provide any for its -ML operators. Then figure out a way to implement the classifier (as in algorithms and structures).

I can fit a few models with LightGBM and XGBoost and just use that as a test suite I guess.

The weirdest part is the rules part, which must produce as much information about input and outputs as it can for the framework to figure out tensors types and shapes. You can safely start with no rules, don't pull your hair on that right away, just return Ok(()).

Yea, that's the plan. Also, there's lots of input parameters/attributes and they need to be validated, because lots of them are strings like "BRANCHLEQ".

aldanor commented 5 years ago

@kali A first concrete question (apologies in advance, I think I'll have lots of them):

Currently, ONNX ops are registered in OpBuilder::new(), where it calls things like nn::register_all_ops().

or have a tract-onnx-ml new crate to hold them and thus make them optional.

If we make tract-onnx-ml a separate crate (should it be a separate crate?..), then how would OpBuilder know about it? Unless it always knows about it, but it's only enabled via a feature.

As such, which one makes more sense?

having a separate tract-onnx-ml crate, and an feature = "onnx-ml" in both tract-onnx and tract crates
having all ml stuff just enabled by default in tract-onnx, stuffed in onnx/src/ops/ml (as long as it doesn't pull many dependencies; I don't think tree ensembles would pull a single extra one, for instance)

I'd probably lean towards the latter.

kali commented 5 years ago

About the types: Tensor/SharedTensor can contain any of these: https://docs.rs/tract-core/0.2.7/tract_core/datum/enum.DatumType.html . I actually implemented just enough String to pass the onnx test that cast to and from float on master (no in 0.2.7 release). We may encounter a few issues with Strings, as so few operations are supported... Tensor::close_enough comes to mind, but nothing very difficult to fix. I'll jump in and help if it becomes too hairy.

About BRANCHLEQ and the like, we can have parameters validation in two places essentially : first is the operator instantiation method, I actually recommend using an enum to represent these string arguments. Second is the rules() method, but that would be for validating combinations of elements accepted from configuration and deduced from the network. For instance, this is where you can express the fact that if you're given classlabels_strings (and no classlabels_int64s), then the first output datum_type is String.

You're right that there is an issue with OpBuilder. In order to keep things simple, I suggest you just implement the classifier in tract-onnx for now (add ml::register_all_ops() like for nn). What I prefer long-term is a second crate tract-onnx-ml that depends on tract-onnx. I want to refactor a bit the model readers and ops registers around the tract_core::context::Context struct. That way we could augment the onnx context with onnx-ml ops before loading the model. Details are still sketchy, but I'll try to draft something one of these days.

aldanor commented 5 years ago

Thanks, that helps!

Another question - for the classifier, its output type depends on whether labels are provided as ints or strings, depending on input attributes, how should that be reflected? I don’t see how that can live within one op type, maybe I’m missing something. It’s almost like its two separate tree classifier types, one for int outputs and one for strings?.. (for regressors there’s no such problem, only the classifiers)

kali commented 5 years ago

I think you can accommodate both kind of labels in the same op. Tensor are dynamically typed, and rules() are here to enforce the dynamic type before running the network. I only have a sketchy understanding of what the classifier does, but my feeling is the label translation is somewhat of a post processor thing, so most of the op code should not care about the actual labels, right ?

So that leaves us with the internal representation (the field in the op), a naive way would be to have an Option<Vec> and an Option<Vec>. The consistency (exactly one of them defined) can be checked in the instantiation code. Or for a more typesafe option, define the field as an enum Labels { String(Vec<String>), I64(<Vec<i64>>) }

Does it make sens ?

aldanor commented 5 years ago

Yea, makes sense... I think. I'll get to it soon I hope, then it will become clearer.

Meanwhile, another question, here's an example from MaxPool (there's many more):

pub fn max_pool(node: &NodeProto) -> TractResult<Box<Op>> {
    let kernel_shape: TVec<usize> = node
        .get_attr_ints("kernel_shape")?
        .iter()
        .map(|&i| i as usize)
        .collect();
    /* ... */
}

Here, MaxPool expects a TVec<usize> and you get a &[i64] from proto. So... you convert it as i as usize which looks a bit flaky to my taste... what if i == -1? There won't be any error etc, it would just be a humongous number. Especially given that you have everything wrapped in results, plus there's a separate inference / validation part?

This actually repeats in many other places, also when it's a scalar, not an array.

Wouldn't it make sense to have something like get_attr_usize() and get_attr_usizes() (also with opt versions)? Which would return an Err when the value is negative automatically and with a proper nice error messages ("expected a positive value at ..., got ...").

Note that the multi-value getters (get_attr_usizes(), get_attr_opt_usizes()) then could no longer return a slice, it would have to be a TVec.

About the latter... I've scanned through the existing usages of get_attr_ints() and couldn't find a single case where you wouldn't just call .collect() after calling that, so that you could pass it on to a constructor of an op. Wouldn't it then make sense to just make get_attr_ints(), get_attr_floats() etc to return TVec right away? This would also make it consistent with get_attr_usizes(). If you're still concerned about "it's an extra allocation", could make it a separate get_ints_vec() or something like that. Just an idea. Also, get_strings_vec() could be very handy.

The code above could then become:

pub fn max_pool(node: &NodeProto) -> TractResult<Box<Op>> {
    let kernel_shape = node.get_usizes_vec("kernel_shape")?;
    /* ... */
}

kali commented 5 years ago

I have no objections with this plan. No objection with dedicated positive extractor with cleaner validation, no objection with changing to returning tvecs. Maybe use _tvec as a prefix instead of _vec though.

You’ll probably find dozens of sites where tract can be improved like that, to be honest :)

aldanor commented 5 years ago

I've actually found a reference implementation from Microsoft + some basic tests here, would definitely use that.

So, another general question: in ONNX protobuf, this is how the classifier is configured. There's basically a list of nodes, a list of leaves, and some general config settings. Each node belongs to a tree; nodes are mapped to leaves, etc. To avoid arrays of structs, in protobuf this is defined as a struct of arrays, whose lengths must match.

In Rust, do you think the TreeEnsembleClassifier should look like this:

pub struct TreeEnsembleClassifier {
    node_ids: TVec<i64>,
    node_treeids: TVec<i64>,
    node_kinds: TVec<NodeKind>,
    node_featureids: TVec<i64>,
    node_values: TVec<f32>,
}

or like this:

pub struct Node {
    pub kind: NodeKind,
    pub feature_id: usize,
    pub value: f32,
}

pub struct Tree {
    nodes: TVec<Node>,
    ...
}

pub struct TreeEnsembleClassifier {
    trees: TVec<Tree>,
    ...
}

(in which case there's going to be quite a lot of conversion and validation logic happening when reading it from &NodeProto)

In order for it to be reasonably fast, the classifier object should pre-cache some maps and other stuff, which should happen in the constructor (and could potentially fail if the user provided some garbage structure; the chances for it to be garbage would be much lower in the second case when it's all organized into proper objects instead of collections of equally sized arrays of fields).

aldanor commented 5 years ago

And another quick question, is it ok to completely ignore some ONNX attributes that don't affect the course of computation or the output in any way? (so that we don't even include them into our types)

kali commented 5 years ago

All right, let me give you a bit of background to answer that question, and draft a bit a future op internals doc.

Operations and model lifecycle

Before being able to run, a model an an operator goes through the following stages:

load (typically from protobuf)
type and shape inference (aka analyze)
normalization
optimization
codegen
optimization
plan and state creation, execution

Let's look at an example. Imagine a small DNN network with a two nodes:

a const op (representing the trained and frozen weights)
a MatMul binary operator, taking the weight as one input, and the model input as its second input. Its output is the model output.

At 1, we will just load and translate the protobuf in as semantic a network as possible: tract representation for inputs is to have them in the network as a Source operator. So we now have three nodes in the network. At this stage MatMul is an empty struct, it does not "know" anything about the input and output sizes and shapes.

pub struct MatMul {}

At 2, we will propagate the information about types and shapes to annotate all tensors flowing in the network. This is where the rules() method is called. At the end of this stage, hopefully, the MatMul op input and output size are determined, and the consistency with the weight array geometry has been verified for the MatMul op.

At 3, we will use the analysis results to eliminate as many constant nodes as we can, and specialized the operators: concretely, the reduce method (the name sucks) will be called on MatMul, with an input specification of the form saying something like (input 1, variable: f32, [4x3], input 2, constant, f32 [3x2] [[1.0, 2.0],[3.0,4.0]]). The binary MatMul will transform itself into a MatMulUnaryA at this stage, storing the weight value internally.

pub struct MatMulUnaryA {
    b: Tensor,
}

During 4, some transformers are going over the network to perform to optimisations that require looking at more than one operation at a time.

At 5, codegen, reduce is called again. We will tranform some operators further to make them ready to run as efficiently as possible. MatMulUnaryA becomes MatMulUnaryImplASimpleB. It assumes a form that is now dependent on the data to be manipulated, stores some values that are used repeatedly in the computation, and prepare (pack) the constant matrix B to a form that suits the multiplier kernel the best.

pub struct MatMulUnaryImplASimpleB<T: Copy + Datum + Add + Mul + Zero> {
    geo: Geo<T>,
    packed_b: Tensor,
    a_shape: TVec<usize>,
    c_shape: TVec<usize>,
}

At 6, we have a chance to perform some more multiple-ops optimisations.

And finally 7 is relatively straightforward at this point.

So now, back to your question :) where should you transform the structure ? I think in your case, unless I am mistaken, you don't need to wait for the analysis results to put the classifier trees in the right form. You can actually do that as early as the constructor. On top of that, the somewhat convoluted column storage for the tree and class is a direct artefact of the protobuf serialization, so there is no reason for it to exist past the parsing and decoding time.

kali commented 5 years ago

One remark about the TVec: they are meant to be used for tiny vectors only (typically 4 elements or less), even if they can tolerate bigger vectors. I use them mostly for list of inputs (most operators once transformed are unary or binary) and outputs, or tensor shapes (most tensor are not above 4D).

But in your case, they would be counterproductive. Fortunately, converting from a TVec to a regular Vec is relatively cheap (data will not move if size is above 4). So having protobuf methods returning tvecs is fine (as most attributes are tiny lists) but prefer regular Vec as fields for the trees and nodes (Box<[]> would be even more compact, but it feels a bit over the top).

kali commented 5 years ago

As for the runtime structure, that you should build as early as the constructor, I think I would actually go for something like that (trying to make the Node as compact as possible to get some good data location....)

pub enum Node {
   Leaf(Vec<ClassVote>),
   Ge(Branch),
   Gt(Branch),
// ...
}

pub struct Branch {
  feature: usize,
  value: f32,
  right_child: usize, // <- not strictly necessary, but allows to jump faster
}

pub struct Tree {
    nodes: Vec<Node>,
    ...
}

pub struct TreeEnsembleClassifier {
    trees: Vec<Tree>,
    ...
}

aldanor commented 5 years ago

Thanks for taking time to explain all this! Will take it all into account.

As for structure, I was thinking probably like something like this (still playing around with it):

pub enum Cmp { Ge, Gt, ... }

pub struct BranchNode {
    cmp: Cmp,
    feature_id: usize,
    value: f32
    true_id: usize,
    false_id: usize,
}

pub enum Node<L: Leaf> {
    Branch(BranchNode),
    Leaf(L),
}

(mostly because tree classifier/regressor are almost the same and differ only in what happens at the leaves. However, I'm still reading through onnxruntime code to see all the corner cases etc, so this may be changed.)

As for the "codegen" stage you've mentioned, although there's not a whole lot you can do here, but there's still a bit I had in mind (for future):

many (most) emitted ONNX trees would only have one type of branch comparison; for example, LightGBM only emits BRANCH_LEQ. Knowing this, it might make make sense to check for it early and create monomorphized "specialized" tree in order to avoid extra branching at runtime.
nodes can be reordered within trees to optimize data locality when jumping; this is not very trivial but can be done (again, later).

kali commented 5 years ago

Yeah, don't worry too much about the late optimizing steps for now. I think you can more or less do everything early anyway, as I don't think you will need the analyze insights for anything.

About the node ordering, maybe there is something I do not get, but I can't see where the node kids are encoded, so I assumed the ordering was just depth-first. This ordering may actually be quite good for locality, and you can save one of the two kid ids, as your first kid is always the node immediately following you...

aldanor commented 5 years ago

As for the ordering, you have true_id and false_id for all branch-type nodes that comes from protobuf, there’s no particular ordering assumed or provided other than that. (I think under the spec no one stops you from having cycles/loops, or orphan nodes, or multiple roots, etc - I wanted to check for all of that in the tree constructor so be can calm about it afterwards)

kali commented 5 years ago

Ha yeah indeed. missed them.

aldanor commented 5 years ago

Just an update, I hope to have a working version of both TEC and TER this week, at least the core part of the evaluator (the op rules will take a bit, with all the validation etc). I've initially started with the protobuf/config stuff, but it turned out to be a bad idea. Instead I'm now implementing the core ensemble evaluator first, looking at some of the onnxruntime C++ code as well for guidance, and the main part of it is already done (need to integrate custom score reduction functions and all of that jazz).

There's a few open questions at the moment...

If you look at the spec, for the tree ensemble regressor it says that "All trees must have their node ids start at 0 and increment by 1", and onnxruntime makes this assumption in their code as well, but for the classifier it doesn't. Moreover, onnxruntime code makes assumption that tree ids also start from 0 and can only increase by one, although it's not listed anywhere in the code. The ensembles that the existing converters (onnxmltools) emit indeed follow these rules too - I wonder whether we should require these to be satisfied as well? (makes the parsing from config part a bit easier)
The spec doesn't say anything about where the root of each tree is; onnxruntime scans each tree for a root. But in onnxruntime implementation, as I see it, nothing's stopping a single tree from having multiple roots, which is weird... In the existing converters, the root is unique per tree, there are no orphan nodes, and the root is node number 0 in each tree - again, I wonder if we should require that? (similar to above, that simplifies parsing from config)
The "average score per target" function in the ensemble regressor in onnxruntime is implemented like this - basically it's a sum(score[j] / n for j in range(n)). This makes me a bit uncomfortable since you will accumulate floating point errors. Should it rather be sum(score) / n? (this complicates computations a bit since now you have to maintain state per target, but still)

kali commented 5 years ago

Hi !

Thanks for the updates. My ML colleagues here are pretty excited by the thought of tract supporting -ML operators, so this is all good news :)

In order to optimize use of limited resources (aka me), my stance on all onnx implementation has basically been to trust the test-suite, and only the test-suite. If a feature is not covered by the test suite, I do not implement it, period. Regardless what the documentation says. One of the reason for that is, the specification is, imho, very greedy. For instance I have more or less never seen an operational ML model in f64. Sometimes, this stance it verges on absurd: I have implemented enough of the LSTM to covers the three tests, one of them not testing anything. With my usual stance I should consider LSTM done, but... well.

The onnx coverage is even worse on the -ML side of things. So, just like for the tensorflow coverage, the idea is to substitute / completement the onnx test by real models we can test against (either inline, op per op, as I did for TF, or globally, whatever works). And we limit implementation to cases covered by the tests we could reasonably find in the wild.

So :) I think we will be happy with a more restrictive spec than what the doc says. If both onnxruntime or the converters are making an assumption, we will accept it as the de facto norm. We will fix it when somebody exhibits a model breaking it.

So we will assume 1/ the ids are well behaved in both operators, 2/ tree are actually trees, not DAGs or worse.

As for the average... iiiirk. Please do it the right way. We have enough bugs (or loss of accuracy) on our own, we don't want to duplicate the ones introduced by other implementation. (On top of the loss of accuracy, it performs n divisions... division is not cheap, what are they thinking ?)

aldanor commented 5 years ago

Thanks - I agree with you on all points, let's make it more realistic which will end up being more restrictive than a somewhat (very) vague doc.

TBH I don't fully understand the doc sometimes, or it's ambiguous. For example I was trying to understand the TreeEnsembleRegressor aggregation function... As I was trying to read onnxruntime implementation, I realized it's completely wrong and it looks like the author meant to write one thing, but wrote something else which is bugged (I've filed an issue, let's see what they have to say).

Basically, according to the doc, the aggregation function "defines how to aggregate leaf values within a target" (sum/average/min/max). So... given the input, we walk down all of our trees and end up at some leaves. Each leaf has target_id (index of the output variable) and weight (leaf value to be aggregated). So, for a particular target, for example, for average aggregation function, we sum up all the leaf values for that target and divide it... by what? By the number of leaves that we've hit who have matching target_id? But then this is path-dependent in general because a single tree may contain leaves pointing to different targets... or it's not allowed? Note also that a single leaf may actually be represented as multiple leaves, so that it can affect multiple targets/outputs at once. In onnxruntime they divide by number of trees but that's wrong, and doesn't match the spec either.

So, I've checked the existing converters for ensemble regressors (lightgbm/xgboost converters in onnxmltools, random forest / gradient boosting / decision tree regressors in onnx-sklearn), and found:

Most of the time (almost always) they hard-code n_targets=1 for ensemble regressors. In this case, aggregation function becomes simply "aggregating over outputs of each tree", that's trivial and path-independent.
The only example that I've found with multi-output ensemble regressors is sklearn random forest regressors, but in that case each leaf has exactly n_targets outputs - that is, each leaf affects each and every one of the targets.

This question (structure of multi-output ensembles) also applies to the classifier ensembles as well actually. So far, I've found only two types of multi-output ensembles; I'll dig a bit more, but maybe we can just limit ourselves to that?

Multi-output ensembles where each leaf in each tree affects all of the targets (like sklearn's RandomForest). This way, aggregation function for the regressor is now path-independent, and the denominator in averaging is simply the number of trees.
Multi-output ensembles where each tree is dedicated only to one target (like LightGBM / XGBoost); that is, all leaves in each particular tree only affect a single target (typically, there's also equal number of trees dedicated to each target). This is because trees in gradient boosting ensembles only separate two classes, so for each boosting round they grow n_targets trees: "0 vs rest", "1 vs rest", "2 vs rest", etc... You could actually represent this kind of ensemble as a tuple of single-output ensembles, one for each target/output.

So, I guess we could either support a collection of special cases, like 1) single-output, 2) multi-output each-leaf-all-targets, 3) multi-output each-tree-single-target... or support the most general generic cases that covers it all - but will be inevitably slower in each particular case (and probably more complex in the implementation).

(I'm also planning to fit some models in Python in xgboost/lightgbm and various sklearn models for both classifiers and regressors, both single-output and multi-output, and then use that as test data for converted onnx representations)

kali commented 5 years ago

I had similar issues with the convolution. A generic case that will handle all cases, and specialized implementations matching common use cases, using the reduce() operation to switch to a specialized implementation when it makes sense.

Once again, I'm trying to be pragmatic. My recommendation would be to first implement one correct generic operation without too much sacrifice for performance. Then we can use it as a reference implementation to test more specialized ones. That way we have something to check for correction, and also a performance baseline. It also allow us to profile and determine if it really make sense to specialize. I try only to specialize when profiling data convinced me it was worth it. Experience show intuition in that area can be very deceptive...

So I would perfectly ok with merging just a decent, correct generic implementation first.