motefly / DeepGBM

SIGKDD'2019: DeepGBM: A Deep Learning Framework Distilled by GBDT for Online Prediction Tasks

Leaf Embedding Model Performance #10

Closed stevenbinhu21 closed 5 years ago

stevenbinhu21 commented 5 years ago

Hi, I am trying to replicate this work on my own dataset, a corpus of around 0.2 million samples, and I trained the GBDT2NN model for a 5-category classification task. I found that the leaf embedding model does not perform well on the test set (its accuracy is far lower than the GBDT's), and as a result, the GBDT2NN model trained against this pretrained leaf embedding performs even worse (a 30-40% decrease in accuracy).

Since the paper does not present any evaluation of this leaf embedding learning step, I wonder if you could clarify a few things:

  1. The paper presents results only for binary classification and regression, and the number of trees will be much larger for multi-category classification tasks. So will a large tree number hurt performance? Since the initial leaf embedding size is [n_clusters, max_leaves, num_classes, leaf_emb_size], a larger number of trees may introduce too much variance into the embedding (see the sketch after this list).

  2. Also, if the trees are deeper, leading to very complex tree structures, what adjustments would you recommend for tuning the model for better performance?

  3. For all your datasets, the leaf embedding model trains for at most 10 epochs (many for at most 2 epochs). Did all of these models outperform their GBDT counterparts by only learning to predict leaf values? Did these models actually converge in so few epochs? Or does more training actually hurt performance (this happens in my experiments)?
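
To make the shape concern concrete, here is a minimal PyTorch sketch of the kind of per-cluster leaf-embedding lookup I have in mind (all sizes and names here are illustrative, not taken from the repo):

```python
import torch
import torch.nn as nn

# Illustrative sizes only (5-class task as above; values are hypothetical)
max_leaves, leaf_emb_size = 64, 20
batch, trees_per_cluster = 32, 50

# Dense embedding over leaf indices; the full table in my setup has shape
# [n_clusters, max_leaves, num_classes, leaf_emb_size], so its size grows
# multiplicatively with the number of classes and clusters.
emb = nn.Embedding(max_leaves, leaf_emb_size)

# leaf_idx[i, t] = leaf that sample i reaches in tree t of this cluster
leaf_idx = torch.randint(0, max_leaves, (batch, trees_per_cluster))

# Concatenate per-tree embeddings to form the NN input for one cluster;
# every extra tree adds another lookup to this input.
cluster_input = emb(leaf_idx).reshape(batch, -1)
print(cluster_input.shape)  # torch.Size([32, 1000])
```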

guolinke commented 5 years ago

@stevenbinhu21 Actually, we haven't used GBDT2NN for multi-class yet, and I don't think the current solution transfers directly to multi-class problems. Since the trees for different classes are quite different, it is hard to say whether putting them in the same cluster is good or not. Maybe you can try separating the different classes into different clusters.
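
For example, a rough sketch of that separation (assuming LightGBM's usual tree ordering, where each boosting iteration emits one tree per class, so tree i belongs to class i % num_classes; the helper name is made up):

```python
import numpy as np

# Rough sketch: split trees by class before clustering, assuming LightGBM's
# usual ordering (one tree per class per iteration, so tree i -> class i % k).
def group_trees_by_class(n_trees, num_classes):
    tree_ids = np.arange(n_trees)
    return [tree_ids[tree_ids % num_classes == c] for c in range(num_classes)]

# 500 trees, 5 classes -> 5 groups of 100 trees; run the tree clustering
# inside each group instead of mixing classes in one cluster.
groups = group_trees_by_class(500, 5)
print([len(g) for g in groups])  # [100, 100, 100, 100, 100]
```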

BTW, does "corpus" refer to text data? Text data is all high-dimensional categorical features, and I think that is hard for a tree model to learn from.

As for your questions:

  1. Yeah, more trees are harder to learn.
  2. You may need to increase the leaf embedding size, or even increase the number of clusters.
  3. I think we didn't train them to convergence. Pinging @motefly for more experiment details.

BTW, please note that the motivation of this work is the online learning pain point of tree-based models. For offline performance, its upper bound is similar to that of the NN + GBDT(LR) model.

stevenbinhu21 commented 5 years ago

Thanks for the response. I did a few more experiments and found that my trees grow too deep and too complex. By limiting the number of leaves and early-stopping the leaf embedding training (at most 3 epochs), given the same number of trees, the GBDT2NN model can slightly outperform its GBDT counterpart after around 50 epochs. Increasing the leaf embedding size does not improve things much, but increasing the number of clusters does help (with 500 trees, 20 clusters perform much better than 10 clusters). I guess grouping too many trees of different classes in the same cluster indeed makes it harder for the NN to learn from them.
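
For reference, a minimal sketch of the tree-complexity caps I mean, with illustrative (untuned) values; num_leaves and max_depth are standard LightGBM parameters:

```python
import numpy as np
import lightgbm as lgb

# Illustrative values, not tuned: capping num_leaves/max_depth keeps the
# trees simple enough for the leaf embedding model to fit them.
params = {
    'objective': 'multiclass',
    'num_class': 5,
    'num_leaves': 32,    # limit tree width
    'max_depth': 6,      # limit tree depth
    'learning_rate': 0.1,
}
X = np.random.rand(1000, 20)
y = np.random.randint(0, 5, 1000)
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)

# Leaf indices per sample per tree, the targets for the embedding model;
# multiclass yields num_boost_round * num_class trees in total.
leaves = booster.predict(X, pred_leaf=True)  # shape (1000, 500)
```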

BTW, since GBDT is an additive tree ensemble, I simply reshape the tree indices to [num_clusters*num_trees_per_cluster, num_classes] and randomize the cluster tree selection along the first axis, then treat each leaf embedding NN as a multi-classification model and ensemble these models for the final prediction. But what you suggest sounds like a more intuitive alternative; I will try clustering trees by class and see how it goes.
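
Concretely, the reshape looks roughly like this (a sketch with made-up names; 500 trees, 5 classes, and 20 clusters as in my experiments):

```python
import numpy as np

# 500 trees on a 5-class task; each boosting iteration adds one tree per
# class, so rows of the reshaped index array are complete additive steps.
num_classes, num_trees, num_clusters = 5, 500, 20
tree_ids = np.arange(num_trees).reshape(num_trees // num_classes, num_classes)

# Shuffle rows, then split across clusters: each cluster keeps whole
# per-class tree groups, so its leaf embedding NN is a small multiclass model.
rng = np.random.default_rng(0)
rng.shuffle(tree_ids, axis=0)
clusters = np.array_split(tree_ids, num_clusters, axis=0)
print(clusters[0].shape)  # (5, 5): 5 iterations x 5 classes per cluster
```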

Again, I really appreciate the help. I will close this issue for now.