@kingfengji Thanks for making the code available. I believe the multi-layered decision tree approach is very elegant and powerful! I applied your model to the Boston housing dataset but wasn't able to outperform a baseline xgboost model.
Details
To compare your approach to several alternatives, I ran a small benchmark study with the following setups, where all models share the same hyper-parameters (a minimal sketch of the setup follows the list):
baseline xgboost model (xgboost)
mGBDT with xgboost for hidden and output layer (mGBDT_XGBoost)
mGBDT with xgboost for hidden but with linear model for output layer (mGBDT_Linear)
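For concreteness, here is a rough sketch of the baseline and the evaluation. The hyper-parameter values are placeholders rather than the exact settings from the attached notebook, the data is pulled from the OpenML copy of the Boston housing set, and the two mGBDT variants are built along the lines of the repo's example scripts (not shown here):

```python
import xgboost as xgb
from sklearn.datasets import fetch_openml
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# One way to load the Boston housing data (load_boston was removed from
# recent scikit-learn releases, so this pulls the OpenML copy instead).
X, y = fetch_openml(name="boston", version=1, as_frame=False, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline xgboost regressor; the hyper-parameter values below are placeholders,
# not the exact settings used in the attached notebook.
baseline = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=100)
baseline.fit(X_train, y_train)
print("xgboost MAE:", mean_absolute_error(y_test, baseline.predict(X_test)))
```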
I am using PyTorch's L1Loss for model training and the MAE for evaluation; all models are trained in serial mode. Results are as follows:
In particular, I observe the following:
irrespective of the hyper-parameters and number of epochs, a baseline xgboost model tends to outperform your approach
with an increasing number of epochs, the runtime per epoch increases considerably. Any idea as to why this happens?
using mGBDT_Linear:
I wasn't able to use PyTorch's MSELoss since the loss exploded after some iterations, even after normalizing X. Should we, as with neural networks, also scale y to avoid exploding gradients? (see the sketch after this list for what I mean)
the training loss starts at exceptionally high values, then decreases before it starts to increase again
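To make that last point concrete, this is the kind of target scaling I have in mind, sketched with scikit-learn's StandardScaler; the synthetic data and the model placeholder are only for illustration, not part of the actual benchmark:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with large-scale targets (as with house prices).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 1_000.0 * (X @ rng.normal(size=5)) + 5_000.0

x_scaler, y_scaler = StandardScaler(), StandardScaler()
X_scaled = x_scaler.fit_transform(X)
y_scaled = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()

# Train any of the models above on (X_scaled, y_scaled), then map predictions
# back to the original scale before computing the MAE, e.g.:
#   pred = y_scaler.inverse_transform(model.predict(X_scaled).reshape(-1, 1)).ravel()
```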
Additional Questions
Given that you have mostly used your approach for classification tasks, is there anything we need to change before using it for regression tasks, apart from the PyTorch loss?
Besides the loss of F, can we also track how well the target propagation is working by evaluating the reconstruction loss of G?
When using mGBDT with a linear output layer, would we expect to generally see better results compared to using xgboost for the output layer?
What is the benefit of using a linear output layer compared to a xgboost layer?
For training F and G, you are currently using the MSELoss for the xgboost models. Do you have any experience with modifying this loss? (A generic xgboost custom-objective sketch follows these questions, to illustrate what I mean.)
What is the effect of the number of iterations for initializing the model before training?
What is the relationship between the number of boosting iterations (for xgboost training) and the number of epochs (for MGBDT training)?
In Section 4 of your paper you state "The experiments for this section is mainly designed to empirically examine if it is feasible to jointly train the multi-layered structure proposed by this work. That is, we make no claims that the current structure can outperform CNNs in computer vision tasks." So, as a question: does that mean your intention is not to outperform existing deep learning models, say CNNs, or existing GBM models, like XGBoost, but rather to show that a decision-tree-based model can also be used to learn meaningful representations that can then be used for downstream tasks?
Connected to the previous question: gradient boosting models are already very strong learners that achieve very good results in many applications. What would be your motivation for stacking multiple layers of such models? Could it even happen that, due to the implicit error-correction mechanism of GBMs, training several of them leads to a drop in accuracy?
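To illustrate the question about modifying the loss for F and G: below is a plain xgboost custom objective, shown stand-alone with the low-level training API rather than wired into mGBDT, and the pseudo-Huber loss is only an example choice. The data is synthetic so the snippet runs on its own:

```python
import numpy as np
import xgboost as xgb

def pseudo_huber(preds, dtrain, delta=1.0):
    """Gradient and Hessian of the pseudo-Huber loss w.r.t. the raw predictions."""
    r = preds - dtrain.get_label()
    scale = 1.0 + (r / delta) ** 2
    return r / np.sqrt(scale), 1.0 / scale ** 1.5

# Synthetic stand-in data, just so the snippet is self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 5, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=pseudo_huber)
```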
Code
To reproduce the results, you can use the attached notebook.
ModelComparison.zip
@kingfengji I would highly appreciate your feedback. Many thanks.