Open nwalexander opened 2 weeks ago
I added an optimization, which you can enable with
System.setProperty("smile.regression_tree.bins", "200");
before calling the training algorithms (RegressionTree, RandomForest, or GradientTreeBoost). You may adjust the number of bins to find a good balance between speed and model quality. Please try it with the master branch. Thanks.
This optimization is now turned on by default with bins = 100. Only if you want a different bins value do you need to call System.setProperty("smile.regression_tree.bins", "200").
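A minimal sketch of how the property is set and read back, assuming the trainer reads "smile.regression_tree.bins" with a default of 100 as described above (the actual Smile training calls are omitted; only java.lang APIs are used here):

```java
public class BinsConfig {
    public static void main(String[] args) {
        // Must be set before RegressionTree/RandomForest/GradientTreeBoost training runs,
        // since the trainer reads the property when it starts.
        System.setProperty("smile.regression_tree.bins", "200");

        // Read it back the way the trainer presumably would,
        // falling back to the stated default of 100 when unset.
        int bins = Integer.parseInt(
                System.getProperty("smile.regression_tree.bins", "100"));
        System.out.println("bins = " + bins);  // prints "bins = 200"
    }
}
```

Because system properties are process-wide, setting this once at startup affects every subsequent training call in the JVM.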
I am using Smile to train a regression tree (RegressionTree.java). I noticed that when the size of the training data grows (to ~100 million records), training time increases significantly compared to training the same model in Python with scikit-learn; in fact, training with Smile is 15 times slower than sklearn. The parameters are set to be identical to those of scikit-learn.
I also noticed that commenting out the shuffle() call in split() (CART.java, line 307) significantly reduces training runtime, but hurts model quality.
Do you have any idea why this may be the case? What do you suggest for bringing Smile's training speed closer to scikit-learn's? Is there any way to optimize the shuffle step while keeping model quality comparable to sklearn?