haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

RegressionTree is Training Extremely Slow When Using Large Datasets #791

Open nwalexander opened 2 weeks ago

nwalexander commented 2 weeks ago

I am using Smile to train a regression tree (RegressionTree.java). When the training data grows to about 100 million records, training takes significantly longer than training the same model in Python with sklearn; in fact, Smile is 15 times slower than sklearn.

The parameters are set identically to those used in scikit-learn.

I also noticed that commenting out the shuffle() call in split() (CART.java, line 307) significantly reduces the training runtime, but the model's performance suffers.

Do you have any idea why this may be the case? What would you suggest for bringing Smile's training speed closer to sklearn's? Is there any way to optimize the shuffle algorithm while maintaining performance similar to sklearn's?

haifengl commented 22 hours ago

I added an optimization, which you can turn on with

System.setProperty("smile.regression_tree.bins", "200");

before calling the training algorithms (RegressionTree, RandomForest, or GradientTreeBoost).
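As a minimal sketch of where that call goes in a program: only the property name and the value 200 come from the comment above. The class name BinsConfig, the helper configureBins, and the training call mentioned in the comment are illustrative assumptions, not part of Smile.

```java
public class BinsConfig {
    // Property name as given in the comment above.
    static final String BINS_PROPERTY = "smile.regression_tree.bins";

    /** Sets the bin count and returns the value read back from the system properties. */
    static int configureBins(int bins) {
        System.setProperty(BINS_PROPERTY, Integer.toString(bins));
        // Read the property back so a typo in the name or value shows up immediately.
        return Integer.getInteger(BINS_PROPERTY, -1);
    }

    public static void main(String[] args) {
        int bins = configureBins(200);
        System.out.println(BINS_PROPERTY + " = " + bins);
        // Train only after the property is set, for example with RegressionTree,
        // RandomForest, or GradientTreeBoost. The exact training call depends on
        // your Smile version and is deliberately left out of this sketch.
    }
}
```

The property is set programmatically here; passing -Dsmile.regression_tree.bins=200 on the JVM command line sets the same system property before the program starts.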

You may adjust the number of bins to find a good balance between speed and model quality. Please try it with the master branch. Thanks.
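The thread does not say how the bins are used internally. One common reading, assumed here, is histogram-style split finding: feature values are grouped into a fixed number of bins, so each split search scans the bin boundaries instead of every distinct value. The sketch below illustrates that general idea with equal-width bins and a squared-error criterion. It is not Smile's implementation, and the names BinningSketch, toBins, and bestSplit are hypothetical.

```java
import java.util.Arrays;

public class BinningSketch {
    /** Maps each value to an equal-width bin index in [0, bins). */
    static int[] toBins(double[] x, int bins) {
        double min = Arrays.stream(x).min().orElse(0.0);
        double max = Arrays.stream(x).max().orElse(0.0);
        double width = (max - min) / bins;
        int[] idx = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            // A constant feature has zero width; put everything in bin 0.
            int b = width == 0.0 ? 0 : (int) ((x[i] - min) / width);
            idx[i] = Math.min(b, bins - 1); // the maximum value lands in the last bin
        }
        return idx;
    }

    /**
     * Returns the bin index b such that sending bins 0..b to the left child
     * minimizes the total squared error, or -1 if no valid split exists.
     */
    static int bestSplit(double[] x, double[] y, int bins) {
        int[] idx = toBins(x, bins);
        double[] sum = new double[bins];
        int[] count = new int[bins];
        double total = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum[idx[i]] += y[i];
            count[idx[i]]++;
            total += y[i];
        }

        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        double leftSum = 0.0;
        int leftCount = 0;
        // Only bins - 1 candidate splits are examined, regardless of the sample size.
        for (int b = 0; b < bins - 1; b++) {
            leftSum += sum[b];
            leftCount += count[b];
            int rightCount = x.length - leftCount;
            if (leftCount == 0 || rightCount == 0) continue;
            double rightSum = total - leftSum;
            // Maximizing this score is equivalent to minimizing the squared error.
            double score = leftSum * leftSum / leftCount + rightSum * rightSum / rightCount;
            if (score > bestScore) {
                bestScore = score;
                best = b;
            }
        }
        return best;
    }
}
```

Under this reading, fewer bins mean fewer candidate splits per feature and faster training, while more bins allow finer split thresholds, which is the speed versus quality trade-off mentioned above.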

haifengl commented 3 hours ago

I turned on this optimization by default with bins = 100. You only need to call System.setProperty("smile.regression_tree.bins", "200") if you want a different number of bins.