Closed: mirekphd closed this issue 1 year ago.
Thanks for the suggestion! For GPU execution, Treelite compiled models obviously would not work, and we would expect better performance via FIL on GPU than Treelite compiled models on CPU.
On CPU, the picture is not as clear. The current CPU FIL implementation has quite good performance, but it can be beaten by Treelite compiled models under some deployment scenarios. However, we are about to move to a new CPU FIL implementation, which we expect to outperform Treelite compiled models in most deployment scenarios (though not in absolutely 100% of cases). Even where Treelite compiled models continue to outperform, we expect the performance differential to be much narrower for now and to disappear in the near future.
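To make the tradeoff concrete: much of the speedup from compiled models comes from turning a data-driven tree walk into straight-line branch code. A toy illustration of that general idea (this is not FIL or Treelite code, just a minimal sketch):

```python
# Toy illustration of why compiling trees can speed up inference.
# NOT FIL or Treelite code -- just the general idea that compilation
# replaces per-node data-structure lookups with fixed branches.

# A tiny decision tree as a data structure (what a generic runtime walks):
TREE = {
    "feature": 0, "threshold": 0.5,
    "left": {"leaf": 0.2},
    "right": {
        "feature": 1, "threshold": 1.5,
        "left": {"leaf": 0.7},
        "right": {"leaf": 0.9},
    },
}

def predict_interpreted(tree, x):
    """Walk the tree node by node (dict lookups at every step)."""
    node = tree
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] < node["threshold"] else node["right"]
    return node["leaf"]

def predict_compiled(x):
    """The same tree 'compiled' into fixed branches -- no per-node lookups."""
    if x[0] < 0.5:
        return 0.2
    if x[1] < 1.5:
        return 0.7
    return 0.9

# Both forms agree on every input; the compiled form just does less work.
samples = [[0.1, 2.0], [0.9, 1.0], [0.9, 3.0]]
assert all(predict_interpreted(TREE, x) == predict_compiled(x) for x in samples)
```

Treelite's ahead-of-time compilation applies this idea at the scale of full ensembles, which is where the per-prediction savings come from.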
Given the limited expected benefit, it is worth considering the downsides of supporting Treelite compiled models. Far and away the biggest concern would be the security implications of loading arbitrary shared libraries. While that might be acceptable in an environment with tight controls on the introduction of new models, the future of Treelite's model compilation is also relevant. Treelite model compilation may be dropped in the future (in part due to the performance of the new CPU FIL implementation), and even if it is not, it would be difficult for it to keep up with new features added to Treelite in general as well as to the training frameworks.
With all that in mind, I do not see a compelling case for supporting pre-compiled Treelite models. If there are other aspects of this that I'm not considering, please feel free to reopen this and add additional context.
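For reference, serving a standard (uncompiled) model through Triton's FIL backend needs only a small `config.pbtxt`, along these lines (field and parameter names as in the FIL backend documentation; the dimensions and values here are illustrative, not taken from this issue):

```
backend: "fil"
max_batch_size: 32768
input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 4 ]          # number of features (illustrative)
  }
]
output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [{ kind: KIND_CPU }]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost" }
  }
]
```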
XGBoost and LightGBM models compiled with Treelite (using the `model.export_lib()` method and the `gcc` compiler, which is very fast and parallelized) and then re-imported from dynamic libraries (`.so` files under Linux) and represented in Python as `treelite_runtime.predictor.Predictor` objects generate offline predictions (in Python scripts) noticeably faster than standard Python model objects (e.g. `lightgbm.sklearn.LGBMClassifier`).

Why, therefore, is such compilation to machine code, with prior conversion to Treelite objects [1], not used / supported before importing models into Triton itself? Wouldn't it offer a similar level of performance improvement over the text-based models that are currently being imported? I'm pretty sure compilation is not used now, because in the case of `sklearn` models it would take a noticeable amount of time: a few minutes if multiprocessing works correctly (LightGBM and XGBoost would compile 10x faster, but still noticeably slower than the current `tritonserver` startup times).

[1] The conversion to `treelite.frontend.Model` for XGBoost and LightGBM Boosters (using the `model.from_*` Treelite converter methods) and also for most `sklearn.ensemble` models (using a long list of dedicated methods such as `treelite.sklearn.SKLRFClassifierConverter.process_model()` for Random Forest classifiers, etc.).
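The compile-and-reload workflow described above can be sketched in Python as follows (API names as in treelite 2.x and `treelite_runtime`; treat this as an untested illustration, since `export_lib()` has been deprecated in newer Treelite releases):

```python
# Sketch of the Treelite compile-and-reload flow described above.
# Hedged: uses treelite 2.x / treelite_runtime API names; newer
# Treelite releases have deprecated export_lib(), so adjust to
# your installed version.
try:
    import treelite
    import treelite_runtime
    HAVE_TREELITE = True
except ImportError:  # keep the sketch importable without treelite installed
    HAVE_TREELITE = False

def compile_and_load(xgb_booster, libpath="./model.so", nthread=4):
    """Convert an XGBoost Booster to a gcc-compiled shared library,
    then reload it as a treelite_runtime Predictor."""
    model = treelite.Model.from_xgboost(xgb_booster)
    # parallel_comp splits the generated C source into multiple
    # translation units so gcc can compile them in parallel.
    model.export_lib(toolchain="gcc", libpath=libpath,
                     params={"parallel_comp": nthread})
    return treelite_runtime.Predictor(libpath, nthread=nthread)

def predict(predictor, X):
    """Run batch prediction on a 2-D float array X."""
    return predictor.predict(treelite_runtime.DMatrix(X))
```

The compile step is a one-time cost per model; once the `.so` exists, reloading the `Predictor` at startup is fast, which is why the question above focuses on compilation time rather than load time.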