triton-inference-server / fil_backend

FIL backend for the Triton Inference Server
Apache License 2.0

[Feat] Support models compiled to machine code (using compilation methods supplied by Treelite) #341

Closed mirekphd closed 1 year ago

mirekphd commented 1 year ago

XGBoost and LightGBM models compiled with Treelite (using the `Model.export_lib()` method and the gcc compiler, which is very fast and parallelized) and then re-imported from dynamic libraries (`.so` files under Linux), represented in Python as `treelite_runtime.predictor.Predictor` objects, generate offline predictions (in Python scripts) noticeably faster than the standard Python model objects (e.g. `lightgbm.sklearn.LGBMClassifier`).
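For context, here is a minimal sketch of the compile-and-reload workflow described above, assuming the Treelite 2.x API (where compilation lived in `treelite` and inference in the separate `treelite_runtime` package); `booster` and `num_features` are placeholders for an already-trained LightGBM model and its feature count:

```python
import numpy as np
import treelite
import treelite_runtime

# Convert a trained LightGBM booster (`booster` is assumed to exist)
# into a Treelite model object.
tl_model = treelite.Model.from_lightgbm(booster)

# Compile the trees to machine code with gcc; `parallel_comp` splits
# the generated C source into chunks so compilation runs in parallel.
tl_model.export_lib(
    toolchain="gcc",
    libpath="./compiled_model.so",
    params={"parallel_comp": 8},
    verbose=True,
)

# Re-import the shared library and run inference from it.
predictor = treelite_runtime.Predictor("./compiled_model.so", nthread=4)
dmat = treelite_runtime.DMatrix(np.random.rand(1000, num_features))
preds = predictor.predict(dmat)
```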

Why, then, is such compilation to machine code, with prior conversion to Treelite objects [1], not used or supported before importing models into Triton itself? Wouldn't it offer a similar level of performance improvement over the text-based models that are currently imported? I'm fairly sure compilation is not used now, because for sklearn models it would take a noticeable amount of time: a few minutes if multiprocessing works correctly (LightGBM and XGBoost models would compile 10x faster, but still noticeably slower than the current tritonserver startup times).

[1] The conversion to `treelite.frontend.Model` for XGBoost and LightGBM Boosters (using the `Model.from_*` Treelite converter methods) and also for most `sklearn.ensemble` models (using a long list of dedicated methods such as `treelite.sklearn.SKLRFClassifierConverter.process_model()` for Random Forest classifiers, etc.)
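A sketch of the conversion entry points mentioned in [1], again assuming the Treelite 2.x API; `xgb_booster`, `lgbm_booster`, and the fitted `clf` are placeholders for already-trained models, and in practice the per-estimator sklearn converter classes are reached via the public `treelite.sklearn.import_model()` helper rather than called directly:

```python
import treelite
import treelite.sklearn

# XGBoost / LightGBM boosters have dedicated from_* converters.
tl_from_xgb = treelite.Model.from_xgboost(xgb_booster)
tl_from_lgbm = treelite.Model.from_lightgbm(lgbm_booster)

# sklearn.ensemble models go through the public import helper, which
# dispatches internally to per-estimator converters such as
# treelite.sklearn.SKLRFClassifierConverter.
tl_from_rf = treelite.sklearn.import_model(clf)
```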

wphicks commented 1 year ago

Thanks for the suggestion! For GPU execution, Treelite-compiled models obviously would not work, and we would expect better performance from FIL on GPU than from Treelite-compiled models on CPU.

On CPU, the picture is not as clear. The current CPU FIL implementation has quite good performance but can be beaten by Treelite-compiled models under some deployment scenarios. However, we are just about to move to a new CPU FIL implementation, which we expect to outperform Treelite-compiled models in most deployment scenarios (though not in absolutely 100% of cases). Even where Treelite-compiled models continue to outperform, we expect the performance differential to be much narrower for now and to disappear in the near future.

Given the limited expected benefit, it is worth considering the downsides of supporting Treelite-compiled models. Far and away the biggest concern is the security risk of loading arbitrary shared libraries like that. While that might be acceptable in an environment with tight controls on the introduction of new models, the future of Treelite's model compilation is also relevant. Treelite model compilation may be dropped in the future (in part due to the performance of the new CPU FIL implementation), and even if it is not, it would be difficult for it to keep up with new features added both to Treelite in general and to the training frameworks.

With all that in mind, I do not see a compelling case for supporting pre-compiled Treelite models. If there are other aspects of this that I'm not considering, please feel free to reopen this and add additional context.