pseudotensor opened 6 years ago
This also means we can overlap multiple XGBoost runs and get roughly 3X as many models completed for normal GBM. That is useful for DAI, where there are typically 8 models but only (say) 2-3 GPUs; those models could then run in parallel instead of sequentially, giving a nice boost (see the sketch below).
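As a rough illustration, here is a minimal sketch of what "overlapping multiple XGBoost runs" could look like from Python: several independent training processes sharing one GPU, relying on MPS to interleave their kernels. The dataset shape, `n_jobs`, and parameter choices are illustrative assumptions, not anything XGBoost prescribes.

```python
# Sketch: run several GPU XGBoost trainings concurrently on one GPU so MPS
# can overlap their kernels. Assumes the MPS daemon is already running and
# that each job's data fits in GPU memory alongside the others.
import multiprocessing as mp

import numpy as np
import xgboost as xgb


def train_one(seed):
    rng = np.random.RandomState(seed)
    X = rng.rand(100_000, 50)
    y = rng.rand(100_000)
    dtrain = xgb.DMatrix(X, label=y)
    params = {
        "tree_method": "gpu_hist",       # GPU histogram algorithm
        "gpu_id": 0,                      # all jobs share the same GPU under MPS
        "max_depth": 8,
        "objective": "reg:squarederror",
    }
    xgb.train(params, dtrain, num_boost_round=100)


if __name__ == "__main__":
    n_jobs = 3  # e.g. ~3 models per GPU, per the ~3X estimate above
    ctx = mp.get_context("spawn")  # separate CUDA context per process
    with ctx.Pool(n_jobs) as pool:
        pool.map(train_one, range(n_jobs))
```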
Use NVIDIA MPS and tune the per-kernel core count to its per-kernel optimum (I found previously that the kernels don't need all cores). MPS will then allow kernels from different processes to overlap (i.e. run in parallel) instead of serializing, using the cores more efficiently. This works when the workload is not memory bound, or when there is a lot of memory latency one wants to hide. I found previously that the GPU xgboost kernels spend a lot of time stalled on memory (70% of the time!), so it should be possible to run ~3X the kernels for a ~3X faster random forest (i.e. build 3 trees in parallel). With the per-kernel core-count reduction, one might be able to squeeze out more and get ~5X performance or so.
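One way to express the "tune kernel core count" part under MPS is the per-client active-thread percentage, which caps the SM fraction each client process may occupy. A minimal sketch follows; it assumes a Volta-or-newer GPU, that the MPS control daemon was already started outside Python (e.g. with `nvidia-cuda-mps-control -d`), and the 30% figure is just an illustrative choice matching the "~3 kernels in parallel" estimate above.

```python
# Sketch: cap the SM share each MPS client may use, leaving headroom so
# kernels from the other concurrent xgboost processes can be co-scheduled.
import os


def limit_sm_share(percent):
    # Must be set before this process initializes CUDA (i.e. before its
    # first GPU call in xgboost), otherwise MPS ignores it for this client.
    os.environ["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percent)


# Example: each of 3 concurrent training processes calls this first.
limit_sm_share(30)
```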
@RAMitchell