pseudotensor opened 6 years ago
This also means we can overlap multiple XGBoost runs and get roughly 3X as many models completed for normal GBM. That is useful for DAI, where there are typically 8 models but only (say) 2-3 GPUs; those models could then run in parallel instead of sequentially, giving a nice boost (see the sketch below).
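As a rough illustration, here is a minimal sketch of what "overlapping multiple XGBoost runs" could look like from Python: several independent training processes sharing one GPU, relying on MPS to interleave their kernels. The dataset shape, `n_jobs`, and parameter choices are illustrative assumptions, not anything XGBoost prescribes.

```python
# Sketch: run several GPU XGBoost trainings concurrently on one GPU so MPS
# can overlap their kernels. Assumes the MPS daemon is already running and
# that each job's data fits in GPU memory alongside the others.
import multiprocessing as mp

import numpy as np
import xgboost as xgb


def train_one(seed):
    rng = np.random.RandomState(seed)
    X = rng.rand(100_000, 50)
    y = rng.rand(100_000)
    dtrain = xgb.DMatrix(X, label=y)
    params = {
        "tree_method": "gpu_hist",       # GPU histogram algorithm
        "gpu_id": 0,                      # all jobs share the same GPU under MPS
        "max_depth": 8,
        "objective": "reg:squarederror",
    }
    xgb.train(params, dtrain, num_boost_round=100)


if __name__ == "__main__":
    n_jobs = 3  # e.g. ~3 models per GPU, per the ~3X estimate above
    ctx = mp.get_context("spawn")  # separate CUDA context per process
    with ctx.Pool(n_jobs) as pool:
        pool.map(train_one, range(n_jobs))
```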
Use NVIDIA MPS and tune the per-kernel core count to its per-kernel optimum (I found previously that the kernels don't need all cores). MPS will then allow kernels from different processes to overlap (i.e. run in parallel) instead of serializing, using the cores more efficiently. This works when the workload is not memory bound, or when there is a lot of memory latency one wants to hide. I found previously that the GPU xgboost kernels spend a lot of time stalled on memory (70% of the time!), so it should be possible to run ~3X the kernels for a ~3X faster random forest (i.e. build 3 trees in parallel). With the per-kernel core-count reduction, one might be able to squeeze out more and get ~5X performance or so.
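One way to express the "tune kernel core count" part under MPS is the per-client active-thread percentage, which caps the SM fraction each client process may occupy. A minimal sketch follows; it assumes a Volta-or-newer GPU, that the MPS control daemon was already started outside Python (e.g. with `nvidia-cuda-mps-control -d`), and the 30% figure is just an illustrative choice matching the "~3 kernels in parallel" estimate above.

```python
# Sketch: cap the SM share each MPS client may use, leaving headroom so
# kernels from the other concurrent xgboost processes can be co-scheduled.
import os


def limit_sm_share(percent):
    # Must be set before this process initializes CUDA (i.e. before its
    # first GPU call in xgboost), otherwise MPS ignores it for this client.
    os.environ["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(percent)


# Example: each of 3 concurrent training processes calls this first.
limit_sm_share(30)
```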
@RAMitchell