szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

Rborist #20

Closed: szilard closed this issue 8 years ago

szilard commented 8 years ago

Thanks @suiji for the Rborist code. If I run it with 100 trees as in https://github.com/szilard/benchm-ml/tree/master/z-other-tools (on a 32-core box) I get: Time: 87 sec, AUC: 66.43. Something is wrong; the AUC is very low.

I checked out the latest GitHub version, then in the ArboristBridgeR/Package dir I ran ./dev.sh, which created Rborist.tar.gz, and then I installed it with R CMD INSTALL.

szilard commented 8 years ago

FYI I cleaned up the code a bit: https://github.com/szilard/benchm-ml/blob/master/2-rf/9a-Rborist.R, but of course the results above don't change. Any idea why the AUC is so low?
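
For reference, the core of the linked script looks roughly like this. A minimal sketch: the file names, the dep_delayed_15min target, and the ctgCensus/prob extraction follow the benchm-ml setup and a reading of the Rborist docs, so treat them as assumptions rather than the script verbatim.

    library(Rborist)
    library(ROCR)

    d_train <- read.csv("train-1m.csv")   # assumed file names
    d_test  <- read.csv("test.csv")

    X_train <- subset(d_train, select = -dep_delayed_15min)
    y_train <- d_train$dep_delayed_15min  # factor with levels "N"/"Y"

    # Time only the training phase, as done throughout this thread.
    print(system.time({
      md <- Rborist(X_train, y_train, nTree = 100)
    }))

    # Class probabilities on the test set; the argument and field names
    # here are assumptions based on the Rborist documentation.
    pr   <- predict(md, subset(d_test, select = -dep_delayed_15min), ctgCensus = "prob")
    phat <- pr$prob[, "Y"]

    rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
    cat("AUC:", performance(rocr_pred, "auc")@y.values[[1]], "\n")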

suiji commented 8 years ago

Please use the CRAN.sh script to build, for which -O2 should be enabled by default. Script "dev.sh" is only for tweaking development builds; I will make a note of this somewhere, so that other users do not suffer this indignity.

i) Speed problems may be due to compiling at a low optimization level. At -O2, I repeatedly see 500 trees complete in about 80 sec on the 0.1M case and 800+ sec on the 1M case.

If things still do not improve on the 32-core system, an educated guess would be that there are false-sharing problems. If so, I will need to take a closer look at the OMP-driven loops.

ii) Will need to investigate low AUC. What is a "good" level for 500 trees?

Thank you. mls

suiji commented 8 years ago

Oh, I see: your results are for the 1M case - not 0.1M, as I had mistakenly surmised. Extrapolating to 500 trees, then, would imply somewhat less than 450 sec., which is actually not so shabby. We can table the CRAN.sh and false-sharing topics for this discussion, then, and I shall focus on understanding the AUC deficiency.

Thank you, mls

szilard commented 8 years ago

OK, redone with ./CRAN.sh

Btw let's use 1M training set and 100 trees. Take a look at https://github.com/szilard/benchm-ml/tree/master/z-other-tools for a comparison (time and AUC) with other tools.

I now get with Rborist: Time: 88 sec, AUC: 66.91 (so essentially the same).

I'm using this code: https://github.com/szilard/benchm-ml/blob/master/2-rf/9a-Rborist.R

suiji commented 8 years ago

Looks like the AUC problem may be similar to the one with Spark: prediction probabilities are derived from votes instead of from leaf weights. This will be changed before the next CRAN release and, one hopes, will improve the benchmark results.
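
An illustrative toy sketch (not Rborist code) of why vote-derived scores can depress AUC:

    # With 100 trees, vote fractions are multiples of 1/100, so the scores
    # take at most 101 distinct values; many test cases tie, and the ROC
    # curve is evaluated on a coarse grid. Leaf-weighted scores are nearly
    # continuous and break those ties.
    ntree   <- 100
    votes   <- rbinom(1e5, ntree, 0.2)   # simulated "Yes" vote counts
    p_votes <- votes / ntree
    length(unique(p_votes))              # at most ntree + 1 distinct scores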

szilard commented 8 years ago

Great, thanks.

szilard commented 8 years ago

Closing this issue, feel free to reopen if you have any new results.

szilard commented 8 years ago

Thanks for the code in the pull request; will test it soon. Re-opened the GH issue.

suiji commented 8 years ago

Thank you for re-opening the issue.

The current GH version has a better memory footprint and may be a tad faster.

The probability census now reports leaf-weighted scores, as opposed to normalized votes. AUC scores are only marginally higher, however. This may be because most (~75%) leaves are "pure", in the sense that only a single category is reported, and hence undertake no weighting. The real problem for Rborist may derive from a high(ish) false-negative rate, as exhibited by confusion matrices produced during validation. It would be instructive to look at confusion matrices produced by some of the other packages for this data set. That said, however, even models with very high false-negative validation rates can still give rise to higher AUC values with the ROCR package.

suiji commented 8 years ago

It looks like Rborist is overfitting. Adjusting some of the default parameters makes this less of a problem, but it still cannot touch an AUC of 70.0. Note that AUC values seem to match those for the randomForest package when one-hot encoding is used but, in that case, the misprediction rate for "Yes" is abysmally high: IOW, one-hot encoding produces forests that vote "No" nearly always.

Rborist does not currently employ binning. This may be how H2O and xgboost avoid overfitting. Will investigate, but binning will not make it into the upcoming CRAN release.
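
For anyone reproducing the randomForest comparison, a hedged sketch (data objects carried over from the earlier sketch; the model.matrix expansion is a standard way to one-hot encode, not necessarily the benchmark's exact code):

    library(randomForest)

    # Expand factor predictors into 0/1 indicator columns ("one-hot").
    X1h <- model.matrix(~ . - 1, data = subset(d_train, select = -dep_delayed_15min))

    md_rf <- randomForest(X1h, d_train$dep_delayed_15min, ntree = 100)
    md_rf$confusion   # OOB confusion matrix; check the class-wise error rates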

szilard commented 8 years ago

Thanks for providing the above info. Is the AUC < 70.0 on the 1M dataset and 100 trees?

Btw, as far as I know xgboost does not use binning. Binning is typically used in distributed products such as H2O or Spark MLlib.

If you get low AUC it can be underfitting or a bug, not necessarily overfitting, I think.

suiji commented 8 years ago

The problem went away with a recent refactoring of the code. Currently seeing AUC > 74.0 on 1M samples and 300 trees.

Agree with your assessment that overfitting was a bad diagnosis. The shabby AUC appears to have been due to a failure to remap level values when predicting on separate data.

Thank you for providing a very instructive test.

szilard commented 8 years ago

That's great, I re-ran it now and got AUC 74.1. However, there seems to be a slowdown vs. last time: 1M records/100 trees/32 cores used to be 87 sec (see above), now 266 sec.

suiji commented 8 years ago

Pretty sure that the 87s timing was observed when restricting to 20 levels - "nLevel=20", in the case of Rborist. That said, there has been a slowdown, noticeable even at twenty levels. This may be related to the run-sorting mechanism, which is naive but should probably be incremental. It could, in addition, be affected by wasteful restaging, although restaging has always been wasteful for classification.
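
For clarity, the two configurations under discussion look like this (a sketch reusing the data objects from the earlier sketch):

    md_20  <- Rborist(X_train, y_train, nTree = 100, nLevel = 20)  # depth capped at 20 levels
    md_all <- Rborist(X_train, y_train, nTree = 100)               # depth unconstrained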

szilard commented 8 years ago

Both were run with https://github.com/szilard/benchm-ml/blob/master/z-other-tools/8a-Rborist.R, i.e. defaults for the parameters; maybe the defaults have changed?

suiji commented 8 years ago

Possibly. Anyway, thank you for checking.

szilard commented 8 years ago

And I'm timing just the training, not the prediction.

suiji commented 8 years ago

Looks like a recent change precipitated the decline in performance. It has been corrected. Performance was not being tracked alongside source code changes; this, too, has been corrected.

Thank you for your patience.

szilard commented 8 years ago

Now with 32 cores/1M records/100 trees/depth=20: time = 160 sec, AUC = 74.0. I added Rborist here: https://github.com/szilard/benchm-ml/tree/master/z-other-tools

Btw, htop/mpstat show only about 15-20% CPU utilization (32 cores).

suiji commented 8 years ago

Are you certain you have the latest GitHub release? Those results sound like the ones from a few days ago.

No need to restrict to "nLevel=20" with the newest version.

Thanks, mls

szilard commented 8 years ago

Yeah, I git-pulled just then (from master). I use depth=20 for all the other tools. You can run the same code on an EC2 r3.8xlarge: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/8a-Rborist.R

suiji commented 8 years ago

Not sure what is going on. I observed 165 seconds on a 4-core with no depth restrictions. The 4-core usually tracks the 32-core by 2x, suggesting 80-85s there.

The low core occupancy might be a clue. Have you observed this before on the 32-core?

szilard commented 8 years ago

I don't remember; I looked at so many tools and didn't always take rigorous notes on mpstat/vmstat readings. Maybe you can compare 4 cores vs. 16/32 cores with the current version vs. some older version (all 4 combos).

szilard commented 8 years ago

And I would keep it at 1M records/100 trees/depth=20.

suiji commented 8 years ago

Yes - will try it on an 8-core here and see how it behaves, then try the beefier resources.

FWIW: There are two phases which make heavy use of OMP: splitting should already be using all cores, and it would be a big surprise if it suddenly stopped doing so. The other big client is restaging which, at the moment, only parallelizes by predictor so, with 8 predictors, would yield 8/32 occupancy. Restaging tends to dominate in the upper branches, so this could (one hopes) be the effect you are seeing. It should be possible to open up restaging to all cores without too much work.

Validation should be using all cores already, but this is typically only the final few percent of the execution time.

szilard commented 8 years ago

Great, and thanks for the insights. I'm curious how it works out. Lmk.

suiji commented 8 years ago

There is a new version on GitHub that lifts the by-predictor constraint on restaging. IOW, restaging should now be able to use all cores, as splitting and validation/prediction already can.

For the 1M sample/100 tree case, an old 4-core Phenom gives: 145s for 20 levels and 180s for unconstrained trees.

On a recent-vintage 8-core Intel, I cloned the Github project just to ensure that the build would be clean. The performance monitor shows all cores busy, albeit oscillating between 50% and 100% from tree to tree. The timings are 66s for 20 levels and 79s unconstrained, both obtained by modifying the script in the "other tools" directory.

I have not tried a 32-core server yet, but I do believe these results are real. In particular, AUC is around 74.0.

szilard commented 8 years ago

32 cores (1M records/100 trees/depth=20): 66 sec, AUC 73.8 :)

I wonder if this much tuning makes things better in general or just for this dataset (kinda "overfitting"). On the other hand, it would of course be better if the benchmark used a variety of, say, 10 datasets of different structure etc.

Anyway, here you go: https://github.com/szilard/benchm-ml/commit/a3fc7e7ef979a9bf02e841cd2ef53ddc6c2d8412

suiji commented 8 years ago

Great - thank you very much.

Actually, it was more a matter of "detuning": the Arborist was originally conceived as a tool for high predictor counts and high-ish 'mtry', with restaging offloaded to a coprocessor. Hence the focus on predictor-level parallelization at the expense of node-level parallelization. All that happened in the most recent change was to expose the potential to exploit additional opportunities for parallelization. It's entirely possible that some applications may suffer as a result, but this seems like a step in the right direction.

More along the lines of tuning would be trying to predetermine an appropriate level count or minimum information gain, node size, etc.

szilard commented 8 years ago

OK

Looks good for now :)