szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

xgboost RF bump for n=10M #14

Closed · szilard closed this issue 9 years ago

szilard commented 9 years ago

Moved here from https://github.com/szilard/benchm-ml/issues/2: "something weird happens for the largest data size (n=10M) - the trend for run time and AUC 'breaks'; see figures in the main README."

szilard commented 9 years ago

@tqchen says: "I now think the bump in running time was due to cache-line issues, as there is some non-consecutive memory access going on in xgboost. Having a larger number of rows could mean a lower cache hit rate, but the impact should not be large, as this has to do with micro-level optimization.

I have pushed some optimization to do prefetching, which should in general improve the speed of xgboost. It would be great if you could run another round of tests."

tqchen commented 9 years ago

Thanks. I have to note that the bump in the trend is still likely to exist, but its impact should be limited to the micro-level effects I mentioned. At least now we know the cause of this phenomenon :)

tqchen commented 9 years ago

As for the AUC part, I find that, at least for boosting, treating all the dates and times as integers definitely gives better results.
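A minimal sketch of the integer encoding idea above. The function name and feature choices are made up for illustration; the point is that ordered integer features let a boosted tree split on thresholds (e.g. "month <= 6"), which one-hot dummies cannot express as compactly:

```python
# Hypothetical helper: turn a timestamp into ordered integer features
# instead of categorical dummies, as suggested for boosting above.
from datetime import datetime

def encode_datetime(ts: str) -> list[int]:
    """Map an ISO timestamp to integer features a tree can threshold on."""
    dt = datetime.fromisoformat(ts)
    # Ordinal day preserves global ordering; month/weekday/hour capture
    # periodic structure while remaining ordered integers.
    return [dt.toordinal(), dt.month, dt.weekday(), dt.hour]

rows = ["2007-03-14T09:30:00", "2007-11-02T17:45:00"]
features = [encode_datetime(r) for r in rows]
print(features)
```

These integer columns would then be fed to xgboost as ordinary numeric features.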

szilard commented 9 years ago

I think that's a reasonable explanation. I re-ran it and there was a significant improvement for n=10M (from 4800 sec to 3000 sec). The Time vs. size curve is still convex (see updated graphs in the README), but your previous comments may explain this.
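A minimal sketch (not the benchmark's actual harness) of how the Time vs. size curve above can be tabulated; `train_stub` is a stand-in workload, where the real benchmark trains xgboost on n rows:

```python
# Hypothetical timing harness: measure wall-clock time per data size
# so the run-time-vs-n trend (and any "bump") can be inspected.
import time

def train_stub(n: int) -> None:
    # Stand-in workload; the real benchmark trains a model on n rows.
    sum(i * i for i in range(n))

def time_vs_size(sizes):
    results = {}
    for n in sizes:
        t0 = time.perf_counter()
        train_stub(n)
        results[n] = time.perf_counter() - t0
    return results

timings = time_vs_size([10_000, 100_000, 1_000_000])
```

If run time scaled linearly in n, the per-row time `timings[n] / n` would stay roughly constant across sizes; a convex curve shows up as that ratio growing.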