szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

Update Latest version of XGBoost #37

Closed: tqchen closed this issue 8 years ago

tqchen commented 8 years ago

Thanks to this benchmark, we now have a good understanding of what is going on in #14 and #2: specifically, cache-line related issues with the exact greedy algorithm. See our detailed analysis in this paper: http://arxiv.org/abs/1603.02754

In short, the exact greedy algorithm suffers from cache-line issues on datasets larger than 1M rows; we can counterbalance this with pre-fetching, but it is still not perfect.

We are adding a new option called tree_method to xgboost, which will allow users to choose the algorithm. By default it will choose the faster one, and will print a message to the user when the approximate algorithm is chosen. I think it might be interesting to rerun the benchmark on this latest version.

See https://github.com/dmlc/xgboost/tree/master/R-package for instructions. Installing from the drat repo or from source should work. To confirm you are using the latest version, check whether this message appears when running on the 10M dataset:

Tree method is automatically selected to be 'approx'...
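
For reference, a minimal sketch of selecting the algorithm explicitly, shown with the standard xgboost Python API (the parameter is the same across bindings; the file name and tuning values here are hypothetical, not from this benchmark):

```python
# Minimal sketch: choosing the split-finding algorithm via tree_method.
# File name and tuning values are hypothetical.
import xgboost as xgb

dtrain = xgb.DMatrix("train.libsvm")  # hypothetical training data

params = {
    "objective": "binary:logistic",
    "max_depth": 10,
    "eta": 0.1,
    # "exact"  = exact greedy (enumerates all candidate splits)
    # "approx" = approximate binning (auto-selected on large data)
    "tree_method": "exact",
}
bst = xgb.train(params, dtrain, num_boost_round=300)
```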
szilard commented 8 years ago

Thanks @tqchen, I saw the paper last week and was thinking the same.

By the way, I think this is an excellent read for everyone interested in ML, the most accurate methods, and the best implementations (fast, low RAM footprint). So, here is the link again: http://arxiv.org/abs/1603.02754

I'll re-run the benchmark with the newest version soon.

szilard commented 8 years ago

I re-ran the "absolute min benchmark" https://github.com/szilard/benchm-ml/tree/master/z-other-tools with the latest xgboost (from the drat repo). Results are the same as previously; run-time ~30 sec.

tqchen commented 8 years ago

Was that on the 1M dataset? The difference will only show up when you run the benchmark at the 10M scale.

szilard commented 8 years ago

Yes, it was. However, I have now run it on the 10M dataset too, and got the same run-time as before (~600 sec with tree_method = "exact").

tqchen commented 8 years ago

The speed improvement was mainly in the approximate binning algorithm, which will be picked automatically on large datasets if tree_method is not set.


szilard commented 8 years ago

Yes, with the new version the same code as before picked up binning automatically:

Tree method is automatically selected to be 'approx' for faster speed. to use old behavior(exact greedy algorithm on single machine), set tree_method to 'exact'

If I let it do that, then on 10M rows:

| tree_method | run-time | AUC |
| --- | --- | --- |
| binning (approx) | 830 s | 0.757 |
| exact | 600 s | 0.764 |

So, weirdly, binning is slower (and less accurate) here.

szilard commented 8 years ago

Btw, there are only 2 numeric columns in this dataset; the rest are categorical. I guess binning applies to the numeric columns only, right?

tqchen commented 8 years ago

I see. It could be due to the fact that, given the statistics of this dataset, binning is less advantageous. The dataset I used contains more numerical columns, which could make some difference. In general the approximate algorithm enjoys better scaling properties, but may have a bit worse initial speed.

Thanks for doing the benchmarks.

szilard commented 8 years ago

Well, xgboost is already damn fast. Looking forward to meeting you at the meetup talk + workshop next week.

tqchen commented 8 years ago

Sure :) For some context with respect to the perf on this thread, in case others read this thread.

szilard commented 8 years ago

Thanks for writing down the above.

For categorical columns, do you count them before or after the 1-hot encoding expansion (i.e., 6 or ~600 columns in the case of the airline dataset)?

tqchen commented 8 years ago

Yes, and the categorical treatment is the same among all methods.


szilard commented 8 years ago

I meant: in the case of the airline dataset, which has 6 categorical columns that get expanded by 1-hot encoding to ~600 columns (though represented in a sparse matrix format), do you consider that 6 or 600 columns for the purpose of parallelization?

tqchen commented 8 years ago

It will be 600 columns.
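
For illustration, a minimal sketch of the kind of sparse 1-hot expansion being discussed (toy data with hypothetical values; this is not the benchmark's actual preprocessing code):

```python
# Minimal sketch of sparse 1-hot encoding; toy data, hypothetical values.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    "Month": ["c-1", "c-2", "c-1"],       # categorical
    "UniqueCarrier": ["AA", "UA", "AA"],  # categorical
    "Distance": [308, 1430, 802],         # numeric, passed through as-is
})

# Each categorical level becomes its own 0/1 column in a sparse matrix;
# this is how a handful of categorical columns turn into hundreds.
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(df.to_dict(orient="records"))
print(X.shape)  # (3, 5): two 2-level categoricals plus the numeric column
```

So per the answer above, it is the expanded (post-encoding) column count that matters for parallelization.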

szilard commented 8 years ago

OK, thanks for the info.

suiji commented 8 years ago

> For some context with respect to the perf on this thread, in case others read this thread.

I'm sorry to be missing your talk. Please do send up a flare if you are ever speaking in Seattle.

You've mentioned prefetching both in this thread and in the preprint. Does this buy you much performance? We've experimented with prefetching in the Arborist for loops with irregular access, but so far have seen speedups of only a few percent. Do you worry about portability?

Thank you.

szilard commented 8 years ago

Good questions, @suiji. I also saw you just gave a talk at the R/Finance conference: http://www.rinfinance.com/agenda/2016/talk/MarkSeligman.pdf :)

tqchen commented 8 years ago

@suiji Prefetching only helps the external-memory case, i.e. when the data is too big to fit in memory and needs to be stored on disk.
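
A minimal sketch of that external-memory mode, assuming the cache-file URI syntax from the xgboost documentation (file name and cache prefix are hypothetical):

```python
# Minimal sketch of xgboost's external-memory mode; file name and cache
# prefix are hypothetical.
import xgboost as xgb

# The '#' suffix asks xgboost to stream the libsvm file from disk and
# build an on-disk cache with the given prefix, instead of loading the
# whole dataset into RAM. This is the setting where prefetching helps.
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")

bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=100)
```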

suiji commented 8 years ago

> Prefetching only helps the external-memory case, i.e. when the data is too big to fit in memory and needs to be stored on disk.

Thank you. Can you answer a few more questions?

i) Concerning portability:

You point out that the airline data crosses a threshold in the memory hierarchy somewhere near 1M rows. Even distilling liveness and branching information down to single bits, data locality takes a hit at this threshold because memory access is completely irregular. How do you determine the threshold location? Do you poll the hardware to determine the size of the L1 cache, for example? Do you autotune, as some packages do, in order to divine the value?

ii) Concerning factor representations:

Earlier in the thread you mention that factors (categories) employ a 1-hot representation. Does xgboost only employ a 1-hot representation, or is there also a way of encoding factor levels internally? I ask because the Arborist employs an internal factor encoding; in fact, we are only just now implementing the sparse internal representation needed to benefit from a 1-hot encoding.

ghost commented 7 years ago

Thank you for putting together this amazing library! How can one set tree_method to exact in Python?