Difficulty reproducing benchmarks

kpedro88 commented 2 years ago

I'm interested to compare my own custom reader to this one, but I'm having some trouble getting benchmark results consistent with what the readme reports. Details:

CPU: Intel(R) Xeon(R) W-2295 CPU @ 3.00GHz
Compiler: GCC 9.3.0
Python: 3.9.6 (newer)
xgboost: 1.3.3 (newer)
m2cgen: 0.9.0
ROOT: 6.22.08 (newer)

My results: FastForest: 1.61 s treelite: -* m2cgen: 1.71 s xgboost: 0.12 s TMVA: 8.95 s

The relative differences in these numbers are rather different from the readme (and xgboost is blazing fast somehow).

It could be useful to distribute a Dockerfile that sets up and runs all the benchmarks in a more controlled environment (OS, versions, etc.).

* The treelite benchmark (using version 2.1.0) doesn't work at all. I get the following error:

ModuleNotFoundError: No module named 'treelite.runtime'

If I swap it with treelite_runtime (a separate pip package), then there's another error:

AttributeError: module 'treelite_runtime' has no attribute 'Batch'

guitargeek commented 2 years ago

Hi @kpedro88, thanks for checking out the library and opening the issue!

Indeed, updating the benchmarks was long overdue. I have now added a commit where I updated the benchmark code and results. While doing that, I noticed some performance regression in fastforest in the last year and also fixed that (see the second-to-last commit).

I hope you can reproduce the benchmarks now and that they will make sense to you! The reason why xgboost appeared so fast was that newer versions use multithreading by default, so I needed to explicitly set it to one thread now. Treelike will probably be slower than fastforest, at least that's what I observed. Fastforest and m2cgen are very similar now apparently, in your numbers FastForest was faster and for me it was m2cgen. A few years ago m2cgen was still slower, but I suppose it caught up thanks to improved compilers.

Let me know if these issue is resolved for you or if you want me to follow up on something :) I'm also very curious about your own reader!

kpedro88 commented 2 years ago

Thanks @guitargeek, I'm able to run all the benchmarks now and I get similar, though not exactly the same, behavior on my Intel CPU. I also observe that the results vary more than expected based on the randomly-generated BDT and input data.

Versions:

Compiler: GCC 11.1.0
Python: 3.9.6 (older)
xgboost: 1.5.2
m2cgen: 0.9.0
ROOT: 6.24.06 (newer)

Results: FastForest: 1.34 s treelite: 2.24 s m2cgen: 1.41 s xgboost: 1.33 s TMVA: 9.2 s

kpedro88 commented 2 years ago

I'm just going to summarize the details of my custom reader here, since it turns out not to be faster. (At one time, I had observed it to be faster than the CMS GBRTree, but this was long ago with gcc ~5.)

GBRTree/FastForest stores the trees in separate vectors:

        std::vector<int> rootIndices_;
        std::vector<CutIndexType> cutIndices_;
        std::vector<FeatureType> cutValues_;
        std::vector<int> leftIndices_;
        std::vector<int> rightIndices_;
        std::vector<TreeResponseType> responses_;

and uses them as (inside of the index-updating while loop):

            auto r = rightIndices_[index];
            auto l = leftIndices_[index];
            index = array[cutIndices_[index]] > cutValues_[index] ? r : l;

My idea was to store all elements in a single vector to improve memory locality, originally using a struct, but it can be simplified even further:

    std::vector<int> rootIndices_;
    //"node" layout: index, cut, left, right
    std::vector<int> nodes_;
    std::vector<float> responses_;

The inside of the index-updating while loop then becomes:

                node = &nodes_[index];
                index = features[*node] <= (float&)(*++node) ? *++node : *(node+2);

Surprisingly, this is actually slower than FastForest. It took godbolt, llvm-mca, callgrind, and Vincenzo to figure out why: gcc emits branchless code for the FastForest version, but switches to conditional branching for my version. It's then slowed down by branch prediction misses.

It's not entirely clear why gcc does this, but I eventually found a modification that forces it to be branchless:

                node = &nodes_[index];
                index = *(node+2+(features[*node] > (float&)(*(node+1))));

which results in a similar speed to FastForest (still a bit slower, suffering from a few extra instructions).

Interestingly, clang (12.0.0) emits branchless instructions for both versions and they perform very similarly. (And gcc with -O3 emits branching instructions for both and they slow down...)

The demo code is in https://github.com/kpedro88/XGBoost-FastForest/blob/aos3/benchmark/benchmark-01-aos.cpp and a script to run everything is https://gist.github.com/kpedro88/4f60741a1d3b8b2830d4447f239f13a2#file-run_bdt-sh. Godbolt links: FastForest: https://godbolt.org/z/3EosczYEf Mine: https://godbolt.org/z/nddbMWq4h Mine (branchless): https://godbolt.org/z/E5xrcYdMG

guitargeek / XGBoost-FastForest

Difficulty reproducing benchmarks #16