thesps / conifer

Fast inference of Boosted Decision Trees in FPGAs
Apache License 2.0
40 stars 22 forks source link

Expression balancing #66

Closed thesps closed 3 days ago

thesps commented 2 weeks ago

The implementation of summation over tree scores in HLS backend prevents automatic expression balancing when saturation and rounding are used in the score precision type.

    Trees:
    for(int i = 0; i < n_trees; i++){
      Classes:
      for(int j = 0; j < fn_classes(n_classes); j++){
        score[j] += scores[i][j];
      }
    }

This should be replaced with a balanced tree reduce implementation.

francescobrivio commented 2 weeks ago

Hi @thesps!

I think I have a possible implementation of the solution in https://github.com/francescobrivio/conifer/commit/341627cc3f951d82642696c69b54aca7de422966 (but further testing is needed on this preliminary implementation!) which drastically reduce the latency in case the AP_SAT flag is used in the ScorePrecision. However this comes at the cost of a slight increase in resources usage due to the need of "inverting" scores[n_trees][fn_classes(n_classes)] (in order to apply the tree reduce method).

So I think this can potentially be further improved if we can change this line: https://github.com/thesps/conifer/blob/bfe3d186df60b95b045d81e68ea3dae36cabb3e6/conifer/backends/xilinxhls/firmware/BDT_unrolled.h#L103 to fill the scores in "reversed order", i.e. score_t scores[fn_classes(n_classes)][n_trees], but I'm not sure if this requires much deeper changes in the code and/or if it's acceptable :D

What do you think?

francescobrivio commented 2 weeks ago

I have run a few tests using the same identical xgb model and changing the Conifer "version" and the ScorePrecision:

Conifer version ScorePrecision vsynth LUT vsynth FF Latency
master ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51936 5032 133
My commit ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51163 4489 9
-------- -------- -------- -------- --------
master ap_fixed<11,4,AP_RND_CONV> 43329 3542 4
My commit ap_fixed<11,4,AP_RND_CONV> 42992 4504 5

So I have to correct my previous post: in case of AP_SAT my commit shows better latency and resources, while it has a slightly negative impact on both when AP_SAT is not used (to be improved).

Note: Please note that I still have to run the inference on few events to validate the conifer scores with respect to xgb.

thesps commented 2 weeks ago

Thanks for the development @francescobrivio this looks really promising! I think your proposal to swap the order of the indices of the scores array makes complete sense to avoid the inversion in silico and there should be no problem doing that.

As well as changing the BDT_rolled.cpp HLS code, you would need to change the Python writer for the fully unrolled optimization, that should be constrained to this line, I think:

newline += f'  scores[{it}][{ic}] = tree_{it}_{ic}.decision_function(x);\n'
francescobrivio commented 2 weeks ago
Thanks for the feedback Sioni! Ok I have implemented the "swapping" of the scores array and here are the updated results: Conifer version ScorePrecision vsynth LUT vsynth FF Latency
master ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51936 5032 133
My commit ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51163 4489 9
My commit V2 ap_fixed<11,4,AP_RND_CONV,AP_SAT> 51163 4489 9
-------- -------- -------- -------- --------
master ap_fixed<11,4,AP_RND_CONV> 43329 3542 4
My commit ap_fixed<11,4,AP_RND_CONV> 42992 4504 5
My commit V2 ap_fixed<11,4,AP_RND_CONV> 42988 3800 4

Considerations:

@thesps if you have further suggestions I'm happy to try them out! If not, and you agree, I can open a PR with these changes.

thesps commented 1 week ago

Nice, this looks great to me. Please go ahead and open a PR 👍

thesps commented 3 days ago

Resolved by @francescobrivio in #68