Expression balancing - Githubissues

thesps commented 2 weeks ago

The implementation of summation over tree scores in HLS backend prevents automatic expression balancing when saturation and rounding are used in the score precision type.

    Trees:
    for(int i = 0; i < n_trees; i++){
      Classes:
      for(int j = 0; j < fn_classes(n_classes); j++){
        score[j] += scores[i][j];
      }
    }

This should be replaced with a balanced tree reduce implementation.

francescobrivio commented 2 weeks ago

Hi @thesps!

I think I have a possible implementation of the solution in https://github.com/francescobrivio/conifer/commit/341627cc3f951d82642696c69b54aca7de422966 (but further testing is needed on this preliminary implementation!) which drastically reduce the latency in case the AP_SAT flag is used in the ScorePrecision. However this comes at the cost of a slight increase in resources usage due to the need of "inverting" scores[n_trees][fn_classes(n_classes)] (in order to apply the tree reduce method).

So I think this can potentially be further improved if we can change this line: https://github.com/thesps/conifer/blob/bfe3d186df60b95b045d81e68ea3dae36cabb3e6/conifer/backends/xilinxhls/firmware/BDT_unrolled.h#L103 to fill the scores in "reversed order", i.e. score_t scores[fn_classes(n_classes)][n_trees], but I'm not sure if this requires much deeper changes in the code and/or if it's acceptable :D

What do you think?

francescobrivio commented 2 weeks ago

I have run a few tests using the same identical xgb model and changing the Conifer "version" and the ScorePrecision:

Conifer version	ScorePrecision	vsynth LUT	vsynth FF	Latency
`master`	`ap_fixed<11,4,AP_RND_CONV,AP_SAT>`	51936	5032	133
My commit	`ap_fixed<11,4,AP_RND_CONV,AP_SAT>`	51163	4489	9
--------	--------	--------	--------	--------
`master`	`ap_fixed<11,4,AP_RND_CONV>`	43329	3542	4
My commit	`ap_fixed<11,4,AP_RND_CONV>`	42992	4504	5

So I have to correct my previous post: in case of AP_SAT my commit shows better latency and resources, while it has a slightly negative impact on both when AP_SAT is not used (to be improved).

Note: Please note that I still have to run the inference on few events to validate the conifer scores with respect to xgb.

thesps commented 2 weeks ago

Thanks for the development @francescobrivio this looks really promising! I think your proposal to swap the order of the indices of the scores array makes complete sense to avoid the inversion in silico and there should be no problem doing that.

As well as changing the BDT_rolled.cpp HLS code, you would need to change the Python writer for the fully unrolled optimization, that should be constrained to this line, I think:

newline += f'  scores[{it}][{ic}] = tree_{it}_{ic}.decision_function(x);\n'

francescobrivio commented 2 weeks ago

Thanks for the feedback Sioni! Ok I have implemented the "swapping" of the `scores` array and here are the updated results:	Conifer version	ScorePrecision	vsynth LUT	vsynth FF
`master`	`ap_fixed<11,4,AP_RND_CONV,AP_SAT>`	51936	5032	133
My commit	`ap_fixed<11,4,AP_RND_CONV,AP_SAT>`	51163	4489	9
My commit V2	`ap_fixed<11,4,AP_RND_CONV,AP_SAT>`	51163	4489	9
--------	--------	--------	--------	--------
`master`	`ap_fixed<11,4,AP_RND_CONV>`	43329	3542	4
My commit	`ap_fixed<11,4,AP_RND_CONV>`	42992	4504	5
My commit V2	`ap_fixed<11,4,AP_RND_CONV>`	42988	3800	4

Considerations:

In case of using AP_SAT: the result with the swapped scores is exactly identical...I guess Vitis is smart enough to figure it out?
In case of not using AP_SAT: avoiding to swap scores does bring improvements in both latency and resources
Overall with the latest changes ("V2") there is gain everywhere except for the usage of FF in no-AP_SAT case, but given FF are the most abundant resource I think we can afford it?

@thesps if you have further suggestions I'm happy to try them out! If not, and you agree, I can open a PR with these changes.

thesps commented 1 week ago

Nice, this looks great to me. Please go ahead and open a PR 👍

thesps commented 3 days ago

Resolved by @francescobrivio in #68

thesps / conifer

Expression balancing #66