Closed thesps closed 6 months ago
Hi @thesps!
I think I have a possible implementation of the solution in https://github.com/francescobrivio/conifer/commit/341627cc3f951d82642696c69b54aca7de422966 (further testing is still needed on this preliminary implementation!) which drastically reduces the latency when the `AP_SAT` flag is used in the `ScorePrecision`.
However, this comes at the cost of a slight increase in resource usage due to the need to "invert" `scores[n_trees][fn_classes(n_classes)]` (in order to apply the tree reduce method).
So I think this can potentially be improved further if we change this line:
https://github.com/thesps/conifer/blob/bfe3d186df60b95b045d81e68ea3dae36cabb3e6/conifer/backends/xilinxhls/firmware/BDT_unrolled.h#L103
to fill the `scores` array in "reversed order", i.e. `score_t scores[fn_classes(n_classes)][n_trees]`, but I'm not sure whether this requires much deeper changes in the code and/or whether it's acceptable :D
What do you think?
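To make the layout question concrete, here is a small Python model of the two index orders. It is only an illustration, not the HLS code: plain lists stand in for the `scores` array, the class dimension is simplified to `n_classes` instead of `fn_classes(n_classes)`, and the values are made up.

```python
n_trees, n_classes = 4, 3

# Current layout, scores_tc[tree][class]: one row per tree.
# The values are arbitrary placeholders (10 * tree + class).
scores_tc = [[10 * t + c for c in range(n_classes)] for t in range(n_trees)]

# "Reversed" layout, scores_ct[class][tree]: one row per class.
# This transposition is the extra "inversion" step that the current
# layout needs before a per-class reduction can be applied.
scores_ct = [list(column) for column in zip(*scores_tc)]

# With one row per class, each class reduces over a single 1D row.
per_class_sum = [sum(row) for row in scores_ct]
print(per_class_sum)
```

If the array is filled directly in the `[class][tree]` order, the transposition step is not needed at all.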
I have run a few tests using the same xgb model, changing the Conifer "version" and the `ScorePrecision`:

| Conifer version | ScorePrecision | vsynth LUT | vsynth FF | Latency |
|---|---|---|---|---|
| master | `ap_fixed<11,4,AP_RND_CONV,AP_SAT>` | 51936 | 5032 | 133 |
| My commit | `ap_fixed<11,4,AP_RND_CONV,AP_SAT>` | 51163 | 4489 | 9 |
| -------- | -------- | -------- | -------- | -------- |
| master | `ap_fixed<11,4,AP_RND_CONV>` | 43329 | 3542 | 4 |
| My commit | `ap_fixed<11,4,AP_RND_CONV>` | 42992 | 4504 | 5 |
So I have to correct my previous post: with `AP_SAT` my commit shows better latency and resources, while it has a slightly negative impact on both when `AP_SAT` is not used (to be improved).
Note: I still have to run the inference on a few events to validate the conifer scores against xgb.
Thanks for the development @francescobrivio, this looks really promising! I think your proposal to swap the order of the indices of the `scores` array makes complete sense to avoid the inversion in silico, and there should be no problem doing that.
As well as changing the `BDT_rolled.cpp` HLS code, you would need to change the Python writer for the fully unrolled optimization. That should be confined to this line, I think:
newline += f' scores[{it}][{ic}] = tree_{it}_{ic}.decision_function(x);\n'
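As a rough sketch of what the swapped writer output could look like: the function name `write_tree_calls`, its loop structure, and the `swapped` switch are hypothetical and only there for illustration. Only the f-string pattern is taken from the line above, and the real writer may be organised differently.

```python
def write_tree_calls(n_trees, n_classes, swapped=True):
    """Generate one decision_function call per (tree, class) pair.

    Hypothetical helper: with swapped=True the scores array is indexed
    as [class][tree], otherwise as [tree][class] (the current order).
    """
    newline = ''
    for it in range(n_trees):
        for ic in range(n_classes):
            index = f'[{ic}][{it}]' if swapped else f'[{it}][{ic}]'
            newline += f'  scores{index} = tree_{it}_{ic}.decision_function(x);\n'
    return newline


print(write_tree_calls(n_trees=2, n_classes=1))
```

Only the index order on the left-hand side changes; the `tree_{it}_{ic}` instance names stay the same.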
Thanks for the feedback Sioni!
Ok, I have implemented the "swapping" of the `scores` array and here are the updated results:

| Conifer version | ScorePrecision | vsynth LUT | vsynth FF | Latency |
|---|---|---|---|---|
| master | `ap_fixed<11,4,AP_RND_CONV,AP_SAT>` | 51936 | 5032 | 133 |
| My commit | `ap_fixed<11,4,AP_RND_CONV,AP_SAT>` | 51163 | 4489 | 9 |
| My commit V2 | `ap_fixed<11,4,AP_RND_CONV,AP_SAT>` | 51163 | 4489 | 9 |
| -------- | -------- | -------- | -------- | -------- |
| master | `ap_fixed<11,4,AP_RND_CONV>` | 43329 | 3542 | 4 |
| My commit | `ap_fixed<11,4,AP_RND_CONV>` | 42992 | 4504 | 5 |
| My commit V2 | `ap_fixed<11,4,AP_RND_CONV>` | 42988 | 3800 | 4 |
Considerations:
- With `AP_SAT`: the result with the swapped `scores` is exactly identical... I guess Vitis is smart enough to figure it out?
- Without `AP_SAT`: avoiding the swap of `scores` does bring improvements in both latency and resources.
- There is still a small increase in FF with respect to master (3800 vs 3542) in the non-`AP_SAT` case, but given that FFs are the most abundant resource I think we can afford it?

@thesps if you have further suggestions I'm happy to try them out! If not, and you agree, I can open a PR with these changes.
Nice, this looks great to me. Please go ahead and open a PR 👍
Resolved by @francescobrivio in #68
The implementation of the summation over tree scores in the HLS backend prevents automatic expression balancing when saturation and rounding are used in the score precision type.
This should be replaced with a balanced tree reduce implementation.
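The difference between the two summation styles can be sketched with a toy Python model. This is not the HLS implementation: `sat_add`, `sequential_sum`, and `tree_reduce` are illustrative names, and the clamp range of -1024 to 1023 is an assumption that models the raw values of a signed 11-bit type with saturation. The example also shows that saturating addition is not associative, which is a plausible reason why a tool cannot reorder a sequential chain on its own.

```python
def sat_add(a, b, lo=-1024, hi=1023):
    """Add two raw fixed-point values and clamp to [lo, hi] (models saturation)."""
    return max(lo, min(hi, a + b))


def sequential_sum(values):
    """Accumulate left to right: the chain of adds grows linearly with len(values)."""
    acc = 0
    for v in values:
        acc = sat_add(acc, v)
    return acc


def tree_reduce(values):
    """Balanced pairwise reduction over a non-empty list.

    Returns (result, depth), where depth is the number of adder levels,
    which grows logarithmically with len(values).
    """
    if len(values) == 1:
        return values[0], 0
    mid = len(values) // 2
    left, depth_left = tree_reduce(values[:mid])
    right, depth_right = tree_reduce(values[mid:])
    return sat_add(left, right), 1 + max(depth_left, depth_right)


values = [1000, 1000, -1000, -1000]
print(sequential_sum(values))  # saturates once on the way up
print(tree_reduce(values))     # saturates in both halves before the final add
```

For these inputs the two orders give different results, so the choice of reduction order is a design decision that has to be made explicitly in the code when saturation is enabled.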