XapaJIaMnu opened 1 year ago
Apologies for the late reply @hieuhoang, I finally have some benchmarks. I tested a tiny11.tied.w configuration from WMT a few years ago on 15.5k WMT sentences from the last several years (avg BLEU 36.{2,4,7} depending on configuration), with and without a shortlist, using the upstream fbgemm, intgemm, and fp32 CPU backends. Here are the results:
tl;dr the beam1 code is 1-5 seconds faster depending on the test case. The bigger the output layer, the larger the difference.
| Configuration | Beam1Opt | Master |
|---|---|---|
| AVX512, fbgemm, shortlist | 94.02s | 96.47s |
| AVX512, fbgemm, no shortlist | 182.80s | 188.05s |
| AVX2, fbgemm, shortlist | 114.26s | 115.75s |
| AVX2, fbgemm, no shortlist | 195.78s | 200.80s |
| AVX512, intgemm, shortlist | 106.20s | 108.71s |
| AVX512, intgemm, no shortlist | 194.08s | 200.36s |
| AVX2, intgemm, shortlist | 119.08s | 120.79s |
| AVX2, intgemm, no shortlist | 200.11s | 205.18s |
| AVX512, fp32, shortlist | 118.29s | 120.85s |
| AVX512, fp32, no shortlist | 209.82s | 216.63s |
| AVX2, fp32, shortlist | 135.16s | 136.96s |
| AVX2, fp32, no shortlist | 215.94s | 221.30s |
To download the models and test for yourself, please get this tarball: https://nbogoychev.com/files/speedtest.tar.gz The AVX512 results were measured on a single core of a Cascade Lake CPU, and the AVX2 results on a single core of a Kaby Lake CPU.
I have no objections to approving the PR. Nick's results show a slight improvement for his model. My results, below, show hardly any change. Inching forward.
| | master | nick's max_element | master | nick's max_element | master | nick's max_element |
| --- | --- | --- | --- | --- | --- | --- |
| Big Machine | 52.85 | 51.74 | 53 | 52.49 | 51.11 | 52.4 |
| +bin model | 154.16 | 154.2 | 150.89 | 153.32 | 150.77 | 155.23 |
| base+LSH | 135.02 | 137.37 | 135.27 | 136.12 | 133.43 | 135.39 |
| bin model+LSH | 52.79 | 53.34 | 53.08 | 52.58 | 52.89 | 54.04 |
| bin model, beam2 | 36.12 | 35.51 | | | | |
| bin model, beam3 | 40.43 | 40.73 | | | | |
| bin model, beam4 | 62.01 | 62.77 | | | | |
Description
Add an optimised max_element implementation as a special case of nth_element with n = 1.
Depending on the compiler used, this should speed up the max_element step of beam search by a factor of 2 to 10. A synthetic benchmark can be found here: https://github.com/XapaJIaMnu/maxelem_test A summary:
Cascade Lake results
Ryzen 9 5900HS results
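The core idea can be sketched as follows. With beam size 1, an n-best selection via nth_element degenerates to finding a single (index, score) maximum, which a plain linear scan does with far less work. This is an illustrative reduction, not the actual Marian code; the function name `max_element_1best` is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Beam-size-1 special case: instead of a generic nth_element-style
// n-best selection, track the single best score and its index in one
// linear pass over the output-layer scores.
std::pair<std::size_t, float> max_element_1best(const std::vector<float>& scores) {
  assert(!scores.empty());
  std::size_t bestIdx = 0;
  float bestVal = scores[0];
  for (std::size_t i = 1; i < scores.size(); ++i) {
    // For typical score distributions this branch is taken rarely,
    // which helps the compiler keep the loop tight.
    if (scores[i] > bestVal) {
      bestVal = scores[i];
      bestIdx = i;
    }
  }
  return {bestIdx, bestVal};
}
```

The larger the output layer (e.g. no shortlist), the more this single pass saves relative to the generic n-best path, which matches the benchmark pattern above.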
Added dependencies: none
How to test
Just load any model with the new code path and test it with a beam size of 1. In our testing this reduced end-to-end runtime by about 1%. I didn't run the full regression suite because the regression tests are currently broken.