marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Add optimised max_element implementation as a specific case of nth_element with beam = 1 #981

Open XapaJIaMnu opened 1 year ago

XapaJIaMnu commented 1 year ago

Description

Add optimised max_element implementation as a specific case of nth_element with n = 1

Depending on the compiler used, this should speed up the max_element step of beam search by a factor of 2 to 10. A synthetic benchmark can be found at https://github.com/XapaJIaMnu/maxelem_test. A summary (a sketch of the AVX variant is shown after the tables below):

Cascade Lake results

| | GCC 11.2 | clang 14 | icc 2022.1.0 |
| -- | -- | -- | -- |
| std::max_element | 2.6696s | 0.4221s | 0.4662s |
| sequential | 1.0831s | 1.1924s | 1.1472s |
| AVX512 max + max_reduce | 0.2412s | 0.2152s | 0.2142s |
| AVX512 max_reduce only | 0.2570s | 0.2629s | 0.2325s |
| AVX512 cmp_ps_mask | 0.1884s | 0.1826s | 0.1833s |
| AVX512 ^ + vectorized overhang | 0.2097s | 0.2089s | 0.2072s |
| AVX cmp_ps + movemask | 0.2181s | 0.1697s | 0.1702s |
| SSE cmplt_ps + movemask | 0.2692s | 0.2051s | 0.2221s |

Ryzen 9 5900HS results

| | GCC 11.2 | clang 14 |
| -- | -- | -- |
| std::max_element | 2.4101s | 0.7221s |
| sequential | 0.3252s | 0.3518s |
| AVX cmp_ps + movemask | 0.1476s | 0.1214s |
| SSE cmplt_ps + movemask | 0.1693s | 0.1468s |
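
For reference, here is a minimal sketch of the "cmp + movemask" idea the fastest rows above rely on. This is an illustrative reimplementation assuming the approach from the linked maxelem_test benchmark, not the exact code in this PR: keep the current maximum broadcast across a vector register, compare eight floats at a time against it, and only fall into a scalar update when the mask reports that a larger element appeared, which is rare.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <utility>

// Sketch of an AVX max_element: returns {maxValue, indexOfMax}. Assumes n >= 1.
std::pair<float, size_t> maxElementAvx(const float* data, size_t n) {
  float maxVal = data[0];
  size_t maxIdx = 0;
  __m256 maxBroadcast = _mm256_set1_ps(maxVal);

  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 block = _mm256_loadu_ps(data + i);
    __m256 gt = _mm256_cmp_ps(block, maxBroadcast, _CMP_GT_OQ);
    if (_mm256_movemask_ps(gt)) {               // rare: this block holds a new maximum
      for (size_t j = i; j < i + 8; ++j) {
        if (data[j] > maxVal) { maxVal = data[j]; maxIdx = j; }
      }
      maxBroadcast = _mm256_set1_ps(maxVal);
    }
  }
  for (; i < n; ++i) {                          // scalar overhang, left unvectorised
    if (data[i] > maxVal) { maxVal = data[i]; maxIdx = i; }
  }
  return {maxVal, maxIdx};
}
```

The hot loop is just a load, a compare and a mask test per eight elements; the branch is almost never taken on typical score distributions, which is presumably why the cmp/movemask variants beat the max + reduce ones in the tables.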

Added dependencies: none

How to test

Just load any model with the new code path and decode with a beam size of 1. In our testing this reduced runtime by about 1%. I didn't run all regression tests because something is currently broken in them.


XapaJIaMnu commented 1 year ago

Apologies for the late reply @hieuhoang, I finally have some benchmarks. I tested a tiny11.tied.w configuration from WMT a few years ago, on 15.5k WMT sentences from the last several years (avg BLEU 36.{2,4,7} depending on configuration), with and without a shortlist, using the upstream fbgemm, intgemm, and fp32 CPU backends. Here are the results:

tl;dr the beam1 code is 1-5 seconds faster depending on the test case. The bigger the output layer, the larger the difference.

| | Beam1Opt | Master |
| -- | -- | -- |
| AVX512, fbgemm, shortlist | 94.02s | 96.47s |
| AVX512, fbgemm, no shortlist | 182.80s | 188.05s |
| AVX2, fbgemm, shortlist | 114.26s | 115.75s |
| AVX2, fbgemm, no shortlist | 195.78s | 200.80s |
| AVX512, intgemm, shortlist | 106.20s | 108.71s |
| AVX512, intgemm, no shortlist | 194.08s | 200.36s |
| AVX2, intgemm, shortlist | 119.08s | 120.79s |
| AVX2, intgemm, no shortlist | 200.11s | 205.18s |
| AVX512, shortlist | 118.29s | 120.85s |
| AVX512, no shortlist | 209.82s | 216.63s |
| AVX2, shortlist | 135.16s | 136.96s |
| AVX2, no shortlist | 215.94s | 221.30s |

To download the models and test for yourself, please get this tarball: https://nbogoychev.com/files/speedtest.tar.gz. The AVX512 results were run on one Cascade Lake core, and the AVX2 results on one Kaby Lake core.

hieuhoang commented 1 year ago

I have no objections to approving the PR. Nick's results show a slight improvement for his model. My results, below, show hardly any change. Inching forward.


| | master | nick's max_element |
| -- | -- | -- |
| Big Machine | 52.85 | 51.74 |
| | 53 | 52.49 |
| | 51.11 | 52.4 |
| +bin model | 154.16 | 154.2 |
| | 150.89 | 153.32 |
| | 150.77 | 155.23 |
| base+LSH | 135.02 | 137.37 |
| | 135.27 | 136.12 |
| | 133.43 | 135.39 |
| bin model+LSH | 52.79 | 53.34 |
| | 53.08 | 52.58 |
| | 52.89 | 54.04 |
| bin model, beam2 | 36.12 | 35.51 |
| bin model, beam3 | 40.43 | 40.73 |
| bin model, beam4 | 62.01 | 62.77 |

XapaJIaMnu commented 1 year ago

I'd like to do some more tests first. At the very least, can you reproduce the results on my model? (I have packaged runme scripts; only the paths to the marian branches need to be adjusted.)

hieuhoang commented 1 year ago

I may do so when I have time. But your results are your own; I review PRs to protect the results on my models.

XapaJIaMnu commented 1 year ago

Extra findings: I tested on a proper student model and the conclusion is that my changes don't work when using LSH, but they do consistently offer better performance in all other cases. How different is the LSH output layer compared to a shortlisted output layer?

XapaJIaMnu commented 1 year ago

I am getting some misaligned memory accesses from the LSH scores, and those take more cycles to load; that could be one of the issues?

hieuhoang commented 1 year ago

LSH needs to be used with an output layer without a bias; the shortlist doesn't have that restriction.

I'm not sure where the misaligned memory is coming from. If the LSH params are set to alignment-friendly values, e.g. --output-approx-knn 128 1024, you should have aligned memory.

If max_element increases speed by 5% without LSH/shortlist, then I'm not surprised that there's no noticeable effect when using LSH/shortlist, since the vocab over which you need to find the max is much smaller.

XapaJIaMnu commented 1 year ago

I see improvements with the word-alignment-based shortlist, but not with the LSH-based shortlist, where I am consistently slower. I also get misaligned addresses only when using the LSH-based shortlist.

How big are your output layers typically? I used a 50 50 shortlist for my previous test.

I can't get an improvement with 100 1024 LSH. What settings do you use?

When it comes to alignment, I'd expect the array to be 256-aligned at the start; I don't care about the end, as I don't attempt to vectorise the overhang.
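
For concreteness, a minimal sketch of the kind of check involved, assuming "256" above means the 256-bit AVX register width, i.e. a 32-byte boundary; this helper is hypothetical, not code from this branch:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is aligned to a `bytes` boundary.
inline bool isAligned(const void* p, std::size_t bytes) {
  return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}

// Use the aligned 32-byte AVX load only when the buffer start allows it;
// otherwise fall back to the unaligned load, which costs extra cycles when
// the access straddles a cache line.
inline __m256 loadScores(const float* p) {
  return isAligned(p, 32) ? _mm256_load_ps(p) : _mm256_loadu_ps(p);
}
```

In practice the check would be done once per buffer rather than per load; the observation above is just that the LSH score buffer apparently does not always start on such a boundary, while the shortlisted output layer does.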