marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

Add optimised max_element implementation as a specific case of nth_element with beam = 1 #981

Open XapaJIaMnu opened 1 year ago

XapaJIaMnu commented 1 year ago

Description

Add optimised max_element implementation as a specific case of nth_element with n = 1

Depending on the compiler used, this should speed up the max_element step of beam search by a factor of 2 to 10. A synthetic benchmark can be found at https://github.com/XapaJIaMnu/maxelem_test. A summary (a sketch of the AVX variant is shown after the tables below):

Cascade Lake results

| | GCC 11.2 | clang 14 | icc 2022.1.0 |
| -- | -- | -- | -- |
| std::max_element | 2.6696s | 0.4221s | 0.4662s |
| sequential | 1.0831s | 1.1924s | 1.1472s |
| AVX512 max + max_reduce | 0.2412s | 0.2152s | 0.2142s |
| AVX512 max_reduce only | 0.2570s | 0.2629s | 0.2325s |
| AVX512 cmp_ps_mask | 0.1884s | 0.1826s | 0.1833s |
| AVX512 ^ + vectorized overhang | 0.2097s | 0.2089s | 0.2072s |
| AVX cmp_ps + movemask | 0.2181s | 0.1697s | 0.1702s |
| SSE cmplt_ps + movemask | 0.2692s | 0.2051s | 0.2221s |

Ryzen 9 5900HS results

| | GCC 11.2 | clang 14 |
| -- | -- | -- |
| std::max_element | 2.4101s | 0.7221s |
| sequential | 0.3252s | 0.3518s |
| AVX cmp_ps + movemask | 0.1476s | 0.1214s |
| SSE cmplt_ps + movemask | 0.1693s | 0.1468s |
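
For reference, here is a minimal sketch of the "cmp + movemask" idea the fastest rows above rely on. This is an illustrative reimplementation assuming the approach from the linked maxelem_test benchmark, not the exact code in this PR: keep the current maximum broadcast across a vector register, compare eight floats at a time against it, and only fall into a scalar update when the mask reports that a larger element appeared, which is rare.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <utility>

// Sketch of an AVX max_element: returns {maxValue, indexOfMax}. Assumes n >= 1.
std::pair<float, size_t> maxElementAvx(const float* data, size_t n) {
  float maxVal = data[0];
  size_t maxIdx = 0;
  __m256 maxBroadcast = _mm256_set1_ps(maxVal);

  size_t i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 block = _mm256_loadu_ps(data + i);
    __m256 gt = _mm256_cmp_ps(block, maxBroadcast, _CMP_GT_OQ);
    if (_mm256_movemask_ps(gt)) {               // rare: this block holds a new maximum
      for (size_t j = i; j < i + 8; ++j) {
        if (data[j] > maxVal) { maxVal = data[j]; maxIdx = j; }
      }
      maxBroadcast = _mm256_set1_ps(maxVal);
    }
  }
  for (; i < n; ++i) {                          // scalar overhang, left unvectorised
    if (data[i] > maxVal) { maxVal = data[i]; maxIdx = i; }
  }
  return {maxVal, maxIdx};
}
```

The hot loop is just a load, a compare and a mask test per eight elements; the branch is almost never taken on typical score distributions, which is presumably why the cmp/movemask variants beat the max + reduce ones in the tables.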

Added dependencies: none

How to test

Just load any model with the new code path and decode with a beam size of 1. In our testing this reduced runtime by about 1%. I didn't run all regression tests because something is currently broken in them.


XapaJIaMnu commented 1 year ago

Apologies for the late reply @hieuhoang, I finally have some benchmarks. I tested a tiny11.tied.w configuration from WMT a few years ago, on 15.5k WMT sentences from the last several years (avg BLEU 36.{2,4,7} depending on configuration), with and without a shortlist, using the upstream fbgemm, intgemm, and fp32 CPU backends. Here are the results:

tl;dr the beam1 code is 1-5 seconds faster depending on the test case. The bigger the output layer, the larger the difference.

| | Beam1Opt | Master |
| -- | -- | -- |
| AVX512, fbgemm, shortlist | 94.02s | 96.47s |
| AVX512, fbgemm, no shortlist | 182.80s | 188.05s |
| AVX2, fbgemm, shortlist | 114.26s | 115.75s |
| AVX2, fbgemm, no shortlist | 195.78s | 200.80s |
| AVX512, intgemm, shortlist | 106.20s | 108.71s |
| AVX512, intgemm, no shortlist | 194.08s | 200.36s |
| AVX2, intgemm, shortlist | 119.08s | 120.79s |
| AVX2, intgemm, no shortlist | 200.11s | 205.18s |
| AVX512, shortlist | 118.29s | 120.85s |
| AVX512, no shortlist | 209.82s | 216.63s |
| AVX2, shortlist | 135.16s | 136.96s |
| AVX2, no shortlist | 215.94s | 221.30s |

To download the models and test for yourself, please get this tarball: https://nbogoychev.com/files/speedtest.tar.gz. The AVX512 results were run on one Cascade Lake core, and the AVX2 results on one Kaby Lake core.

hieuhoang commented 1 year ago

I have no objections to approving the PR. Nick's results show a slight improvement for his model. My results, below, show hardly any change. Inching forward.


| | master | nick's max_element |
| -- | -- | -- |
| Big Machine | 52.85 | 51.74 |
| | 53 | 52.49 |
| | 51.11 | 52.4 |
| +bin model | 154.16 | 154.2 |
| | 150.89 | 153.32 |
| | 150.77 | 155.23 |
| base+LSH | 135.02 | 137.37 |
| | 135.27 | 136.12 |
| | 133.43 | 135.39 |
| bin model+LSH | 52.79 | 53.34 |
| | 53.08 | 52.58 |
| | 52.89 | 54.04 |
| bin model, beam2 | 36.12 | 35.51 |
| bin model, beam3 | 40.43 | 40.73 |
| bin model, beam4 | 62.01 | 62.77 |

XapaJIaMnu commented 1 year ago

I'd like to do some more tests first. At the very least, can you reproduce the results on my model? (I have packaged runme scripts; only the paths to the marian branches need to be adjusted.)

hieuhoang commented 1 year ago

I may do so when I have time. But your results are your own; I review PRs to protect the results on my models.

XapaJIaMnu commented 1 year ago

Extra findings: I tested on a proper student model and the conclusion is that my changes don't work when using LSH, but they do consistently offer better performance in all other cases. How different is the LSH output layer compared to a shortlisted output layer?

XapaJIaMnu commented 1 year ago

I am getting some misaligned memory accesses from the LSH scores, and those take more cycles to load; that could be one of the issues?

hieuhoang commented 1 year ago

LSH needs to be used with an output layer without a bias; the shortlist doesn't have that restriction.

I'm not sure where the misaligned memory is coming from. If the LSH params are set to alignment-friendly values, e.g. --output-approx-knn 128 1024, you should have aligned memory.

If max_element increases speed by 5% without LSH/shortlist, then I'm not surprised that there's no noticeable effect when using LSH/shortlist, since the vocab over which you need to find the max is much smaller.

XapaJIaMnu commented 1 year ago

I see improvements with the word-alignment-based shortlist, but not with the LSH-based shortlist, where I am consistently slower. I also get misaligned addresses only when using the LSH-based shortlist.

How big are your output layers typically? I used a 50 50 shortlist for my previous test.

I can't get an improvement with 100 1024 LSH. What settings do you use?

When it comes to alignment, I'd expect the array to be 256-aligned at the start; I don't care about the end, as I don't attempt to vectorise the overhang.
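
For concreteness, a minimal sketch of the kind of check involved, assuming "256" above means the 256-bit AVX register width, i.e. a 32-byte boundary; this helper is hypothetical, not code from this branch:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is aligned to a `bytes` boundary.
inline bool isAligned(const void* p, std::size_t bytes) {
  return reinterpret_cast<std::uintptr_t>(p) % bytes == 0;
}

// Use the aligned 32-byte AVX load only when the buffer start allows it;
// otherwise fall back to the unaligned load, which costs extra cycles when
// the access straddles a cache line.
inline __m256 loadScores(const float* p) {
  return isAligned(p, 32) ? _mm256_load_ps(p) : _mm256_loadu_ps(p);
}
```

In practice the check would be done once per buffer rather than per load; the observation above is just that the LSH score buffer apparently does not always start on such a boundary, while the shortlisted output layer does.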