HackToday closed this 2 years ago
I ran a performance test comparing AVX2 and AVX512 using 4-byte elements, with the total element size varying from 8 to 120 in steps of 8.
The speedup of AVX512 over AVX2 ranges from 0.94x to 1.5x.
Even in the cases where AVX512 is not faster than AVX2, it keeps nearly the same performance. In summary, AVX512 can be a benefit on some modern platforms.
@jrs65 and @kiyo-masui Could you check whether enabling this feature is OK for this repo?
@jrs65 and @kiyo-masui please take a look in case this was missed.
Hi @HackToday. Sorry for the belated response, it's been a busy end to the semester for myself (and Kiyo too I imagine).
Thanks for putting this together, it's definitely appreciated. Your code looks good to me, but I need to find an AVX512 machine to run the tests on, as I don't think GitHub Actions uses any hosts that support AVX512.
Also, I'm intrigued if you have any benchmarks of this. How much does AVX512 support speed things up?
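For anyone looking for a suitable test host, a minimal sketch of a check on Linux is below. It parses the `flags` line of `/proc/cpuinfo`. The helper names are hypothetical, and the required flag set (`avx512f` plus `avx512bw`) is an assumption rather than something stated in this thread; adjust it to whatever the build actually gates on.

```python
# Assumption: the AVX512 code path needs these two CPU flags.
# This set is illustrative and is not taken from the PR itself.
REQUIRED_FLAGS = {"avx512f", "avx512bw"}


def cpu_flags(cpuinfo_text):
    """Return the CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # All cores normally report the same flags, so the first hit is enough.
            return set(value.split())
    return set()


def has_avx512(cpuinfo_text, required=REQUIRED_FLAGS):
    """True if every required flag is present in the given cpuinfo text."""
    return set(required) <= cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    print("AVX512 available:", has_avx512(text))
```

On non-Linux systems the file is missing, so the script simply reports `False`.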
Hi @jrs65, thanks for your reply.
On a system with AVX512 available, I tested with the PR changes and timed the following functions:
bshuf_trans_byte_elem_SSE
bshuf_trans_bit_byte_XXX (can be SSE, AVX, AVX512)
In the tests, the total element size varies from 8 to 120 in steps of 8 (8, 16, 24, 32, etc.; the x axis of Fig 1), with 4-byte elements. The y axis shows the AVX2 speedup vs. SSE and the AVX512 speedup vs. SSE.
The speedup of AVX512 over AVX2 ranges from 0.94x to 1.5x. Please see Fig 1.
Fig 1
Even in the cases where AVX512 is not faster than AVX2, it keeps nearly the same performance.
Please let me know if you need more info.
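To make the relationship between the ratios above explicit, here is a small sketch of the arithmetic. The function names are hypothetical and the timings are made-up placeholder values, not measurements from this PR. It shows that dividing the two "vs. SSE" speedups plotted in Fig 1 gives the AVX512 vs. AVX2 ratio.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Speedup of a candidate over a baseline; values above 1 mean faster."""
    return baseline_seconds / candidate_seconds


def relative_speedup(avx512_vs_sse, avx2_vs_sse):
    """AVX512 vs. AVX2 speedup, derived from the two 'vs. SSE' ratios."""
    return avx512_vs_sse / avx2_vs_sse


# Placeholder timings in seconds (illustrative only, not real results).
t_sse, t_avx2, t_avx512 = 3.0, 2.0, 1.5

avx2_vs_sse = speedup(t_sse, t_avx2)
avx512_vs_sse = speedup(t_sse, t_avx512)

# The SSE time cancels out, so this equals t_avx2 / t_avx512.
print(relative_speedup(avx512_vs_sse, avx2_vs_sse))
```

A value below 1, such as the 0.94x end of the reported range, means AVX512 was slightly slower than AVX2 for that size.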
@jrs65 I have added one more improvement (the untrans part within bitshuffle); it uses AVX512 in the same way as the trans part. 8-byte elements also see a similar improvement.
(With larger sizes the speedup ratio increases, reaching 1.5x.)
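As a rough illustration of what a trans/untrans pair does at the byte level, here is a scalar reference sketch. It is not the AVX512 code from this PR and it leaves out the bit-level transposes; the function names are hypothetical. It only shows the byte-element transpose and its inverse, and that the two round-trip for both 4-byte and 8-byte elements.

```python
def trans_byte_elem(data, elem_size):
    """Gather byte 0 of every element, then byte 1, and so on."""
    if elem_size <= 0 or len(data) % elem_size:
        raise ValueError("data length must be a multiple of elem_size")
    # data[k::elem_size] picks byte k out of each element in turn.
    return b"".join(data[k::elem_size] for k in range(elem_size))


def untrans_byte_elem(data, elem_size):
    """Inverse of trans_byte_elem: rebuild the original element order."""
    if elem_size <= 0 or len(data) % elem_size:
        raise ValueError("data length must be a multiple of elem_size")
    n_elem = len(data) // elem_size
    # Byte k of element i sits at offset k * n_elem + i in the transposed data.
    return b"".join(data[i::n_elem] for i in range(n_elem))


if __name__ == "__main__":
    for elem_size in (4, 8):
        original = bytes(range(elem_size * 6))  # six elements
        shuffled = trans_byte_elem(original, elem_size)
        assert untrans_byte_elem(shuffled, elem_size) == original
    print("round trip OK")
```

A scalar reference like this can also serve as a correctness check when comparing the output of different SIMD code paths.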
@jrs65 and @kiyo-masui pinging in case anything was missed. BTW, the CI workflows seem to need approval to run.
Hi @HackToday
Thanks for all your efforts here, and apologies for the slow responses. I've got the code built and running on one of my own machines (the cluster we use has some AVX512 nodes), and on the machine that you gave me access to elsewhere. Everything seems to run fine, and with a nice speed boost.
I'm going to merge your code in now. I'll wait a few weeks to cut a release (mostly as I'm going on vacation) but also so I can see about merging in a few other outstanding PRs.
Thanks @jrs65 for your time and help with the verification.
Signed-off-by: Wu, Kaiqiang kaiqiang.wu@intel.com Co-authored-by: vesslanjin jun.i.jin@intel.com