Improve trans and untrans with AVX512

kiyo-masui / bitshuffle

Filter for improving compression of typed binary data.


Improve trans and untrans with AVX512 #117

Closed HackToday closed 2 years ago

HackToday commented 2 years ago

Signed-off-by: Wu, Kaiqiang <kaiqiang.wu@intel.com>
Co-authored-by: vesslanjin <jun.i.jin@intel.com>

HackToday commented 2 years ago

I ran performance tests comparing AVX2 and AVX512, using 4-byte elements with the element size varying from 8 to 120 in steps of 8. The speedup ratio ranges from 0.94x to 1.5x.
Even in the cases where AVX512 is not better than AVX2, it keeps nearly the same performance. In summary, AVX512 can be a benefit on some modern platforms.

HackToday commented 2 years ago

@jrs65 and @kiyo-masui Could you check whether enabling this feature is OK for this repo?

HackToday commented 2 years ago

@jrs65 and @kiyo-masui Please take a look, in case this was missed.

jrs65 commented 2 years ago

Hi @HackToday. Sorry for the belated response, it's been a busy end to the semester for myself (and Kiyo too I imagine).

Thanks for putting this together, it's definitely appreciated. Your code looks good to me, but I need to look around for an AVX512 machine to run the tests on, as I think GitHub Actions doesn't use any hosts that support AVX512.

Also, I'm intrigued if you have any benchmarks of this. How much does AVX512 support speed things up?
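Whether a given machine can run the AVX512 path at all can be checked before running the tests. The following is a minimal sketch, not code from this PR: it assumes a GCC or Clang compiler on x86 and uses the `__builtin_cpu_supports` builtin. The helper name `have_avx512` is hypothetical, and the choice of testing both AVX512F and AVX512BW is an assumption about which extensions the new code paths need.

```c
#include <stdio.h>

/* Hypothetical helper (not part of bitshuffle): returns 1 if the running
 * CPU reports both AVX512F and AVX512BW, and 0 otherwise.
 * Assumption: GCC or Clang on x86; other toolchains fall through to 0. */
static int have_avx512(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* populate the CPU feature data before querying */
    return __builtin_cpu_supports("avx512f") &&
           __builtin_cpu_supports("avx512bw");
#else
    return 0;
#endif
}

/* Print a one-line report; handy at the top of a test or benchmark run. */
static void report_avx512(void) {
    printf("AVX512 (F+BW) available: %s\n", have_avx512() ? "yes" : "no");
}
```

Calling `report_avx512()` on a CI host would show directly whether the AVX512 tests can exercise the new code there.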

HackToday commented 2 years ago

Hi @jrs65, thanks for your reply.

On a system with AVX512 available, I tested with the PR changes applied, measuring the following functions:

bshuf_trans_byte_elem_SSE
bshuf_trans_bit_byte_XXX (can be SSE, AVX, AVX512)

In the tests, the total element size varies from 8 to 120 in steps of 8 (8, 16, 24, 32, and so on; this is the x axis of Fig 1), with 4-byte elements. The y axis shows the speedup of AVX2 over SSE and of AVX512 over SSE.

The speedup ratio of AVX512 over AVX2 ranges from 0.94x to 1.5x. Please see Fig 1.

Fig 1

Even in the cases where AVX512 is not better than AVX2, it keeps nearly the same performance.

Please let me know if you need more info.
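For readers unfamiliar with what the `bshuf_trans_bit_byte_XXX` variants compute, the bit-level transpose can be sketched in scalar C. This is a simplified reference under stated assumptions, not the library's implementation and not the AVX512 code from this PR: the function name `trans_bit_byte_ref` is hypothetical, and the bit ordering (first input byte lands in the lowest bit of each output byte) is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar reference (not bitshuffle's code).
 * Treats the input as nbyte rows of 8 bits and transposes it: output row k
 * (nbyte / 8 bytes long) holds bit k of every input byte, packed 8 bits per
 * output byte. Assumed ordering: input byte i goes to bit (i % 8) of output
 * byte (i / 8) within its row.
 * Returns 0 on success, -1 if nbyte is not a multiple of 8. */
static int trans_bit_byte_ref(const uint8_t *in, uint8_t *out, size_t nbyte) {
    if (nbyte % 8 != 0) {
        return -1;
    }
    size_t row = nbyte / 8;  /* bytes per output row */
    memset(out, 0, nbyte);
    for (size_t i = 0; i < nbyte; i++) {
        for (int k = 0; k < 8; k++) {
            uint8_t bit = (uint8_t)((in[i] >> k) & 1u);
            out[(size_t)k * row + i / 8] |= (uint8_t)(bit << (i % 8));
        }
    }
    return 0;
}
```

A SIMD version does the same work many bytes at a time, which is why wider registers can help as the input grows.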

HackToday commented 2 years ago

@jrs65 One more improvement has been added (the untrans part within bitshuffle); it uses AVX512 in the same way as trans. 8-byte elements also show the following improvement.

(With larger sizes the speedup ratio increases, reaching up to 1.5x.)

HackToday commented 2 years ago

@jrs65 and @kiyo-masui Pinging in case anything was missed. BTW, the CI workflows seem to need approval to run.

jrs65 commented 2 years ago

Hi @HackToday

Thanks for all your efforts here, and apologies for the slow responses. I've got the code built and running on one of my own machines (the cluster we use has some AVX512 nodes), and on the machine that you gave me access to elsewhere. Everything seems to run fine, and with a nice speed boost.

I'm going to merge your code in now. I'll wait a few weeks to cut a release (mostly as I'm going on vacation) but also so I can see about merging in a few other outstanding PRs.

HackToday commented 2 years ago

Thanks @jrs65 for your time and for your help with the verification.