gcc7 split AVX2 load and store

tonyyxliu / CUHKSZ-CSC4005

Project Materials for CUHK(SZ) Course CSC4005: Parallel Programming

MIT License


gcc7 split AVX2 load and store #49

Closed Congyuwang closed 1 year ago

Congyuwang commented 1 year ago

Currently, for unaligned mm256 loads and stores, GCC generates two 128-bit instructions for each load and store. This is suboptimal on most newer Intel chips. See: https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd

For example in project2 SIMD:

__m256i m2_k8 = _mm256_loadu_si256(m2_k_j);

generates

vmovdqu (%rsi,%rdx,4),%xmm0                    │   load matrix2[k][j..] 0..4
vinserti128 $0x1,0x10(%rsi,%rdx,4),%ymm0,%ymm0 │   load matrix2[k][j..] 4..8

which can be just a single instruction

vmovdqu (%rsi,%rdx,4),%ymm0

I suggest adding -mno-avx256-split-unaligned-load and -mno-avx256-split-unaligned-store in addition to -mavx2.
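To check the effect on your own machine, here is a minimal sketch. The file name `load_store.cpp` and the functions `add8` and `add8_avx2` are made up for illustration and are not part of the project code. It does one unaligned 256-bit load per input and one unaligned 256-bit store, so the generated assembly is easy to compare with and without the two flags. I would expect GCC 7 to emit a single `vmovdqu` with a `%ymm` register for each load and store once the flags are added, but please verify on your own toolchain.

```cpp
// Hypothetical file: load_store.cpp (illustration only, not project code).
// One way to compare the generated assembly:
//   g++ -O2 -mavx2 -S load_store.cpp -o split.s
//   g++ -O2 -mavx2 -mno-avx256-split-unaligned-load \
//       -mno-avx256-split-unaligned-store -S load_store.cpp -o nosplit.s
#include <immintrin.h>
#include <cstdint>

// Adds 8 packed 32-bit integers from a and b and writes the sums to out.
// None of the pointers has to be 32-byte aligned.
__attribute__((target("avx2")))
void add8_avx2(const int32_t* a, const int32_t* b, int32_t* out) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
                        _mm256_add_epi32(va, vb));
}

// Dispatches to the AVX2 version when the CPU supports it and falls back
// to a scalar loop otherwise, so the sketch also runs on older machines.
void add8(const int32_t* a, const int32_t* b, int32_t* out) {
    if (__builtin_cpu_supports("avx2")) {
        add8_avx2(a, b, out);
    } else {
        for (int i = 0; i < 8; ++i) {
            out[i] = a[i] + b[i];
        }
    }
}
```

Diffing `split.s` against `nosplit.s` and looking at `add8_avx2` should show whether the `vinserti128` / `vextracti128` pairs are gone.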

Congyuwang commented 1 year ago

Well, this would perhaps make the generated assembly cleaner and easier to understand (when analyzing perf). But the added flags might be confusing.

tonyyxliu commented 1 year ago

Hi Congyu,

Thanks for your advice and detailed explanation! We will consider adding these compiler options to our documentation and programs later, since we are not yet familiar with the details.

0.5 extra credit for such good advice!

tonyyxliu commented 1 year ago

Another 1.0 extra credit for your PR #48.

Congyuwang commented 1 year ago

Wow. Thanks for this pleasant surprise! 🤗

tonyyxliu commented 1 year ago

> Well, this would perhaps make the generated assembly cleaner and easier to understand (when analyzing perf). But the added flags might be confusing.

Yes, it seems there is a trade-off between the number of instructions and the complexity or extra effort spent on programming and compiling. We need to investigate further before making a final decision.