Closed Congyuwang closed 1 year ago
Hi Congyu,
Thanks for your advice and detailed explanation! We will consider adding the compiler options to our documentation and programs later, since we are not yet familiar with the details.
0.5 extra credits for such good advice!
Another 1.0 extra credit for your PR #48.
Wow. Thanks for this pleasant surprise! 🤗
Well. This perhaps makes the generated assembly cleaner and easier to understand (when analyzing perf), but the added flags might be confusing.
Yes, it seems that there is a trade-off between the number of instructions and the complexity or extra effort spent on programming and compiling. We need to investigate further before making a final decision.
Currently, for unaligned mm256 loads and stores, gcc generates two 128-bit instructions for each load and store. This is suboptimal on most newer Intel chips. See: https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd
For example in project2 SIMD:
generates
which can be just a single instruction
I suggest adding -mno-avx256-split-unaligned-load and -mno-avx256-split-unaligned-store in addition to -mavx2.
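To make the suggestion easier to try, here is a minimal sketch of a kernel that uses unaligned 256-bit loads and stores. It is not the project2 code: the names `add_floats` and `add_avx` are made up for illustration, and the runtime check plus scalar fallback are only there so the snippet also builds and runs on machines without AVX.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical kernel (not from project2): element-wise add using
   unaligned 256-bit loads and stores, 8 floats per iteration. */
__attribute__((target("avx")))
static void add_avx(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   /* unaligned load  */
        __m256 vb = _mm256_loadu_ps(b + i);   /* unaligned load  */
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb)); /* unaligned store */
    }
    for (; i < n; ++i)                        /* scalar tail */
        out[i] = a[i] + b[i];
}
#endif

/* Hypothetical entry point: uses the AVX kernel when the CPU supports it,
   otherwise falls back to a plain scalar loop. */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {
        add_avx(a, b, out, n);
        return;
    }
#endif
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Saving this as, say, add.c (file name is an assumption) and comparing the output of `gcc -O2 -mavx2 -S add.c` with `gcc -O2 -mavx2 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store -S add.c` should show whether your toolchain splits the accesses. Depending on the gcc version and the -mtune setting, the first command may emit a 128-bit move plus a vinsertf128/vextractf128 pair for each load or store, while the second should use a single 256-bit vmovups.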