llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

AVX-512 mask registers spill to stack when GPRs are available #94025

Open embg opened 1 month ago

embg commented 1 month ago

I hit some cases where LLVM spills mask registers to the stack in memory-bound code, where extra loads and stores to the stack are quite expensive. Some of these spills may be avoidable through better scheduling, but I think improving the performance of the spills (when unavoidable) would also be useful.

I noticed that GCC spills these registers to GPRs -- could LLVM do the same?

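Here is a minimal example:

* clang spills mask registers to the stack: https://godbolt.org/z/e9nGb6z8s
* gcc spills mask registers to GPRs: https://godbolt.org/z/h46cdje7a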

Here is a more realistic scenario, where better scheduling could theoretically eliminate the spills (but doesn't): https://godbolt.org/z/Pz1dsrh53

For the realistic scenario, I also observed that GCC spills to GPRs, while LLVM spills to the stack.

I chatted offline with @MatzeB, who indicated that this would probably require a lot of work in the register allocator. I understand if the gains aren't large enough to justify this work. But I still thought it might be useful to share this data point with the community.
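
The Godbolt links remain the authoritative reproducers. As a rough illustration only, the sketch below shows the kind of code that can put pressure on the eight AVX-512 mask registers (`k0`-`k7`): it computes more comparison masks than there are mask registers, then consumes all of them in masked adds. This is not the code behind the links above. The function names, `kNumKeys`, and the weighting scheme are hypothetical, and whether a given compiler version actually spills here (or reschedules the compares to avoid spilling) is not guaranteed. The sketch assumes GCC or Clang on x86-64 and falls back to a scalar loop elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX512_SKETCH 1
#endif

// Hypothetical constant: more simultaneously live masks than the 8 k-registers.
constexpr int kNumKeys = 10;

// Scalar reference: each element adds (j + 1) for every key j it equals.
static int64_t weighted_matches_scalar(const int32_t* data, std::size_t n,
                                       const int32_t* keys) {
    int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (int j = 0; j < kNumKeys; ++j)
            if (data[i] == keys[j]) total += j + 1;
    return total;
}

#ifdef HAVE_AVX512_SKETCH
// AVX-512 variant. All kNumKeys masks are computed before any is used, so
// they are live at the same time unless the compiler reorders the compares.
// Lane sums are 32-bit, so this is only suitable for small inputs.
__attribute__((target("avx512f")))
static int64_t weighted_matches_avx512(const int32_t* data, std::size_t n,
                                       const int32_t* keys) {
    assert(n % 16 == 0);  // the sketch handles whole 16-lane vectors only
    __m512i acc = _mm512_setzero_si512();
    for (std::size_t i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512(data + i);
        __mmask16 m[kNumKeys];
        for (int j = 0; j < kNumKeys; ++j)
            m[j] = _mm512_cmpeq_epi32_mask(v, _mm512_set1_epi32(keys[j]));
        for (int j = 0; j < kNumKeys; ++j)
            acc = _mm512_mask_add_epi32(acc, m[j], acc,
                                        _mm512_set1_epi32(j + 1));
    }
    return _mm512_reduce_add_epi32(acc);
}
#endif

// Dispatcher: uses the AVX-512 path only when the CPU reports support.
static int64_t weighted_matches(const int32_t* data, std::size_t n,
                                const int32_t* keys) {
#ifdef HAVE_AVX512_SKETCH
    if (n % 16 == 0 && __builtin_cpu_supports("avx512f"))
        return weighted_matches_avx512(data, n, keys);
#endif
    return weighted_matches_scalar(data, n, keys);
}
```

When inspecting the generated assembly for code like this, a spill to the stack shows up as a mask move with a memory operand (for example `kmovw %k1, 12(%rsp)`), while a spill to a GPR uses a register operand (for example `kmovw %k1, %eax`).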

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-x86

Author: Elliot Gorokhovsky (embg)

I hit some cases where LLVM spills mask registers to the stack in memory-bound code, where extra loads and stores to the stack are quite expensive. Some of these spills may be avoidable through better scheduling, but I think improving the performance of the spills (when unavoidable) would also be useful.

I noticed that GCC spills these registers to GPRs -- could LLVM do the same?

Here is a minimal example:

* clang spills mask registers to the stack: https://godbolt.org/z/e9nGb6z8s
* gcc spills mask registers to GPRs: https://godbolt.org/z/h46cdje7a

Here is a more realistic scenario, where better scheduling could theoretically eliminate the spills (but doesn't): https://godbolt.org/z/Pz1dsrh53

For the realistic scenario, I also observed that GCC spills to GPRs, while LLVM spills to the stack.

I chatted offline with @MatzeB, who indicated that this would probably require a lot of work in the register allocator. I understand if the gains aren't large enough to justify this work. But I still thought it might be useful to share this data point with the community.