I hit some cases where LLVM spills mask registers to the stack in memory-bound code, where extra loads and stores to the stack are quite expensive. Some of these spills may be avoidable through better scheduling, but I think improving the performance of the spills (when unavoidable) would also be useful.
I noticed that GCC spills these registers to GPRs -- could LLVM do the same?
Here is a minimal example:
* Clang spills mask registers to the stack: https://godbolt.org/z/e9nGb6z8s
* GCC spills mask registers to GPRs: https://godbolt.org/z/h46cdje7a
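To make concrete what "spilling to a GPR" means, here is a rough source-level sketch. It is not taken from the linked examples. It assumes the mask registers in question are the AVX-512 `k` registers, and the function names (`count_below`, `count_below_avx512`, `count_below_scalar`) are hypothetical. The `_cvtmask16_u32` / `_cvtu32_mask16` pair moves a mask into an integer register and back, which is roughly the move a register allocator would insert instead of a store/load to a stack slot.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical illustration: round-trip a compare mask through a 32-bit
 * integer. This mirrors a mask "spill" to a GPR (a KMOV-style move)
 * rather than a store to the stack. */
__attribute__((target("avx512f")))
static uint32_t count_below_avx512(const int32_t *v, int32_t limit) {
    __m512i data = _mm512_loadu_si512((const void *)v);
    __mmask16 m = _mm512_cmplt_epi32_mask(data, _mm512_set1_epi32(limit));

    uint32_t parked = _cvtmask16_u32(m);        /* mask register -> GPR */
    __mmask16 restored = _cvtu32_mask16(parked); /* GPR -> mask register */

    /* Count the lanes whose element is below the limit. */
    return (uint32_t)__builtin_popcount(_cvtmask16_u32(restored));
}

/* Scalar fallback so the sketch also runs on CPUs without AVX-512. */
static uint32_t count_below_scalar(const int32_t *v, int32_t limit) {
    uint32_t n = 0;
    for (int i = 0; i < 16; ++i)
        n += (uint32_t)(v[i] < limit);
    return n;
}

/* Expects exactly 16 elements in v. */
uint32_t count_below(const int32_t *v, int32_t limit) {
    if (__builtin_cpu_supports("avx512f"))
        return count_below_avx512(v, limit);
    return count_below_scalar(v, limit);
}
```

With optimization enabled, the compiler will most likely fold the round trip away, so this only shows the shape of the move and does not reproduce the spill behavior in the links above.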
Here is a more realistic scenario, where better scheduling could theoretically eliminate the spills (but doesn't): https://godbolt.org/z/Pz1dsrh53
For the realistic scenario, I also observed that GCC spills to GPRs, while LLVM spills to the stack.
I chatted offline with @MatzeB, who indicated that this would probably require a lot of work in the register allocator. I understand if the gains aren't large enough to justify this work. But I still thought it might be useful to share this data point with the community.