llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

x86-domain-reassignment pass causes dreadful compile time slowdown when generating code for skylake-avx512 architecture #41517

Open llvmbot opened 5 years ago

llvmbot commented 5 years ago
Bugzilla Link 42172
Version trunk
OS Linux
Attachments the small case described above, the middle size case, the large size case
Reporter LLVM Bugzilla Contributor
CC @topperc,@RKSimon,@rotateright

Extended Description

At Unisys, we have an LLVM-based JIT that generates x86-64 code from instruction sequences of one of our historical architectures. We recently started testing on servers with the skylake-avx512 architecture and encountered a shocking compile-time slowdown when compiling for this target. Using pass timings, we isolated the problem to the x86-domain-reassignment pass: disabling this pass gives a large compile-time speedup. A scatter plot of 31K+ examples shows the slowdown growing strongly with the size of the IR, which suggests that an n-squared algorithm is the culprit. I provide three examples, a small one, a medium-sized one, and a large one:

i131323820_f000203011542_1.bc - 712 object code bytes
i141043225_f400004013147_1.bc - 40376 object code bytes
i140409520_f400004002276_1.bc - 129424 object code bytes

The opt+llc pipeline that I ran for these cases showed the following slowdowns when the x86-domain-reassignment pass is used (time with the pass disabled to time with the pass enabled):

small case: 0.013 secs to 0.017 secs
middle case: 1.043 secs to 7.981 secs
large case: 4.140 secs to 69.689 secs
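As a rough check on the n-squared suspicion, the three reported timings can be turned into slowdown factors and a growth exponent. This is a minimal sketch, not part of the original report: it takes the first time in each pair as the run with the pass disabled, treats object code size as a stand-in for IR size (an assumption), and the helper names are made up for illustration.

```python
import math

# (object code bytes, seconds with pass disabled, seconds with pass enabled),
# copied from the three cases reported in this issue.
cases = {
    "small": (712, 0.013, 0.017),
    "middle": (40376, 1.043, 7.981),
    "large": (129424, 4.140, 69.689),
}


def slowdown(fast, slow):
    """Ratio of the time with the pass enabled to the time with it disabled."""
    return slow / fast


def added_time(fast, slow):
    """Extra seconds attributable to enabling the pass."""
    return slow - fast


def growth_exponent(size_a, extra_a, size_b, extra_b):
    """Exponent k such that the extra time scales like size**k between two points."""
    return math.log(extra_b / extra_a) / math.log(size_b / size_a)


for name, (size, fast, slow) in cases.items():
    print(f"{name}: {slowdown(fast, slow):.1f}x slower, "
          f"{added_time(fast, slow):.3f} s added")

mid_size, mid_fast, mid_slow = cases["middle"]
big_size, big_fast, big_slow = cases["large"]
k = growth_exponent(mid_size, added_time(mid_fast, mid_slow),
                    big_size, added_time(big_fast, big_slow))
print(f"growth exponent between middle and large: {k:.2f}")
```

On these numbers the exponent between the middle and large cases comes out close to 2, which is what a quadratic pass would produce, although two data points and a size proxy are thin evidence on their own.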

The opt+llc pipeline that I used for these comparisons looks like this:

$LLVMPATH/opt -O3 -enable-tbaa -mcpu=skylake-avx512 $INFILE | $LLVMPATH/llc -O3 -enable-tbaa -filetype=obj -o=out.o -mcpu=skylake-avx512 -

where $INFILE is one of the 3 files listed above. These commands generated the slow times noted above. The fast times were obtained by adding the -disable-x86-domain-reassignment option to the llc command above.
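The comparison can be scripted along these lines. This is a sketch only: it uses the flags quoted in this report and nothing else, the function names are hypothetical, and wall-clock time for the whole pipe is just one way of measuring (the report does not say how its times were taken).

```python
import os
import subprocess
import time


def build_pipeline(llvm_path, infile, disable_pass=False):
    """Return the (opt, llc) argument lists for the pipeline quoted in the report."""
    opt_cmd = [os.path.join(llvm_path, "opt"), "-O3", "-enable-tbaa",
               "-mcpu=skylake-avx512", infile]
    llc_cmd = [os.path.join(llvm_path, "llc"), "-O3", "-enable-tbaa",
               "-filetype=obj", "-o=out.o", "-mcpu=skylake-avx512"]
    if disable_pass:
        # The option the report adds to llc to obtain the fast times.
        llc_cmd.append("-disable-x86-domain-reassignment")
    llc_cmd.append("-")  # llc reads the bitcode from opt on stdin
    return opt_cmd, llc_cmd


def time_pipeline(opt_cmd, llc_cmd):
    """Run `opt | llc` and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    opt = subprocess.Popen(opt_cmd, stdout=subprocess.PIPE)
    llc = subprocess.Popen(llc_cmd, stdin=opt.stdout)
    opt.stdout.close()  # let opt see a closed pipe if llc exits early
    llc_status = llc.wait()
    opt_status = opt.wait()
    if opt_status != 0 or llc_status != 0:
        raise RuntimeError(f"pipeline failed: opt={opt_status}, llc={llc_status}")
    return time.perf_counter() - start
```

Calling `time_pipeline(*build_pipeline(llvm_path, infile))` and then `time_pipeline(*build_pipeline(llvm_path, infile, disable_pass=True))` on the same bitcode file gives the slow and fast numbers for that file.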

I measured the object size for each of the 31K+ bitcode files that we used for this experiment, and saw no changes to the object code size with/without the -disable-x86-domain-reassignment option, so I'm presuming that this extra compile time gives us no benefit.

Please let me know if I can provide any additional assistance in resolving this problem.

llvmbot commented 5 years ago

These experiments were run against LLVM 7.0.0, but I see no changes to lib/Target/X86/X86DomainReassignment.cpp since then that would change this behavior, so I filed it as a bug against trunk. I searched the bug list for AVX512 problems and compile-time problems and didn't find any previous record of this one.
-Kevin

annamthomas commented 12 months ago

We found the same issue on some IRs, where llc on trunk takes about 18 seconds, 17 of which are spent in the domain reassignment pass. This is on icelake machines, where AVX-512 is used.

Looking at this pass, there are two O(n^2) algorithms: the first computes the closures, and the second goes over each closure checking whether it is profitable.

This does not scale well when we have large IRs. We have disabled this pass downstream, but is there any way this pass can be improved for compile time? Or at least maybe introduce a bailout?
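To make the bailout idea concrete, here is a toy model in Python rather than the pass's own C++. It is not the LLVM implementation: registers are plain strings, `edges` is a hypothetical symmetric map of which registers are linked through a common instruction, and `max_closure_size` is an invented threshold. Closures are still discovered in linear time, but any closure above the cap is dropped before the more expensive per-closure analysis would run.

```python
from collections import deque


def build_closures(edges, max_closure_size):
    """Group linked registers into closures and skip the oversized ones.

    edges: dict mapping each register to the registers it is linked with
           (assumed symmetric). Hypothetical input format.
    Returns (closures_to_analyze, number_of_closures_skipped).
    """
    visited = set()
    closures = []
    skipped = 0
    for start in edges:
        if start in visited:
            continue
        # Worklist traversal collecting everything reachable from `start`.
        closure = []
        worklist = deque([start])
        visited.add(start)
        while worklist:
            reg = worklist.popleft()
            closure.append(reg)
            for neighbor in edges.get(reg, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    worklist.append(neighbor)
        # Bailout: leave very large closures untouched instead of analyzing them.
        if len(closure) > max_closure_size:
            skipped += 1
        else:
            closures.append(closure)
    return closures, skipped


edges = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],  # one closure of three registers
    "x": ["y"], "y": ["x"],                   # one closure of two registers
}
kept, skipped = build_closures(edges, max_closure_size=2)
print(kept, skipped)
```

With a cap of 2, the three-register closure is skipped and only the two-register one is kept for analysis. Whether a cap on closure size, on function size, or on total work is the right knob for the real pass is an open question this sketch does not answer.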

goldsteinn commented 12 months ago

FWIW on modern machines like ICX domain switching penalties are mostly non-existent so the pass is also mostly gratuitous.

annamthomas commented 12 months ago

@topperc does it make sense to turn the pass off by default upstream? On older machines like skylake, avx-512 is mostly not recommended due to the downclocking issues, and IIUC from the above comment by @goldsteinn, on newer architectures this pass shouldn't have much of an impact.

topperc commented 12 months ago

> FWIW on modern machines like ICX domain switching penalties are mostly non-existent so the pass is also mostly gratuitous.

If I remember right this pass is trying to remove copies between K registers and GPRs by moving the computation to a consistent domain. I think primarily because all the intrinsics treat K registers as i8/i16/i32/i64 types. Are those copies free these days?

> @topperc does it make sense to turn the pass off by default upstream? On older machines like skylake, avx-512 is mostly not recommended due to the downclocking issues, and IIUC from the above comment by @goldsteinn, on newer architectures this pass shouldn't have much of an impact.

512-bit vectors are not recommended. But the 128-bit and 256-bit instructions in avx512vl and later are still useful.