r372893 hangs when building the Linux kernel with CFI

samitolvanen commented 5 years ago


Bugzilla Link	43455
Resolution	LATER
Resolved on	Sep 27, 2019 15:29
Version	trunk
OS	Linux
CC	@bjope,@zmodem,@jgalenson,@nathanchance,@nickdesaulniers,@pcc,@rui314,@smithp35

Extended Description

Starting with r372893 ("[CodeGen] Replace -max-jump-table-size with -max-jump-table-targets"), LLD hangs when building the Linux kernel with CFI.

Steps to reproduce:

$ git clone https://github.com/samitolvanen/linux.git $ cd linux $ git checkout clang-cfi $ export PATH=/path/to/llvm/bin:$PATH $ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- CC=clang LD=ld.lld defconfig $ ./scripts/config -e LTO_CLANG -e THINLTO -e CFI_CLANG $ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- CC=clang LD=ld.lld -j110 (press enter a few times to accept default configs)

With r372892, this completes in ~3 minutes on my system. With r372893, LLD hangs when linking vmlinux.o (waited for >1h).

jgalenson commented 5 years ago

Do you think that you can attach the IR for the switch case, please?

I have a reduced testcase at https://gist.github.com/jgalenson/fda4b3168cb5d57dfb6a12ac771e3209. It's still too large to be made a regression test, but it should help you if you want to re-introduce this patch.

I can reproduce the issue by running:

gzip -d minimized.ll.gz llc minimized.ll -O2 -march arm64 -o minimized.out

Without this patch or with both it and my fix, this finishes in well under a minute for me. With just the patch, it was still running after five minutes.

I hope this helps if anyone keeps looking at this.

llvmbot commented 5 years ago

At this point, with the reversion of the patch, I consider this issue fixed.

samitolvanen commented 5 years ago

Sami, did you happen to get the function that this switch was contained in?

Joel confirmed that it's the compiler-generated __cfi_check function, which is a couple of megabytes of IR.

nickdesaulniers commented 5 years ago

OK, it looks like SwitchCG::SwitchLowering::findJumpTables is extremely slow with this patch. When compiling the kernel with CFI, it's called with a CaseClusterVector that has 8247 entries, which takes hours to complete.

Do you think that you can attach the IR for the switch case, please?

Sami, did you happen to get the function that this switch was contained in? In a debugger, for any given LLVM Instruction, you can call the .parent() method to get the Instruction's BasicBlock, then ->parent() again to get the BasicBlock's Function. (I'm not sure if inlining from LTO would obscure this at that point). If you need help, ping me off bug. Once we know the function, preprocess the source translation unit and attach it to the bug.

jgalenson commented 5 years ago

Fixed patch Here's a fixed version of my previous patch. Rewriting the code as I described earlier would still be cleaner (assuming it's correct).

If we end up re-merging the patch, please include something like this.

llvmbot commented 5 years ago

I tried to create a test case to reproduce this issue, but I couldn't notice any significant increase in compile time. It may have to do with the particular clustering of the kernel cases. Do you think that you can attach the IR for the switch case, please?

zmodem commented 5 years ago

True. The cleanest patch would I believe to remove the extra loop I added (and hence this removal) by reordering the existing j loop. Is that safe to do?

It's been a while since I was deeply into this code, but I don't think that works, I think it needs to start with a wide range and then consider narrower ones.

If not, we can convert this from a Set to a Map and decrement the count instead of removing an element.

Right, in effect using a multi-set (we'd also want to keep track of the total number of entries in it).

But at this point, I really wish we didn't need to add all this complexity (to something already complex) just for one or two specific sub-targets :-/

jgalenson commented 5 years ago

Created attachment 22582 [details] Patch that fixes building the Linux kernel

Attached is a proof-of-concept patch that fixes this bug for me. I removed the innermost loop from findJumpTables (specifically by removing its call to getJumpTableStats) by reusing earlier computations of the number of cases and targets. This should bring the runtime from O(N^3) to O(N^2). It could probably be improved, for example removing the extra loop I had to add by reordering the existing j loop order, but I wasn't sure how necessary the existing ordering was.

I don't think the computation of the number of unique targets is correct in your patch though. It starts off with Targets containing all unique targets in Clusters[i..N-1]. Then in the inner loop, you remove Clusters[j].MBB from the set. However, that block could also have been the target of another cluster that is still within i..j-1.

True. The cleanest patch would I believe to remove the extra loop I added (and hence this removal) by reordering the existing j loop. Is that safe to do? If not, we can convert this from a Set to a Map and decrement the count instead of removing an element.

zmodem commented 5 years ago

Created attachment 22582 [details] Patch that fixes building the Linux kernel

Attached is a proof-of-concept patch that fixes this bug for me. I removed the innermost loop from findJumpTables (specifically by removing its call to getJumpTableStats) by reusing earlier computations of the number of cases and targets. This should bring the runtime from O(N^3) to O(N^2). It could probably be improved, for example removing the extra loop I had to add by reordering the existing j loop order, but I wasn't sure how necessary the existing ordering was.

I don't think the computation of the number of unique targets is correct in your patch though. It starts off with Targets containing all unique targets in Clusters[i..N-1]. Then in the inner loop, you remove Clusters[j].MBB from the set. However, that block could also have been the target of another cluster that is still within i..j-1.

It would be great if we could get a stand-alone reproducer for this (a single clang or llc invocation), but in the meantime, I've reverted in r373060.

rui314 commented 5 years ago

As long as check-all passes, it is OK to consider that that doesn't regress. Could you add a new testcase that demonstrate the problem and upstream it?

jgalenson commented 5 years ago

Patch that fixes building the Linux kernel Attached is a proof-of-concept patch that fixes this bug for me. I removed the innermost loop from findJumpTables (specifically by removing its call to getJumpTableStats) by reusing earlier computations of the number of cases and targets. This should bring the runtime from O(N^3) to O(N^2). It could probably be improved, for example removing the extra loop I had to add by reordering the existing j loop order, but I wasn't sure how necessary the existing ordering was.

Note that I only tested this by using it to compile the Linux kernel and by running the check-all tests.

Feel free to improve this if you want, or I can submit it upstream.

llvmbot commented 5 years ago

Thanks for this information. I'll investigate it.

samitolvanen commented 5 years ago

OK, it looks like SwitchCG::SwitchLowering::findJumpTables is extremely slow with this patch. When compiling the kernel with CFI, it's called with a CaseClusterVector that has 8247 entries, which takes hours to complete.

samitolvanen commented 5 years ago

This patch is not even active for targets other than Exynos processors.

We have confirmed that reverting this patch fixes the issue.

It's also irrelevant to linking, though using LTO the compiler is invoked at link time again.

I agree, we are running into this issue at link time due to LTO. Note that we can only reproduce the problem when CFI is also enabled. LTO alone without CFI works fine.

I'm trying to reproduce this issue, but I'm tripping at missing requirements.

I'd be happy to help you sort this out. If you are using a Debian based distro, you'll also need Aarch64 binutils to compile the kernel. On my system, installing the gcc-aarch64-linux-gnu package pulls in all the necessary dependencies.

Can you please reduce it to a simple test case?

I'll update the bug if we are able to reproduce the issue with a smaller test case, but so far, we are only seeing this when compiling the kernel.

llvmbot commented 5 years ago

This patch is not even active for targets other than Exynos processors. It's also irrelevant to linking, though using LTO the compiler is invoked at link time again.

I'm trying to reproduce this issue, but I'm tripping at missing requirements. Can you please reduce it to a simple test case? If you run the linker under the debugger, you might be able to find out where it's stuck, the compiler command line and its input files.

Thank you.

samitolvanen commented 5 years ago

What is the lld version?

LLD was built from the same monorepo.

According to a report here, with r372893, it now takes more than two hours to link the kernel:

https://github.com/ClangBuiltLinux/linux/issues/714

rui314 commented 5 years ago

What is the lld version?

llvm / llvm-project

r372893 hangs when building the Linux kernel with CFI #42800

Extended Description