Closed jcai19 closed 4 years ago
Cherry-picked to 11.x as 410b0dc84bbdafabe3a2c3eedd96e50340a6e0d0 and 6250d4944539f67d6a605928e97c087fe306a79e. Thanks everyone!
Manoj notes that the fix will depend on
commit a0385bd7acd6 ("[llvm] Add contains(KeyType) -> bool methods to SmallPtrSet")
which doesn't look like it's in release/11.x to me.
So we'll likely want to cherry pick the following 2 patches to close this out:
1: commit a0385bd7acd6 ("[llvm] Add contains(KeyType) -> bool methods to SmallPtrSet")
2: commit f7a53d82c090 ("#47468 : Fix findPHICopyInsertPoint, so that copies aren't incorrectly inserted after an INLINEASM_BR.")
Manoj reports the CrOS tests look good:
Nathan reports clang-11+patch and ToT+patch look good:
tot: https://paste.myself5.de/raw/aciqosofos clang-11: https://del.dog/raw/xapegagywo (tot showed one failure, but that's due to a missed cherry pick for an unrelated breakage upstream in kernel sources)
The fixed was merged on master in: https://reviews.llvm.org/rGf7a53d82c0902147909f28a9295a9d00b4b27d38
Let's get this cherry-picked over to release/11.x and close this out. Hans?
Fix committed to master as f7a53d82c0902147909f28a9295a9d00b4b27d38, which cherry-picks cleanly to 11.x.
@Hans: shall I push to 11.x?
All of my kernel build+boot tests with release/11.x + https://reviews.llvm.org/D87865 passed.
All of my tests with this on ToT LLVM look good. Going to rerun them again with this on top of release/11.x.
The verification I've done so far is limited to one test device. Will start a more complete test with the patch.
I verified https://reviews.llvm.org/D87865 fixed the issue locally.
Thanks!
Jian and Manoj are testing this on CrOS kernels. Nathan is testing this on various linux kernel distro's configs. I'm testing this with kernel patches that use asm goto w/ outputs, and our 65 various kernel tree and configs: https://travis-ci.com/github/ClangBuiltLinux/continuous-integration.
It will take me at least an hour and a half to two hours to get through all of those.
Hans, I suspect we can have the patch ready and well qualified by tonight (US time, maybe next 6-8 hours).
I've got the "Slightly better fix" posted for review now: https://reviews.llvm.org/D87865
Please verify this works for you as well, thanks!
Thanks for the fix. I've verified locally it fixed the issue in a test device.
The bug is in llvm::findPHICopyInsertPoint.
We're correctly hitting the special case for placement in a block with asm-goto or invoke. Then, the code places the copy after the last use or def of the register. Unfortunately, placing after the last use is wrong -- inlineasm_br uses the register, in this example. Yet, the whole point of the special-case is to find a place to insert the copy before that instruction.
I believe just placing the copy after last DEF, not after the last use or def, will be a correct fix. E.g.:
--- a/llvm/lib/CodeGen/PHIEliminationUtils.cpp +++ b/llvm/lib/CodeGen/PHIEliminationUtils.cpp @@ -34,7 +34,7 @@ llvm::findPHICopyInsertPoint(MachineBasicBlock MBB, MachineBasicBlock SuccMBB, // Discover any defs/uses in this basic block. SmallPtrSet<MachineInstr*, 8> DefUsesInMBB; MachineRegisterInfo& MRI = MBB->getParent()->getRegInfo();
However, that will have the effect of placing the copy earlier than necessary in the block, which might increase register pressure. I'm going to try to make a slightly better fix.
fwiw, dumping the output of llc -O2 -print-before-all repro.ll -o - to a file and comparing that before vs after the changes in question, things start to go visibly wrong after
Eliminate PHI nodes for register allocation
pass. We have:
(so we were doing mov then inline asm blob; after we have those reversed, which is bad)
repro.ll smaller reproducer
$ llc -O2 repro.ll
Specifically: .Ltmp1: lock decl 192(%r14) je .Ltmp0
movq %r14, %rbx
It seems like the movq %r14, %rbx
would occur before the lock
'd decl
unconditionally prior to the patches in question in clang-11.
free_user_ns__user_namespace.ll Attaching reproducer extracted via:
$ llvm-extract --func=free_user_ns user_namespace.ll -o free_user_ns__user_namespace.ll
It looks like a BB terminated with callbr has an indirect target back to itself.
Will work on reducing this further via bugpoint or such.
Comparing the disassemblies between:
I can see a few differences, which look pretty obviously wrong to me, for example the relatively small fn free_user_ns
:
example:
- 2ab: 4c 89 f3 movq %r14, %rbx
- 2ae: f0 lock
- 2af: 41 ff 8e c0 00 00 00 decl 192(%r14)
- 2b6: 74 d2 je 0x28a <free_user_ns+0xa>
+ 2ab: f0 lock
+ 2ac: 41 ff 8e c0 00 00 00 decl 192(%r14)
+ 2b3: 74 d5 je 0x28a <free_user_ns+0xa>
+ 2b5: 4c 89 f3 movq %r14, %rbx
so the mov
was occurring unconditionally before the dec
, now it's conditional (oops) and after the dec
(double oops)
Since free_user_ns
is relatively small, should make the reproducer concise and getting it easy.
Specifically, the linux-4.4.y branch of the stable tree of the Linux kernel with CONFIG_USER_NS enabled; kernel/user_namespace.o seems to the the culprit of an object file bisection performed by Jian. Still digging in what's broken about it.
We were able to get physical access to the device.
We have bisected down to the object file affected.
I'm working through pinpointing where things are going wrong now.
As it's currently looking, I don't think we can hold the release for this.
We are still actively debugging this. We have found that the devices actually boot the first time but reboot when system UI is restarted for any reason e.g. tests. And the problem indeed starts from the mentioned LLVM patch.
Lack of access to a physical device has unfortunately made debugging really slow.
Is there any news on this?
< The second link doesn't have any test results? Or do you mean you manually tested it at that revision?
< Can you give some more detail? I'm not familiar with chrome's build system.
Sorry an example of the failure can be found as follows. Search for "System rolled back" in the log.
< Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output?
SG. I'm working with Nick on this.
The second link doesn't have any test results? Or do you mean you manually tested it at that revision?
Can you give some more detail? I'm not familiar with chrome's build system.
Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output?
Right, I applied https://reviews.llvm.org/rG60433c63acb71935111304d71e41b7ee982398f8 on http://crrev.com/c/2372837/14 and the issue still reproduced. Another later version LLVM that included 0433c63acb71935111304d71e41b7ee982398f8 also had the same boot failure http://crrev.com/c/2361864/2.
Are you saying that the issue described in https://github.com/ClangBuiltLinux/linux/issues/1085 was not actually fixed?
Or is this a distinct issue that didn't exhibit there for some reason?
60433c63acb71935111304d71e41b7ee982398f8 (Remove TwoAddressInstructionPass::sink3AddrInstruction. ) was supposed to be a fix but as per https://crbug.com/1123683 , the boot issue is still present after it.
assigned to @jyknight
Extended Description
While building with recent LLVM versions, we found X86 ChromeOS images with kernel 4.4 all failed to boot. We bisected it to the following commit.
commit 4b0aa5724feaa89a9538dcab97e018110b0e4bc3 Author: James Y Knight jyknight@google.com Date: Fri May 15 23:43:30 2020 -0400
Change the INLINEASM_BR MachineInstr to be a non-terminating instruction.
Link: https://crbug.com/1123683