llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.62k stars 11.83k forks source link

X86 ChromeOS devices with kernel 4.4 fails to boot after 4b0aa5724feaa89a9538dcab97e018110b0e4bc3 #46812

Closed jcai19 closed 4 years ago

jcai19 commented 4 years ago
Bugzilla Link 47468
Resolution FIXED
Resolved on Sep 22, 2020 02:45
Version trunk
OS All
Blocks llvm/llvm-project#4440 llvm/llvm-project#46070
CC @jcai19,@cmtice,@topperc,@dwblaikie,@zmodem,@jyknight,@lalozano,@m-gupta,@nathanchance,@nickdesaulniers,@stephenhines

Extended Description

While building with recent LLVM versions, we found X86 ChromeOS images with kernel 4.4 all failed to boot. We bisected it to the following commit.

commit 4b0aa5724feaa89a9538dcab97e018110b0e4bc3 Author: James Y Knight jyknight@google.com Date: Fri May 15 23:43:30 2020 -0400

Change the INLINEASM_BR MachineInstr to be a non-terminating instruction.

Link: https://crbug.com/1123683

zmodem commented 4 years ago

Cherry-picked to 11.x as 410b0dc84bbdafabe3a2c3eedd96e50340a6e0d0 and 6250d4944539f67d6a605928e97c087fe306a79e. Thanks everyone!

nickdesaulniers commented 4 years ago

Manoj notes that the fix will depend on

commit a0385bd7acd6 ("[llvm] Add contains(KeyType) -> bool methods to SmallPtrSet")

which doesn't look like it's in release/11.x to me.

So we'll likely want to cherry pick the following 2 patches to close this out:

1: commit a0385bd7acd6 ("[llvm] Add contains(KeyType) -> bool methods to SmallPtrSet")

2: commit f7a53d82c090 ("#47468 : Fix findPHICopyInsertPoint, so that copies aren't incorrectly inserted after an INLINEASM_BR.")

nickdesaulniers commented 4 years ago

Manoj reports the CrOS tests look good:

https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/2417398/3#message-18cdbb29241e515a94ffc182889a7c9c2b335ec7

Nathan reports clang-11+patch and ToT+patch look good:

tot: https://paste.myself5.de/raw/aciqosofos clang-11: https://del.dog/raw/xapegagywo (tot showed one failure, but that's due to a missed cherry pick for an unrelated breakage upstream in kernel sources)

The fixed was merged on master in: https://reviews.llvm.org/rGf7a53d82c0902147909f28a9295a9d00b4b27d38

Let's get this cherry-picked over to release/11.x and close this out. Hans?

jyknight commented 4 years ago

Fix committed to master as f7a53d82c0902147909f28a9295a9d00b4b27d38, which cherry-picks cleanly to 11.x.

@​Hans: shall I push to 11.x?

nickdesaulniers commented 4 years ago

All of my kernel build+boot tests with release/11.x + https://reviews.llvm.org/D87865 passed.

nickdesaulniers commented 4 years ago

All of my tests with this on ToT LLVM look good. Going to rerun them again with this on top of release/11.x.

jcai19 commented 4 years ago

The verification I've done so far is limited to one test device. Will start a more complete test with the patch.

jcai19 commented 4 years ago

I verified https://reviews.llvm.org/D87865 fixed the issue locally.

nickdesaulniers commented 4 years ago

Thanks!

Jian and Manoj are testing this on CrOS kernels. Nathan is testing this on various linux kernel distro's configs. I'm testing this with kernel patches that use asm goto w/ outputs, and our 65 various kernel tree and configs: https://travis-ci.com/github/ClangBuiltLinux/continuous-integration.

It will take me at least an hour and a half to two hours to get through all of those.

Hans, I suspect we can have the patch ready and well qualified by tonight (US time, maybe next 6-8 hours).

jyknight commented 4 years ago

I've got the "Slightly better fix" posted for review now: https://reviews.llvm.org/D87865

Please verify this works for you as well, thanks!

jcai19 commented 4 years ago

Thanks for the fix. I've verified locally it fixed the issue in a test device.

jyknight commented 4 years ago

The bug is in llvm::findPHICopyInsertPoint.

We're correctly hitting the special case for placement in a block with asm-goto or invoke. Then, the code places the copy after the last use or def of the register. Unfortunately, placing after the last use is wrong -- inlineasm_br uses the register, in this example. Yet, the whole point of the special-case is to find a place to insert the copy before that instruction.

I believe just placing the copy after last DEF, not after the last use or def, will be a correct fix. E.g.:

--- a/llvm/lib/CodeGen/PHIEliminationUtils.cpp +++ b/llvm/lib/CodeGen/PHIEliminationUtils.cpp @@ -34,7 +34,7 @@ llvm::findPHICopyInsertPoint(MachineBasicBlock MBB, MachineBasicBlock SuccMBB, // Discover any defs/uses in this basic block. SmallPtrSet<MachineInstr*, 8> DefUsesInMBB; MachineRegisterInfo& MRI = MBB->getParent()->getRegInfo();

However, that will have the effect of placing the copy earlier than necessary in the block, which might increase register pressure. I'm going to try to make a slightly better fix.

nickdesaulniers commented 4 years ago

fwiw, dumping the output of llc -O2 -print-before-all repro.ll -o - to a file and comparing that before vs after the changes in question, things start to go visibly wrong after

Eliminate PHI nodes for register allocation

pass. We have:

(so we were doing mov then inline asm blob; after we have those reversed, which is bad)

nickdesaulniers commented 4 years ago

repro.ll smaller reproducer

$ llc -O2 repro.ll

Specifically: .Ltmp1: lock decl 192(%r14) je .Ltmp0

NO_APP

    movq    %r14, %rbx

It seems like the movq %r14, %rbx would occur before the lock'd decl unconditionally prior to the patches in question in clang-11.

nickdesaulniers commented 4 years ago

free_user_ns__user_namespace.ll Attaching reproducer extracted via:

$ llvm-extract --func=free_user_ns user_namespace.ll -o free_user_ns__user_namespace.ll

It looks like a BB terminated with callbr has an indirect target back to itself.

Will work on reducing this further via bugpoint or such.

nickdesaulniers commented 4 years ago

Comparing the disassemblies between:

  1. commit 78c69a00a4cff786e0
  2. commit 78c69a00a4cff786e0 + cherry pick 4b0aa5724feaa + cherry pick 78c69a00a4cff

I can see a few differences, which look pretty obviously wrong to me, for example the relatively small fn free_user_ns:

example:

-     2ab: 4c 89 f3                             movq    %r14, %rbx
-     2ae: f0                                   lock
-     2af: 41 ff 8e c0 00 00 00                 decl    192(%r14)
-     2b6: 74 d2                                je      0x28a <free_user_ns+0xa>
+     2ab: f0                                   lock
+     2ac: 41 ff 8e c0 00 00 00                 decl    192(%r14)
+     2b3: 74 d5                                je      0x28a <free_user_ns+0xa>
+     2b5: 4c 89 f3                             movq    %r14, %rbx

so the mov was occurring unconditionally before the dec, now it's conditional (oops) and after the dec (double oops)

Since free_user_ns is relatively small, should make the reproducer concise and getting it easy.

nickdesaulniers commented 4 years ago

Specifically, the linux-4.4.y branch of the stable tree of the Linux kernel with CONFIG_USER_NS enabled; kernel/user_namespace.o seems to the the culprit of an object file bisection performed by Jian. Still digging in what's broken about it.

nickdesaulniers commented 4 years ago

We were able to get physical access to the device.

We have bisected down to the object file affected.

I'm working through pinpointing where things are going wrong now.

zmodem commented 4 years ago

As it's currently looking, I don't think we can hold the release for this.

m-gupta commented 4 years ago

We are still actively debugging this. We have found that the devices actually boot the first time but reboot when system UI is restarted for any reason e.g. tests. And the problem indeed starts from the mentioned LLVM patch.

Lack of access to a physical device has unfortunately made debugging really slow.

zmodem commented 4 years ago

Is there any news on this?

jcai19 commented 4 years ago

< The second link doesn't have any test results? Or do you mean you manually tested it at that revision?

< Can you give some more detail? I'm not familiar with chrome's build system.

Sorry an example of the failure can be found as follows. Search for "System rolled back" in the log.

https://00e9e64bac44fd8b1f073164b070e39c7d3b67d7d5b582cb85-apidata.googleusercontent.com/download/storage/v1/b/chromeos-autotest-results/o/swarming-4e5b5cee0a6c6a10%2Fprovision_chromeos6-row6-rack4-host11%2Fdebug%2Fautoserv.DEBUG?qk=AD5uMEs2QtbXhnrDyvEWABvDxkI31TZKCEntN_pMZC_YxVXitalTlHGSfO1zzo_oipdCAkEu61HjA6U3J8F7Ic7xGE9xfpc1eZ6PPWPbKnylwnx7qVIENybBBiRRXlmPaIBwFzWsapdcd5efVuhJeTND3ElAb_1zKOnAkN9aKrE6n1tnMXzA3iVNRSX_DsLyEfufkGd8MyE9J6Yy2Jyqx9WFyHleb-_e-Mzpj9B40zsrihD5YUX1Qd_Pwaoi_r3yCatzHTBDLMncK4Kd7pWYcVrI6i3cwsJONChaqC7iWFkCD9i572ctlCLZ3ARCpDKNVjENXtpV9awJGVihZrHEXS4rR-SsTO7dfqODQa7GxovexnOfTb9xqMDhjCuoZ4pqY6Vfcaw9vgfqm3dYFop2bhZWDiGTZOPJHLPc4MLUEuW0T7mxg59bQQxl47caJYP6579czEGIlGgojj6qn_GdLs2QwRwNOICUMcqYGmDGaCQZRdnw59xDLUHD9qWrR8lr-QORk86qNRnIdPlwa3eXQI7eXSm2MmvGyCTX-YS5Z_PSp2-qSirYAv5fIcOetZL12vV-0rjurU-TN3GJ9zxBuJqzbwZSvyVXQ8q5WwNWkia7a-oM_yRShQGBCnzN9xBKS_7B8a_m8TNdRSycpzKzlt6MNGAmGYz-oYDMmtEwRSysgoW8qRNZ_7pc8UHCnZ9NyrPKOzLjkk96pBcdQ1677n5rRRcuE68SGd3mHNUN8C5AGG6k-un_Av5iNMCX3sMeRHRn3eL27nFaHZGaTZ_PRqhLCdZd-B0i95_YKJnW1NkGa5QMBtnn1Um9XnFs4ZDVgf0Z0apiQ8aDPaRnrzWNTpwCiYgEL_PiSMTtQKXJ8Rz4zCOr0aLBSK2v51yxSkJPkser3lBFmsq3RtOyArhuZ4KuyhZJosDi1Wpi-Ti9DnzLq6zwRqt6Bms&isca=1

< Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output?

SG. I'm working with Nick on this.

jyknight commented 4 years ago

The second link doesn't have any test results? Or do you mean you manually tested it at that revision?

Can you give some more detail? I'm not familiar with chrome's build system.

Perhaphs you could follow the same steps Nick used in the bug I linked to, to find what changed in the assembly output?

jcai19 commented 4 years ago

Right, I applied https://reviews.llvm.org/rG60433c63acb71935111304d71e41b7ee982398f8 on http://crrev.com/c/2372837/14 and the issue still reproduced. Another later version LLVM that included 0433c63acb71935111304d71e41b7ee982398f8 also had the same boot failure http://crrev.com/c/2361864/2.

jyknight commented 4 years ago

Are you saying that the issue described in https://github.com/ClangBuiltLinux/linux/issues/1085 was not actually fixed?

Or is this a distinct issue that didn't exhibit there for some reason?

m-gupta commented 4 years ago

60433c63acb71935111304d71e41b7ee982398f8 (Remove TwoAddressInstructionPass::sink3AddrInstruction. ) was supposed to be a fix but as per https://crbug.com/1123683 , the boot issue is still present after it.

jcai19 commented 4 years ago

assigned to @jyknight