trulyspinach / SMCAMDProcessor

Power management, monitoring and VirtualSMC plugin for AMD processors
BSD 3-Clause "New" or "Revised" License
1.04k stars 90 forks source link

Spinlock acquisition timeout crash #84

Open nehbit opened 4 years ago

nehbit commented 4 years ago

Hi there,

Thanks for this kext — both the app and power management has been super useful. I've just experienced a kernel panic that could have been caused by this, just wanted to pass over the crash report.

Potentially relevant circumstances:

Here's the report below. Hope it's useful.

*** Panic Report ***
Panic(CPU 2, time 1545753905538): NMIPI for spinlock acquisition timeout, spinlock: 0xffffff800eca02d0, spinlock owner: 0xffffff80b1ec3000, current_thread: 0xffffff80b1ec3000, spinlock_owner_cpu: 0x2
RAX: 0xffffff800ec0ee40, RBX: 0xffffff80a4a0ed20, RCX: 0x0000000000000000, RDX: 0x000003205df9c0b0
RSP: 0xffffff873b1ffe68, RBP: 0xffffff873b1ffea0, RSI: 0x00000321c2fcb40c, RDI: 0xffffff800ec0dbb0
R8: 0x00000000770119a1, R9: 0x0000000000000000, R10: 0x0000000000000000, R11: 0x0000000000000000
R12: 0x0000000000000000, R13: 0x0000000000000000, R14: 0xffffff800eb48e91, R15: 0x0000031fe6f84d88
RFL: 0x0000000000000046, RIP: 0xffffff800e4d57d0, CS: 0x0000000000000008, SS: 0x0000000000000010
Backtrace (CPU 2), Frame : Return Address
0xffffff873b1ffd00 : 0xffffff800e4db00b 
0xffffff873b1ffd60 : 0xffffff800e35be47 
0xffffff873b1ffea0 : 0xffffff800e4e4df3 
0xffffff873b1fff10 : 0xffffff800e4e4ad3 
0xffffff873b1fff50 : 0xffffff800e4e39d3 
0xffffff873b1fff70 : 0xffffff800e4db00b 
0xffffff873b1fffd0 : 0xffffff800e35bced 
0xffffff873bda34f0 : 0xffffff800e9f954b 
0xffffff873bda3580 : 0xffffff800e9f926f 
0xffffff873bda3620 : 0xffffff7f9141645e 
0xffffff873bda36c0 : 0xffffff7f9143c3ac 
0xffffff873bda3870 : 0xffffff7f914437b6 
0xffffff873bda3c20 : 0xffffff7f9145ab6a 
0xffffff873bda3ed0 : 0xffffff7f914421d0 
0xffffff873bda3ee0 : 0xffffff7f9143a5a6 
0xffffff873bda3f20 : 0xffffff7f9142e1e1 
0xffffff873bda3f40 : 0xffffff7f9142e190 
0xffffff873bda3f60 : 0xffffff7f9142e206 
0xffffff873bda3f80 : 0xffffff7f9142e47c 
0xffffff873bda3fa0 : 0xffffff800e35b0ce 
      Kernel Extensions in backtrace:
         com.apple.iokit.IOBluetoothFamily(6.0.14d3)[CDFAFD8A-7E75-3145-BC93-36196CEBA938]@0xffffff7f913f3000->0xffffff7f914ebfff
            dependency: com.apple.iokit.IOACPIFamily(1.4)[3E8E9870-2E41-33F6-ACBF-35DDC08F22B5]@0xffffff7f8f0c0000
            dependency: com.apple.iokit.IOUSBHostFamily(1.2)[BD06015A-414D-34DF-BE36-7B9034D7360D]@0xffffff7f8f639000
            dependency: com.apple.iokit.IOReportFamily(47)[0BC9F93B-456A-3D97-BE4C-69DCBB5E8A3C]@0xffffff7f8faf7000

BSD process name corresponding to current thread: kernel_task
Boot args: npci=0x2000 shikigva=80 

Mac OS version:
18G103

Kernel version:
Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64
Kernel UUID: C41337A1-0EC3-3896-A954-A1F85E849D53
Kernel slide: 0x000000000e000000
Kernel text base: 0xffffff800e200000
__HIB text base: 0xffffff800e100000
System model name: iMacPro1,1 (Mac-7BA5B2D9E42DDD94)

System uptime in nanoseconds: 1545753942672
last loaded kext at 717987179066: com.apple.filesystems.afpfs   11.1 (addr 0xffffff7f942fe000, size 356352)
loaded kexts:
(none)
mp_kdp_enter() timed-out on cpu 3, NMI-ing
Panic(CPU 0, time 1546253950445): NMIPI for spinlock acquisition timeout, spinlock: 0xffffff800eca02d0, spinlock owner: 0xffffff80b1ec3000, current_thread: 0xffffff80a382f520, spinlock_owner_cpu: 0x2
RAX: 0x0000000000000001, RBX: 0xffffff80a4a0ecc0, RCX: 0x0000000000000002, RDX: 0x000003205df9c100
RSP: 0xffffff871aedae70, RBP: 0xffffff871aedaea0, RSI: 0x00000321c2fcb344, RDI: 0xffffff800ec0dbb0
R8: 0x00000000770119a1, R9: 0x0000000000000000, R10: 0x0000000000000000, R11: 0x0000000000000000
R12: 0x0000000000000000, R13: 0x0000000000000000, R14: 0xffffff800eb48e91, R15: 0x0000031fe6f84cc0
RFL: 0x0000000000000002, RIP: 0xffffff800e4e5a39, CS: 0x0000000000000008, SS: 0x0000000000000010
Backtrace (CPU 0), Frame : Return Address
0xffffff871aedad10 : 0xffffff800e4db00b 
0xffffff871aedad70 : 0xffffff800e35be47 
0xffffff871aedaea0 : 0xffffff800e4e4df3 
0xffffff871aedaf10 : 0xffffff800e4e4ad3 
0xffffff871aedaf50 : 0xffffff800e4e39d3 
0xffffff871aedaf70 : 0xffffff800e4db00b 
0xffffff871aedafd0 : 0xffffff800e35bced 
0xffffffa76d99bf10 : 0xffffff800e4dd8ec 
0xffffffa76d99bf30 : 0xffffff800e3d0287 
0xffffffa76d99bf80 : 0xffffff800e3d0470 
0xffffffa76d99bfa0 : 0xffffff800e35b0ce 

BSD process name corresponding to current thread: kernel_task
Panic(CPU 5, time 1546253951537): NMIPI for spinlock acquisition timeout, spinlock: 0xffffff800eca02d0, spinlock owner: 0xffffff80b1ec3000, current_thread: 0xffffff80a48a1520, spinlock_owner_cpu: 0x2
RAX: 0xffffff80b1ec3000, RBX: 0xffffff800eca02d0, RCX: 0x0000000000000210, RDX: 0x0000032000000000
RSP: 0xffffff873bd07af0, RBP: 0xffffff873bd07b30, RSI: 0xffffff80a48a1520, RDI: 0xffffff800eca02d0
R8: 0x00000000770119a1, R9: 0x0000000000000002, R10: 0x0000000000000000, R11: 0x0000000000000000
R12: 0xffffff80a48a1520, R13: 0xffffff80b1ec3000, R14: 0x000003214bfc5f21, R15: 0x0000000000000000
RFL: 0x0000000000000097, RIP: 0xffffff800e3be412, CS: 0x0000000000000008, SS: 0x0000000000000010
Backtrace (CPU 5), Frame : Return Address
0xffffff873bd07990 : 0xffffff800e4db00b 
0xffffff873bd079f0 : 0xffffff800e35be47 
0xffffff873bd07b30 : 0xffffff800e3be5ba 
0xffffff873bd07b60 : 0xffffff800e4d07f5 
0xffffff873bd07ba0 : 0xffffff800e9f92f7 
0xffffff873bd07c30 : 0xffffff800e9fadd5 
0xffffff873bd07c70 : 0xffffff800ea1b1e2 
0xffffff873bd07dd0 : 0xffffff800ea1b155 
0xffffff873bd07e30 : 0xffffff7f94194650 
0xffffff873bd07e60 : 0xffffff800e4e59bc 
0xffffff873bd07ea0 : 0xffffff800e4e4df3 
0xffffff873bd07f10 : 0xffffff800e4e4ad3 
0xffffff873bd07f50 : 0xffffff800e4e39d3 
0xffffff873bd07f70 : 0xffffff800e4db00b 
0xffffff873bd07fd0 : 0xffffff800e35bced 
0xffffff873b61bf10 : 0xffffff800e4dd8ec 
0xffffff873b61bf30 : 0xffffff800e3d0287 
0xffffff873b61bf80 : 0xffffff800e3d0470 
0xffffff873b61bfa0 : 0xffffff800e35b0ce 
      Kernel Extensions in backtrace:
         wtf.spinach.AMDRyzenCPUPowerManagement(0.6.4)[1C8D6AF1-3F8A-3298-BCAE-5CAEE5D4A19B]@0xffffff7f9418e000->0xffffff7f941a1fff
            dependency: as.vit9696.Lilu(1.4.5)[E42CE60E-EC0B-33AE-A513-5383B81BF165]@0xffffff7f93ff9000
            dependency: com.apple.iokit.IOPCIFamily(2.9)[BFDEB8D4-50FE-3DDB-87B7-F6A504393830]@0xffffff7f8ec95000
            dependency: com.apple.kec.Libm(1)[EF5A0705-9EFD-3232-A6B3-B61DCA161EA7]@0xffffff7f8fc2a000

BSD process name corresponding to current thread: kernel_task
Panic(CPU 6, time 1546253951657): NMIPI for spinlock acquisition timeout, spinlock: 0xffffff800eca02d0, spinlock owner: 0xffffff80b1ec3000, current_thread: 0xffffff80a48a1000, spinlock_owner_cpu: 0x2
RAX: 0x0000000000000001, RBX: 0xffffff80a4a0ed20, RCX: 0x0000000000000002, RDX: 0x000003205df9c330
RSP: 0xffffff873bcebe60, RBP: 0xffffff873bcebe60, RSI: 0x00000321c2fcb68c, RDI: 0xffffff800ec0dbb0
R8: 0x00000000770119a1, R9: 0x0000000000000000, R10: 0x0000000000000000, R11: 0x0000000000000000
R12: 0x0000000000000000, R13: 0x0000000000000000, R14: 0xffffff800eb48e91, R15: 0x0000031fe6f85008
RFL: 0x0000000000000082, RIP: 0xffffff800e4d57eb, CS: 0x0000000000000008, SS: 0x0000000000000010
Backtrace (CPU 6), Frame : Return Address
0xffffff873bcebd00 : 0xffffff800e4db00b 
0xffffff873bcebd60 : 0xffffff800e35be47 
0xffffff873bcebe60 : 0xffffff800e4e5a37 
0xffffff873bcebea0 : 0xffffff800e4e4df3 
0xffffff873bcebf10 : 0xffffff800e4e4ad3 
0xffffff873bcebf50 : 0xffffff800e4e39d3 
0xffffff873bcebf70 : 0xffffff800e4db00b 
0xffffff873bcebfd0 : 0xffffff800e35bced 
0xffffff873b673f10 : 0xffffff800e4dd8ec 
0xffffff873b673f30 : 0xffffff800e3d0287 
0xffffff873b673f80 : 0xffffff800e3d0470 
0xffffff873b673fa0 : 0xffffff800e35b0ce 

BSD process name corresponding to current thread: kernel_task
Panic(CPU 1, time 1546253950555): NMIPI for spinlock acquisition timeout, spinlock: 0xffffff800eca02d0, spinlock owner: 0xffffff80b1ec3000, current_thread: 0xffffff80a48ae000, spinlock_owner_cpu: 0x2
RAX: 0xffffff80b1ec3000, RBX: 0xffffff800eca02d0, RCX: 0x000000000000002c, RDX: 0x0000032000000000
RSP: 0xffffff873bcffaf0, RBP: 0xffffff873bcffb30, RSI: 0xffffff80a48ae000, RDI: 0xffffff800eca02d0
R8: 0x00000000770119a1, R9: 0x0000000000000000, R10: 0x0000000000000000, R11: 0x0000000000000000
R12: 0xffffff80a48ae000, R13: 0xffffff80b1ec3000, R14: 0x000003214bfc47d9, R15: 0x0000000000000000
RFL: 0x0000000000000087, RIP: 0xffffff800e3be410, CS: 0x0000000000000008, SS: 0x0000000000000010
Backtrace (CPU 1), Frame : Return Address
0xffffff873bcff990 : 0xffffff800e4db00b 
0xffffff873bcff9f0 : 0xffffff800e35be47 
0xffffff873bcffb30 : 0xffffff800e3be5ba 
0xffffff873bcffb60 : 0xffffff800e4d07f5 
0xffffff873bcffba0 : 0xffffff800e9f92f7 
0xffffff873bcffc30 : 0xffffff800e9fadd5 
0xffffff873bcffc70 : 0xffffff800ea1b1e2 
0xffffff873bcffdd0 : 0xffffff800ea1b155 
0xffffff873bcffe30 : 0xffffff7f94194650 
0xffffff873bcffe60 : 0xffffff800e4e59bc 
0xffffff873bcffea0 : 0xffffff800e4e4df3 
0xffffff873bcfff10 : 0xffffff800e4e4ad3 
0xffffff873bcfff50 : 0xffffff800e4e39d3 
0xffffff873bcfff70 : 0xffffff800e4db00b 
0xffffff873bcfffd0 : 0xffffff800e35bced 
0xffffff873b0abf10 : 0xffffff800e4dd8ec 
0xffffff873b0abf30 : 0xffffff800e3d0287 

EOF
Model: iMacPro1,1, BootROM 1037.120.87.0.0, 8 processors, 3.99 GHz, 64 GB, SMC 
Graphics: Radeon RX Vega 64, Radeon RX Vega 64, spdisplays_pcie_device, 8 GB
Memory Module: BANK 0/DIMM0, 16 GB, DDR4, 2933 MHz, G-Skill, F4-3000C16-16GTZR
Memory Module: BANK 1/DIMM0, 16 GB, DDR4, 2933 MHz, G-Skill, F4-3000C16-16GTZR
Memory Module: BANK 2/DIMM0, 16 GB, DDR4, 2933 MHz, G-Skill, F4-3000C16-16GTZR
Memory Module: BANK 3/DIMM0, 16 GB, DDR4, 2933 MHz, G-Skill, F4-3000C16-16GTZR
AirPort: spairport_wireless_card_type_airport_extreme (0x14E4, 0x133), Broadcom BCM43xx 1.0 (7.77.61.2 AirPortDriverBrcmNIC-1305.8)
Bluetooth: Version 6.0.14d3, 3 services, 27 devices, 1 incoming serial ports
Network Service: Ethernet, Ethernet, en0
Serial ATA Device: WDC WDS100T2B0B, 1 TB
Serial ATA Device: SanDisk SDSSDH3 1T02, 1.02 TB
USB Device: USB 3.0 Bus
USB Device: 4-Port USB 3.0 Hub
USB Device: 4-Port USB 2.0 Hub
USB Device: Magic Keyboard with Numeric Keypad
USB Device: USB2.0 Hub
USB Device: USB Receiver
USB Device: USB 3.1 Bus
USB Device: Bluetooth USB Host Controller
USB Device: USB 3.1 Bus
USB Device: Cam Link 4K
Thunderbolt Bus: 
aluveitie commented 4 years ago

Thanks for the detailed report. I've run into the same issue, 15-30min after sleep MacOS completely froze. #70 fixed that issue for me, no panic in the last two weeks.

nehbit commented 4 years ago

Thank you! Yes, you're correct, I'm also having the complete freeze, updated the report as such. I'll try and report back whether it makes a difference.

nehbit commented 4 years ago

I've compiled your sleep-restore-state branch and it has not crashed after sleep for the past hour just yet. However, the AMDPowerGadget says No AMDRyzenCPUPowerManagement Found! and quits. Is this supposed to happen, or did I make a mistake in the compile process? I don't usually use Xcode or Swift, so it's quite possible that I did.

aluveitie commented 4 years ago

All fine, the branch is based on an older state that is probably not compatible with the latest version of the gadget. I can update it tomorrow.

nehbit commented 4 years ago

Thank you! It'd be great — the power gadget is quite nice to have access to.

aluveitie commented 4 years ago

I've rebased the pull request to latest master. With that the latest AMDPowerGadget should work again.

nehbit commented 4 years ago

I got a chance to compile + run your rebased version — for some reason it still doesn't appear to be compatible with the app, not entirely sure why. (I pulled the updates to the branch on your repo, not sure if there's a way to pull directly from a pull request)

aluveitie commented 4 years ago

Right, the last release is a little behind master. I'm also using a modified version of the App. You could just build the App from latest master/my branch too, that will work.

nehbit commented 4 years ago

Good idea, I should do that.

By the way, I discovered that the 'Automatic Speed Adjustment' setting on the app shaves a good 150 points off the machine's Cinebench R15 score (Enabled: 1570~, disabled, 1750~). Reading the description a little closely, it does make sense, but it's a little unintuitive considering that it does not appear to actually slow down my 2700x at idle — the idle power consumption is around 30W, regardless of it is on or off. This is with the vanilla, non-patched version, not with the pull request.

aluveitie commented 4 years ago

The difference is small in idle as long as core performance boost is disabled. In idle my CPU at 33C independent of P-State, but if I've CPB enabled then it gets up to 50C in P0 without any load. In that case the power adjustment effectively disables boosting when idle.

It lowers benchmark scores a little yet it does not really affect real world usage. But it keeps the fans from spinning up all the time with CPB enabled.

nehbit commented 4 years ago

Interesting. I have CPB and CPB Overdrive enabled and maxed out. My idle temp is 40-41C, at which point fans are at the lowest possible rate. I've also set the CPU and chassis fans to ramp up aggressively, so at full load the CPU is at 145W with peaks at 155W, and the temperature stabilises around 74C at 100% on all cores.

My work involves hammering the CPU with hundreds of compiles every day, so even shaving a few seconds off each is meaningful improvement.

Compared to the 2015 Macbook Pro I was using just two weeks ago it's definitely been worth the pain getting it working, though. 🙂

trulyspinach commented 4 years ago

There should not be any difference for your compilation jobs. Cinebench is very sensitive when your score is high enough. If I were you I would not worry too much about few hundred difference in Cinebench score when it is already above 7000. In fact I feel like compilation will most likely cap by IO unless every single of your source files is 10000 lines long :). The current configuration of this kext will for sure put all of your cores into the highest power mode when you compiling source codes. Just make sure you also have CPB enabled.

nehbit commented 4 years ago

I did some totally non-scientific testing with a part of our pipeline. Here are the results:

Webpack at High

88558ms
87616ms
88848ms

88340 avg

= 100% speed

Webpack at Auto

87455ms
88303ms
90114ms

88624 avg

= 99.6% speed

Difference

High is faster by
%0.4

In essence, you're right. Subjectively speaking, having it on High all the time does feel slightly more responsive but I'm almost certain that's placebo.

Edit / more context: What is of note, though, is that my Macbook Pro 2015 with i7-4870HQ is also completing the task at around 90 to 100 seconds bracket as well, but the systems are worlds apart in responsiveness in almost any other task, or when doing other tasks while this compile is going.

Interestingly, Ryzen retains most of its snappiness from its 'unloaded' state even when I'm running make as another part of our compile pipeline on 16 logical cores at 100% utilisation, while doing the same on my MBP2015 at 8 logical cores largely renders the machine unusable for the duration, gasping for air. I guess raw clock-speed is not the only difference between those cores — for the lack of a better word, Ryzen feels like it has much more breathing space in its compute pipeline than my laptop, even if on paper, on a single-core task they perform almost exactly the same.

trulyspinach commented 4 years ago

It is too hard to control other factors. OS can schedule some other task to do at any time. And a higher CPU temp could also have negative effect on CPB performance.

When you've mention compiling thing I thought it is CMake projects, etc. I don't have many experience with Webpack but from my limited experience i believe that it was running mostly in a single thread. And maybe it will fetch something from internet that will slow down the speed? You will see a huge different if you compile some big C project like XNU kernel :) That is the main reason I love this new Ryzen processor so much.

nehbit commented 4 years ago

Yes, Webpack is entirely single threaded but it doesn’t download things in the compile process. It’s only a party of our compile process though, we have a lot of native binaries emitted from make / clang as well.

nehbit commented 4 years ago

Bringing this issue back to the spinlock crash I've been experiencing, I've discovered that the reason why the app wasn't seeing the kext isn't the version mismatch ... but because the compiled kext wasn't being loaded. /facepalm.

@aluveitie, do you have any idea why the kext would not load? It compiles fine with no errors, and appears to generate well-formed build artifacts. It just seems that they're not being loaded when fed into the appropriate EFI/Clover folder.

I've also managed to pinpoint this spinlock issue (which happened for the 6th time today, I have the crash reports if needed) to it happening after a sleep, and the pair of kexts from this repo being loaded. If they are not, the spinlock crash does not happen, with or without sleep. If they are, it always happens, though the time it happens varies fairly wildly, between 0 to ~3 hours after wake from sleep. It you wake the computer in the morning from sleep, within the day it will certainly crash once with a full screen freeze due to spinlock, usually before noon.

To make the kexts load I checked their code signatures, and tried them with their code signatures removed as well. No dice, the compiled kexts are not being loaded. Any ideas why they might not be loaded? I'd normally check logs, but I don't know where the logs go for something that happens this early in the boot process, or even if the logs for this exists at all.

aluveitie commented 4 years ago

@nehbit I never used Clover so I have no idea. It always worked for me with OpenCore.

nehbit commented 4 years ago

I think it's probably something with my compile not being right because the releases from this (non-forked) repository work fine, it's just the ones I compile that have the issue. I'll spend some more time working on it I guess — could be something like a Lilu version mismatch. If you're inclined, it'd be super useful to have a compiled version of both kexts as a Github release on your fork, or to have this pull request merged so we can get the fix into the official repository and have official builds.

Anyway, thanks for all the help!

aluveitie commented 4 years ago

You could try mine: AMDRyzenCPUPowerManagement.kext.zip

(it also contains some minor modifications to the power management algorithm/default settings but that should not make any difference)

nehbit commented 4 years ago

I got yours in and it definitely loads now. Thank you. I'll report back in a few days to see if I've seen any more spinlock crashes. Should be fairly fast, I usually get one every day.

One interesting side effect is that after installing your kext, my CPU for the first time started to go below 3.7ghz — its idle power consumption dropped from 30~Watts to 13. Sweet! Though I'm not sure why this was not working with the unmodified kext.

aluveitie commented 4 years ago

My kext has automatic speed adjustment and lower power state enabled and CPB disabled by default.

nehbit commented 4 years ago

Interesting, then it's definitely working. I had the lower power states enabled already on the unmodified version but it had not made a difference. This time though, it actually does work. Oh well — let's see how this fares, shouldn't take much longer to figure out if it helps with the spinlock issues.

nehbit commented 4 years ago

Reporting back after two days and several sleep cycles. No spinlock crash since then, no performance regressions, everything seems fine, while the non-modified kext is reliably crashing after sleep. I recommend this pull request (#70) be merged if there are no other concerns.

Here are my system details:

aluveitie commented 4 years ago

@trulyspinach Since we didn't hear anything back it looks like with that pull requests it's working much better. Any chance to get that in?

trulyspinach commented 4 years ago

@aluveitie Sorry for the delay, I just have been very busy these days. I promise I will look into the merge ASAP :)

borygo77 commented 4 years ago

@nehbit how did you hot this working with Clover? Can I have a look on your efi folder please? I just can't get it to work with clover despite system runs totally fine 🤔 Sorry to steal your topic 😉