Closed marc-hb closed 2 years ago
this could be the result of an upstream merge. The call trace is new but in the past when we had rtcwake errors they seemed to be 100% correlated with this 'task refusing to freeze' issue.
yeah, this is not a new error, it's just that we can see it Commit 8412dbd642585107e29b20920876de98e7df5819 ('PM / sleep: Show freezing tasks that caused a suspend abort') added more information.
from the commit message it seems we have a task that is still freezing while we get a wake-up, a possible indication of a race condition.
This is excellent that we can trap this instead of seeing rtcwake fail miserably.
If I look at the call trace above, it seems the 'aplay' userspace task initiates an ioctl while we are trying to freeze all tasks and suspend, and that ioctl will try to bring the card back in D0 state.
that looks like a race condition to me, and I am not sure why a userspace task is permitted to do this.
if I look at snd_power_ref_and_wait() it seems there's a possibility of an infinite loop where we never return anything. Going back to D0 state could take some time, typically hundreds of ms, so maybe this happens because we report an error freezing a task because a power-related task is in progress.
@tiwai @perexg I could use your guidance on this. Not sure what is happening here.
Hrm, weird. It's the place where suspend_prepare() calling the freezer, right? If so, the card power state must not be other than D0 at that point, as it's before the actual device suspend procedure takes place. (For ASoC, soc_suspend() calls snd_power_change_state() with D3hot, and this should be called after the freezer.)
The task flags (0x04) shows PF_EXITING, so this is a process being shut down, as it seems. But it still doesn't clarify why the loop for power D0 matters... We need to check whether it's really in that loop waiting for the power state D0 or not.
Does it have any influence if we replace wake_up(&card->power_sleep) with wake_up_all(&card->power_sleep) in snd_power_change_state() defined in include/sound/core.h?
Or a patch like below might help? 0001-ALSA-core-Simplify-sndpower-with-the-standard-macr.patch.txt
@tiwai I don't fully follow your explanations - suspend_prepare doesn't seem to be part of sound/ and not clear either on what ASoC does with D3hot. I clearly don't have any understanding of the freeze part.
I added a simple check to trace when we start the loop to go back to D0, we'll see if this confirms the analysis. https://github.com/thesofproject/linux/pull/3374
Yes, that'll be helpful for understanding the situation. Also, it'd be good to have debug prints at snd_power_change_state(), too.
it seems we have a 100% reproducible case on a Chromebook, see https://sof-ci.01.org/linuxpr/PR3374/build6987/devicetest/?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-playback-5
[ 1080.546679] kernel: Restarting tasks ...
[ 1080.549004] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_ref_and_wait: starting loop, power state 512
[ 1080.555947] kernel: done.
[ 1080.558166] kernel: PM: suspend exit
[ 1080.560144] kernel: rfkill: input handler enabled
[ 1080.576412] kernel: rfkill: input handler disabled
[ 1080.593997] kernel: rt5682 i2c-10EC5682:00: div[0]=1
[ 1080.594003] kernel: rt5682 i2c-10EC5682:00: div[1]=2
[ 1081.040086] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1081.040123] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1081.540251] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1081.540293] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1085.770336] kernel: PM: suspend entry (s2idle)
[ 1085.771523] kernel: Filesystems sync: 0.001 seconds
[ 1085.773382] kernel: Freezing user space processes ...
[ 1105.774200] kernel: Freezing of tasks failed after 20.005 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 1105.774508] kernel: task:aplay state:D stack:12928 pid:22896 ppid: 22820 flags:0x00000004
[ 1105.774526] kernel: Call Trace:
[ 1105.774530] kernel:
[ 1105.774540] kernel: __schedule+0x2ac/0x850
[ 1105.774562] kernel: schedule+0x49/0xc0
[ 1105.774572] kernel: schedule_timeout+0x1d5/0x340
[ 1105.774581] kernel: ? del_timer_sync+0x80/0x80
[ 1105.774604] kernel: snd_power_ref_and_wait+0xdd/0x130 [snd]
[ 1105.774626] kernel: ? wake_up_q+0x80/0x80
[ 1105.774641] kernel: snd_power_wait+0xd/0x50 [snd]
[ 1105.774663] kernel: snd_pcm_common_ioctl+0x71/0x11c0 [snd_pcm]
[ 1105.774691] kernel: ? selinux_file_ioctl+0x16f/0x210
[ 1105.774704] kernel: ? lock_release+0x14b/0x3f0
[ 1105.774721] kernel: snd_pcm_ioctl+0x1e/0x30 [snd_pcm]
[ 1105.774742] kernel: __x64_sys_ioctl+0x7b/0xb0
[ 1105.774756] kernel: do_syscall_64+0x38/0x90
[ 1105.774764] kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1105.774773] kernel: RIP: 0033:0x7f873608b50b
[ 1105.774780] kernel: RSP: 002b:00007fff07897bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1105.774788] kernel: RAX: ffffffffffffffda RBX: 0000559a3302bef0 RCX: 00007f873608b50b
[ 1105.774793] kernel: RDX: 00007fff07897bc0 RSI: 0000000040184150 RDI: 0000000000000004
[ 1105.774797] kernel: RBP: 0000559a3302be70 R08: 0000000000000000 R09: 0000559a3302d120
[ 1105.774802] kernel: R10: 0000559a33013010 R11: 0000000000000246 R12: 0000559a3302d100
[ 1105.774806] kernel: R13: 0000000000000010 R14: 0000000000000000 R15: 0000000000001000
This looks like a sequence of suspend-resumes where we try to restore the D0 state when resuming but somehow never reach that state, and we suspend with an active thread?
ASoC does the resume in a deferred work, and that might be the reason? The work is synced at snd_soc_suspend(), but it's called after user process freezing, hence it's too late.
Thanks @tiwai. Would it change anything if we did the workqueue sync in a .prepare callback?
I guess this won't fit,the PM prepare callback is handled at a later stage, IIRC. But I might be wrong, better to test actually, of course.
BTW, could you check whether the patch I attached in the above makes any difference?
I added the traces in snd_power_change_state, I don't really understand the results though
[ 1076.188123] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: requesting state 512 in snd_power_change_state
[ 1076.188130] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: waking up power_sleep in snd_power_change_state
[ 1076.188133] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: done in snd_power_change_state
[ 1076.189413] kernel: OOM killer enabled.
[ 1076.189416] kernel: Restarting tasks ...
[ 1076.191592] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_ref_and_wait: starting loop, power state 512
[ 1076.196926] kernel: done.
[ 1076.207334] kernel: PM: suspend exit
[ 1076.211122] kernel: rfkill: input handler enabled
[ 1076.224852] kernel: rt5682 i2c-10EC5682:00: div[0]=1
[ 1076.224859] kernel: rt5682 i2c-10EC5682:00: div[1]=2
[ 1076.231123] kernel: rfkill: input handler disabled
[ 1076.681421] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1076.681462] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1077.181638] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1077.181680] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1081.422331] kernel: PM: suspend entry (s2idle)
[ 1081.422860] kernel: Filesystems sync: 0.000 seconds
[ 1081.423863] kernel: Freezing user space processes ...
[ 1101.437333] kernel: Freezing of tasks failed after 20.007 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 1101.437674] kernel: task:aplay state:D stack:12968 pid:22894 ppid: 22818 flags:0x00000004
I will give it a try with the .prepare step. prepare is supposed to be before any suspend/freeze. I will test your patch later today.
@tiwai I added a .prepare and that's not helping. https://sof-ci.01.org/linuxpr/PR3374/build6991/devicetest/?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-playback-5 shows we have exactly the same error before the .prepare is invoked.
I'll try your patch now.
@tiwai unfortunately your patch doesn't seem to improve the results. It's the same issue that we wait for the device to go back to D0 and start a suspend before the resume is complete. You can see in https://github.com/thesofproject/linux/pull/3374/commits/5001794c5eff06fa5ad56b490676f6c68eb7bee5 that there are two traces, and only the first is found below. the log to indicate the end of the wait is not coming out.
[ 1083.637373] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: requesting state 512 in snd_power_change_state
[ 1083.637380] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: waking up power_sleep in snd_power_change_state
[ 1083.637384] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: done in snd_power_change_state
[ 1083.638706] kernel: OOM killer enabled.
[ 1083.638709] kernel: Restarting tasks ...
[ 1083.645772] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_ref_and_wait: starting wait, power state 512
[ 1083.649936] kernel: done.
[ 1083.657878] kernel: rfkill: input handler enabled
[ 1083.662982] kernel: PM: suspend exit
[ 1083.677169] kernel: rfkill: input handler disabled
[ 1083.685855] kernel: rt5682 i2c-10EC5682:00: div[0]=1
[ 1083.685860] kernel: rt5682 i2c-10EC5682:00: div[1]=2
[ 1084.129376] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1084.129428] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1084.629345] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1084.629396] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1088.871217] kernel: PM: suspend entry (s2idle)
[ 1088.873518] kernel: Filesystems sync: 0.002 seconds
[ 1088.875950] kernel: Freezing user space processes ...
[ 1108.879902] kernel: Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 1108.880179] kernel: task:aplay state:D stack:12968 pid:22895 ppid: 22819 flags:0x00000004
[ 1108.880195] kernel: Call Trace:
[ 1108.880199] kernel:
[ 1108.880208] kernel: __schedule+0x2ac/0x850
[ 1108.880229] kernel: schedule+0x49/0xc0
[ 1108.880242] kernel: snd_power_ref_and_wait.cold+0xdb/0xe1 [snd]
[ 1108.880266] kernel: ? wait_woken+0x70/0x70
[ 1108.880277] kernel: snd_power_wait+0xd/0x50 [snd]
[ 1108.880298] kernel: snd_pcm_common_ioctl+0x71/0x11c0 [snd_pcm]
[ 1108.880325] kernel: ? selinux_file_ioctl+0x16f/0x210
[ 1108.880338] kernel: ? xfd_validate_state+0x18/0x70
[ 1108.880354] kernel: snd_pcm_ioctl+0x1e/0x30 [snd_pcm]
https://github.com/thesofproject/linux/pull/3376 removes the workqueue entirely. Let's see what happens.
https://sof-ci.01.org/linuxpr/PR3376/build6994/devicetest/ shows no issues when ASoC resumes synchronously. Humm @lgirdwood will need your help on this one :-)
@plbossart Thanks for checking. My patch would be still good as a cleanup, so I'm going to submit in anyway with a slight rephrasing.
For syncing the deferred resume work, we might need to use the pm notifier. Could you try the patch below? soc-deferred-resume-notify.diff.txt
It fails right at the second iteration. Does the dmesg contains the first iteration's log? I was looking at the passing platforms and the logs are kind of short for the 5 test (more like one round).
In the log I can see the suspend from the previous step along with the wakeup. What makes it interesting is that we don't have the pcm: trigger stream 0 dir 0 cmd 1
after the wakeup (from the first suspend). I have tested locally and it is always coming and I don't have similar issue.
The other interesting thing is that in a working cases when we wake up we have the following ipc sequence: ipc rx: 0x70000000: FW_READY ipc tx: 0x90030000: GLB_TRACE_MSG: DMA_PARAMS_EXT ipc tx: 0x40020000: GLB_PM_MSG: CTX_RESTORE ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION ... GLB_TPLG_MSG messages
In the failing logs: ipc rx: 0x70000000: FW_READY ipc tx: 0x90030000: GLB_TRACE_MSG: DMA_PARAMS_EXT ipc tx: 0x30100000: GLB_TPLG_MSG: PIPE_NEW ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION ... GLB_TPLG_MSG messages ... ipc tx: 0x40020000: GLB_PM_MSG: CTX_RESTORE
I don't know why we are missing the trigger start on the failing case, but can that be part of the puzzle? The resume sequence also of concern, but I can not figure out why it is like that.
@ujfalusi The issue is likely because the suspend is invoked before the deferred resume work is completed. The suspend procedure is to perform freezer at first, then process driver's suspend callbacks. And the problem is at the freezer stage, where the task seems recognized as unfreezable. So my test patch above tries to finish the deferred resume work before the freezer stage.
On the second thought, the primary problem is that the aplay task is recognized as unfreezable, although it's sticking in schedule(). The likely cause is that it's an UNINTERRUPTIBLE sleep, and there is a good reason for that; we don't want to allow user-space proceeding before the card's resume finishes.
And I looked through the code and found a possible (but hackish) workaround using lock_system_sleep(). Another test patch is below. It's less-intrusive, so this might be taken more easily, once after we confirm that it works. snd-power-lock-ignore-freezer.diff.txt
Scratch my previous patch. On the third thought, it may cause another problem; namely, when a deferred resume finishes at snd_soc_suspend(), it wakes up the sleeping task in snd_power_wait*() unexpectedly after the freezer phase. Although the wake-up might be not processed yet at that point, it's a flaky operation...
So syncing with PM notifier might be the way to go in the end.
I tried you patch @tiwai but something doesn't seem right, the flush doesn't complete
[ 49.167567] cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: requesting state 512 in snd_power_change_state
[ 49.167578] cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: waking up power_sleep in snd_power_change_state
[ 49.167584] cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: done in snd_power_change_state
[ 49.220919] rt5682 i2c-10EC5682:00: div[0]=1
[ 49.220931] rt5682 i2c-10EC5682:00: div[1]=2
[ 49.295851] OOM killer enabled.
[ 49.295858] Restarting tasks ... done.
[ 49.301875] cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_ref_and_wait: starting wait, power state 512
[ 49.303735] plb: soc_resume_notify: start, action 4
[ 49.303739] plb: soc_resume_notify: done
[ 49.304002] PM: suspend exit
[ 49.308326] rfkill: input handler enabled
[ 49.379127] rfkill: input handler disabled
[ 49.664720] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 49.664777] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 50.164387] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 50.164437] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 52.317621] ax88179_178a 2-2.3:1.0 enx000ec66a7f60: ax88179 - Link status is: 1
[ 54.853000] PM: suspend entry (s2idle)
[ 54.854956] Filesystems sync: 0.001 seconds
[ 54.865312] plb: soc_resume_notify: start, action 3
[ 54.865328] plb: soc_resume_notify: before flush <<< STUCK PAST THIS POINT
I try adding more traces but it's just awful to debug this remotely, I have to get a local device to debug this further.
OK, then could you try to replace flush_work(&card->deferred_resume_work) with snd_power_wait(card->snd_card) in the patch? soc-deferred-resume-notify-v2.diff.txt
Not much better, the traces I added show the resume does not complete. it seems like we have a DAPM operation that is stuck.
pr_warn("plb: %s: 7\n", __func__); << last trace seen
/* Recheck all endpoints too, their state is affected by suspend */
dapm_mark_endpoints_dirty(card);
pr_warn("plb: %s: 7.5\n", __func__);
snd_soc_dapm_sync(&card->dapm);
pr_warn("plb: %s: 8\n", __func__); <<< not reached.
I added one more trace to see what happens
https://sof-ci.01.org/linuxpr/PR3374/build7007/devicetest/?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-playback-5 shows that snd_soc_dapm_sync() never completes.
[ 1307.727716] kernel: plb: soc_resume_deferred: start
[ 1307.727721] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: requesting state 512 in snd_power_change_state
[ 1307.727725] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: waking up power_sleep in snd_power_change_state
[ 1307.727729] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: done in snd_power_change_state
[ 1307.727731] kernel: plb: soc_resume_deferred: 1
[ 1307.727733] kernel: plb: soc_resume_deferred: 2
[ 1307.768488] kernel: plb: soc_resume_deferred: 3
[ 1307.768848] kernel: plb: soc_resume_deferred: 4
[ 1307.768853] kernel: plb: soc_resume_deferred: 5
[ 1307.768855] kernel: plb: soc_resume_deferred: 6
[ 1307.768856] kernel: plb: soc_resume_deferred: 7
[ 1307.768878] kernel: plb: soc_resume_deferred: 7.5 <<< not completing.
[ 1307.770925] kernel: rt5682 i2c-10EC5682:00: div[0]=1
[ 1307.770933] kernel: rt5682 i2c-10EC5682:00: div[1]=2
[ 1308.011582] kernel: OOM killer enabled.
[ 1308.011586] kernel: Restarting tasks ... done.
[ 1308.018086] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_ref_and_wait: starting wait, power state 512
[ 1308.018864] kernel: plb: soc_resume_notify: start, action 4
[ 1308.018868] kernel: plb: soc_resume_notify: done
[ 1308.018928] kernel: PM: suspend exit
[ 1308.032658] kernel: rfkill: input handler enabled
[ 1308.090662] kernel: rfkill: input handler disabled
[ 1308.225192] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1308.225231] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1308.725171] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1308.725212] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1311.001966] kernel: ax88179_178a 2-2.3:1.0 enx000ec66a7cea: ax88179 - Link status is: 1
[ 1313.252726] kernel: PM: suspend entry (s2idle)
[ 1313.255093] kernel: Filesystems sync: 0.002 seconds
[ 1313.266913] kernel: plb: soc_resume_notify: start, action 3
[ 1313.266923] kernel: plb: soc_resume_notify: before snd_power_wait
[ 1313.266930] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_ref_and_wait: starting wait, power state 512
[ 1313.281005] kernel: rfkill: input handler enabled
[ 2801.789901] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_ref_and_wait: starting wait, power state 512
I have no idea why this might be the case.
Hrm, that's the place using async execution, and it might be the cause.
Then, let's try go back to another approach I scratched. Below is a revised patch, adding a workaround for the resume-in-suspend problem. snd-power-lock-ignore-freezer-v2.diff.txt
I guess this will hit the similar stall, though. That is, the main cause might be rather the DAPM sync stall in the deferred resume.
@tiwai, @plbossart, this might or might not help to see the DAPM sequence?
diff --git a/sound/soc/soc-core.c b/sound/soc/soc-core.c
index 434e61b46983..099234a26fd4 100644
--- a/sound/soc/soc-core.c
+++ b/sound/soc/soc-core.c
@@ -2312,6 +2312,7 @@ int snd_soc_register_card(struct snd_soc_card *card)
INIT_LIST_HEAD(&card->dobj_list);
card->instantiated = 0;
+ card->pop_time = 1;
mutex_init(&card->mutex);
mutex_init(&card->dapm_mutex);
mutex_init(&card->pcm_mutex);
Hrm, that's the place using async execution, and it might be the cause.
Then, let's try go back to another approach I scratched. Below is a revised patch, adding a workaround for the resume-in-suspend problem. snd-power-lock-ignore-freezer-v2.diff.txt
I guess this will hit the similar stall, though. That is, the main cause might be rather the DAPM sync stall in the deferred resume.
I'll give it a try, but I wonder if the patch fixes the wrong problem? the issue is try to suspend while resume is not complete. it's not a 'resume-in-suspend' case, or maybe I misunderstood your wording.
At least we can see whether it's in the core code or not. This patch will make the freezer ignoring those aplay process in the power-check loop, and proceed further. If the things work with that, indeed that's an issue about the process handling. OTOH, if the problem persists (it's more likely, I guess), it implies that the primary cause is rather the device itself that doesn't finish the resume properly at the second time.
Would the device resume properly at the second time if you don't go to suspend?
it looks like the DAPM issue remains, it looks like no matter what we try we end-up not finishing the resume before we suspend.
[ 1076.123368] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x40020000: GLB_PM_MSG: CTX_RESTORE
[ 1076.123424] kernel: plb: soc_resume_deferred: start
[ 1076.123429] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: requesting state 512 in snd_power_change_state
[ 1076.123433] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: waking up power_sleep in snd_power_change_state
[ 1076.123436] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_change_state: done in snd_power_change_state
[ 1076.123439] kernel: plb: soc_resume_deferred: 1
[ 1076.123440] kernel: plb: soc_resume_deferred: 2
[ 1076.124692] kernel: OOM killer enabled.
[ 1076.124694] kernel: Restarting tasks ...
[ 1076.127252] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_ref_and_wait: starting wait, power state 512
[ 1076.129039] kernel: done.
[ 1076.131905] kernel: PM: suspend exit
[ 1076.142481] kernel: rfkill: input handler enabled
[ 1076.156187] kernel: rfkill: input handler disabled
[ 1076.165710] kernel: plb: soc_resume_deferred: 3
[ 1076.165932] kernel: plb: soc_resume_deferred: 4
[ 1076.165936] kernel: plb: soc_resume_deferred: 5
[ 1076.165938] kernel: plb: soc_resume_deferred: 6
[ 1076.165939] kernel: plb: soc_resume_deferred: 7
[ 1076.165956] kernel: plb: soc_resume_deferred: 7.5
[ 1076.166977] kernel: rt5682 i2c-10EC5682:00: div[0]=1
[ 1076.166982] kernel: rt5682 i2c-10EC5682:00: div[1]=2
[ 1076.616650] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1076.616694] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1077.116984] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1077.117027] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG: DMA_POSITION
[ 1081.363503] kernel: PM: suspend entry (s2idle)
[ 1081.363509] kernel: PM: suspend exit
With more traces, the resume can be shown as getting stuck in dapm_seq_run()
pr_warn("plb: %s: 10\n", __func__); <<< reached
/* Now power up. */
dapm_seq_run(card, &up_list, event, true);
pr_warn("plb: %s: 11\n", __func__); <<< not reached
[ 1301.115225] kernel: plb: dapm_power_widgets: 9
[ 1301.115232] kernel: plb: dapm_power_widgets: 10
[ 1301.115648] kernel: rt5682 i2c-10EC5682:00: div[0]=1
[ 1301.115660] kernel: rt5682 i2c-10EC5682:00: div[1]=2
[ 1301.326013] kernel: OOM killer enabled.
[ 1301.326019] kernel: Restarting tasks ...
[ 1301.329894] kernel: cml_rt1011_rt5682 cml_rt1011_rt5682: plb: snd_power_ref_and_wait: starting wait, power state 512
[ 1301.331519] kernel: done.
[ 1301.333441] kernel: PM: suspend exit
Curiously with DAPM pop debugging (pop_time = 1) the CI passes, but we have different fails (which might be unrelated FS corruption?)
@ujfalusi we've seen those 'hung kernel tasks' errors before, I think they appeared in the last kernel cycle - I filed a new bug since it's not tracked for some reason by our CI folks https://github.com/thesofproject/linux/issues/3387
@ujfalusi for my education, what does pop_time do? How would it change the DAPM behavior?
@ujfalusi for my education, what does pop_time do? How would it change the DAPM behavior?
@plbossart, it does two things, unfortunately,
With pop_time = 1
we add 1 jiffies long uninterruptible sleep essentially, which can interfere for sure. We could add a hack patch to skip the sleep if pop_sleep is 1.
Let me try that...
@plbossart, now it looks more promising (information wise): https://sof-ci.01.org/linuxpr/PR3384/build7031/devicetest/?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-playback-5
The last dapm event on resume before the suspend is:
[ 1304.376201] kernel: rt5682 i2c-10EC5682:00: pop test : HP Amp PRE_PMU
end we do not get to the POST_PMU side of the widget...
thanks @ujfalusi, could this be a deadlock in the rt5682 driver:
static int rt5682_hp_event(struct snd_soc_dapm_widget *w,
struct snd_kcontrol *kcontrol, int event)
{
struct snd_soc_component *component =
snd_soc_dapm_to_component(w->dapm);
struct rt5682_priv *rt5682 = snd_soc_component_get_drvdata(component);
switch (event) {
case SND_SOC_DAPM_PRE_PMU:
snd_soc_component_update_bits(component, RT5682_HP_CTRL_2,
RT5682_HP_C2_DAC_AMP_MUTE, 0);
snd_soc_component_update_bits(component, RT5682_HP_LOGIC_CTRL_2,
RT5682_HP_LC2_SIG_SOUR2_MASK, RT5682_HP_LC2_SIG_SOUR2_REG);
snd_soc_component_update_bits(component,
RT5682_DEPOP_1, 0x60, 0x60);
snd_soc_component_update_bits(component,
RT5682_DAC_ADC_DIG_VOL1, 0x00c0, 0x0080);
mutex_lock(&rt5682->jdet_mutex); <<< STUCK HERE?
snd_soc_component_update_bits(component, RT5682_HP_CTRL_2,
RT5682_HP_C2_DAC_L_EN | RT5682_HP_C2_DAC_R_EN,
RT5682_HP_C2_DAC_L_EN | RT5682_HP_C2_DAC_R_EN);
usleep_range(5000, 10000);
snd_soc_component_update_bits(component, RT5682_CHARGE_PUMP_1,
RT5682_CP_SW_SIZE_MASK, RT5682_CP_SW_SIZE_L);
mutex_unlock(&rt5682->jdet_mutex);
break;
That's my suspect, too. You can try like this one. rt5682-jack-mutex-test.diff.txt
The work does not have mutex used, but it is certainly there. I have pushed a patch to track the mutex uses in the rt5682 driver.
From the logs:
[ 35.396923] kernel: LDP: rt5682_jack_detect_handler (sound/soc/codecs/rt5682.c, 1105): ENTER (jdet_mutex)
[ 35.396928] kernel: LDP: rt5682_jack_detect_handler (sound/soc/codecs/rt5682.c, 1107): ENTER (calibrate_mutex)
...
[ 35.423677] kernel: rt5682 i2c-10EC5682:00: pop test : Queue HP Amp: reg=0xffffffff, 0x1/0x1
[ 35.423681] kernel: rt5682 i2c-10EC5682:00: pop test : HP Amp PRE_PMU
[ 35.424480] kernel: LDP: rt5682_hp_event (sound/soc/codecs/rt5682.c, 1546): ENTER (jdet_mutex)
Trying the patch from @tiwai, next, keeping the prints, but this time they are pr_warn()
The lock on resume is introduced by: 8deb34a90f06374fd26f722c2a79e15160f66be7 ("ASoC: rt5682: fix the wrong jack type detected")
The lock happens this way:
A rt5682_resume() -> mod_delayed_work() for jack_detect_work
B DAPM sequence starts ( DAPM is locked now)
A1. rt5682_jack_detect_handler() scheduled
Takes both jdet_mutex
and calibrate_mutex
Calls in to rt5682_headset_detect() which tries to take DAPM lock, it starts to wait for it as B path took it already.
B1. DAPM sequence reaches teh "HP Amp", rt5682_hp_event() tries to take the jdet_mutex
, but it is locked in A1, so it waits.
Deadlock.
Watching this investigation has been more interesting than some detective shows. However I wonder if there is with hindsight some tool and/or approach that could have found this deadlock faster? lockdep
or something? It would be great to save all the experts involved a lot of time the next time.
@marc-hb, CONFIG_DEBUG_LOCKDEP
for sure would have been pointing to this earlier. Don't we have it enabled for CI?
@tiwai, @plbossart, reverting said commit (8deb34a90f06374fd26f722c2a79e15160f66be7) is solving the issue: https://github.com/thesofproject/linux/pull/3390
Looking at the code, it is not that trivial to fix it with the three locks around, I have tried to drop the jdet_work and mutex and use the rt5682_jack_detect_handler() only, but it caused some race on module removal (the work rescheduled after cancel_delayed_work_sync(), another mutex + a flag might help, but it starts to look ugly.
@ujfalusi Thanks for tracking this. Yes, that looks like a typical ABBA deadlock. I guess we may simply drop jdet_mutex, and replace with dapm mutex?
Also, as seen my fix attempt, cancel_delayed_work_sync() should be outside the mutex that is taken by the worker itself. Otherwise the call would be another cause of deadlock.
@marc-hb,
CONFIG_DEBUG_LOCKDEP
for sure would have been pointing to this earlier. Don't we have it enabled for CI?
we have this enabled:
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
# CONFIG_DEBUG_LOCKDEP is not set
The latter says 'Lock dependency engine debugging', I am not sure if this would help. We want to detect the lock, not detect issues in the locking detection. It's straightforward to add if needed.
Adding Realtek team; @oder-chiou @shumingfan @jack-cy-yu and @bardliao FYI.
@tiwai, @plbossart, it looks like I have foujd a way to fix it w/o revert: #3392
Would you take a look if you can spot if I'm doing it wrong?
@ujfalusi LGTM
I see that #3392 was merged on Jan 28 but with kernel 5.17.0-rc1-daily-default-20220202-0-13669-gbb45f689fa62
I found this DEBUG_LOCKS_WARN_ON(lock->magic != lock)
by chance in the logs while looking for something else:
Feb 03 05:37:33 jf-tglu-volt-sdw-01 /usr/lib/gdm3/gdm-x-session[624]: (II) xfree86: Adding drm device (/dev/dri/card0)
Feb 03 05:37:33 jf-tglu-volt-sdw-01 /usr/lib/gdm3/gdm-x-session[624]: (II) systemd-logind: got fd for /dev/dri/card0 226:0 fd 12 paused 0
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: warn: topology ABI is more recent than kernel
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: sof_sdw sof_sdw: ASoC: Parent card not yet available, widget card binding deferred
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: sof_sdw sof_sdw: hda_dsp_hdmi_build_controls: no PCM in topology for HDMI converter 4
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: sof_sdw sof_sdw: hda_dsp_hdmi_build_controls: no PCM in topology for HDMI converter 5
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: sof_sdw sof_sdw: hda_dsp_hdmi_build_controls: no PCM in topology for HDMI converter 6
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: sof_sdw sof_sdw: hda_dsp_hdmi_build_controls: no PCM in topology for HDMI converter 7
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: sof_sdw sof_sdw: hda_dsp_hdmi_build_controls: no PCM in topology for HDMI converter 8
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: ------------[ cut here ]------------
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: DEBUG_LOCKS_WARN_ON(lock->magic != lock)
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: WARNING: CPU: 3 PID: 186 at kernel/locking/mutex.c:579 __mutex_lock+0x8ee/0x910
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: Modules linked in: fuse snd_soc_sof_sdw snd_soc_intel_hda_dsp_common snd_soc_intel_sof_maxim_common snd_usb_audio snd_usbmidi_lib snd_h>
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: int340x_thermal_zone int3400_thermal acpi_thermal_rel intel_pmc_core squashfs drm drm_panel_orientation_quirks efivarfs spi_pxa2xx_pla>
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: CPU: 3 PID: 186 Comm: kworker/3:2 Not tainted 5.17.0-rc1-daily-default-20220202-0-13669-gbb45f689fa62 #2
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: Hardware name: Google Volteer/Volteer, BIOS 06/30/2020
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: Workqueue: events_power_efficient rt5682_jack_detect_handler [snd_soc_rt5682]
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: RIP: 0010:__mutex_lock+0x8ee/0x910
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: Code: c0 0f 84 a0 f7 ff ff 44 8b 1d d6 fe e4 00 45 85 db 0f 85 90 f7 ff ff 48 c7 c6 77 ee fa a4 48 c7 c7 d2 b4 f9 a4 e8 d5 71 f9 ff <0f>
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: RSP: 0018:ffffa0db80cffd70 EFLAGS: 00010282
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: RDX: 0000000000000001 RSI: ffffffffa4fbb461 RDI: 00000000ffffffff
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: RBP: ffffa0db80cffe10 R08: 0000000000000000 R09: ffffa0db80cffb98
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: R13: ffff953f51f61c40 R14: 0000000000000000 R15: ffff9542af9efd05
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: FS: 0000000000000000(0000) GS:ffff9542af800000(0000) knlGS:0000000000000000
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: CR2: 00007fd326f2a220 CR3: 0000000105162005 CR4: 0000000000770ee0
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: PKRU: 55555554
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: Call Trace:
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: <TASK>
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: ? sched_clock_cpu+0xd/0xd0
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: ? rt5682_jack_detect_handler+0x84/0x2e0 [snd_soc_rt5682]
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: ? rt5682_jack_detect_handler+0x84/0x2e0 [snd_soc_rt5682]
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: rt5682_jack_detect_handler+0x84/0x2e0 [snd_soc_rt5682]
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: process_one_work+0x297/0x530
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: worker_thread+0x48/0x3c0
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: ? process_one_work+0x530/0x530
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: kthread+0xeb/0x120
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: ? kthread_complete_and_exit+0x20/0x20
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: ret_from_fork+0x1f/0x30
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: </TASK>
Feb 03 05:37:33 jf-tglu-volt-sdw-01 kernel: ---[ end trace 0000000000000000 ]---
Feb 03 05:37:33 jf-tglu-volt-sdw-01 /usr/lib/gdm3/gdm-x-session[624]: (--) PCI:*(0@0:2:0) 8086:9a49:8086:9a49 rev 0, Mem @ 0x90000000/16777216, 0x80000000/268435456, I/O @ 0x00001>
Feb 03 05:37:33 jf-tglu-volt-sdw-01 /usr/lib/gdm3/gdm-x-session[624]: (II) LoadModule: "glx"
Feb 03 05:37:33 jf-tglu-volt-sdw-01 /usr/lib/gdm3/gdm-x-session[624]: (II) Loading /usr/lib/xorg/modules/extensions/libglx.so
Observed twice in a row: first on
jf-cml-hel-rt5682-01
then onjf-cml-hel-rt5682-02
https://sof-ci.01.org/sofpr/PR5207/build11616/devicetest/?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-playback-5 https://sof-ci.01.org/sofpr/PR5207/build11618/devicetest/?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-playback-5 https://sof-ci.01.org/sofpr/PR5227/build11700/devicetest/?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-playback-5 https://sof-ci.01.org/sofpr/PR5202/build11725/devicetest/?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-playback-5
check-suspend-resume-with-capture-5.sh passed both times.
Similar issues: #2433 #2561
Start Time: 2022-01-14 02:20:28 UTC End Time: 2022-01-14 02:56:57 UTC Kernel Commit: ba28f9bb SOF Commit: 6b8b1cee16bc