[BUG] IDC hang stopping DP capture pipeline

andyross commented 10 months ago

Pipelines reference multi-core components don't seem to work right. Pick a topology with a DP component (c.f. the mtl-007-drop-stable branch), start a capture stream (e.g. arecord -Dhw:0,27 -c4 -f s32_le -r 48000 test.wav). It works fine.

Now kill the arecord process. The kernel reports an IPC timeout and fails to recover (in fact the module is wedged and won't unload, requiring a system reboot):

[   44.561043] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x12010000|0x0
[   44.561071] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump start ]------------
[   44.561104] sof-audio-pci-intel-mtl 0000:00:1f.3: Host IPC initiator: 0x92010000|0x0|0x0, target: 0x1b060000|0x0|0x0, ctl: 0x3
[   44.561114] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ IPC dump end ]------------
[   44.561121] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[   44.561126] sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
[   44.561132] sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[   44.561144] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM status: 0x5, ROM error: 0x0
[   44.561151] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM debug status: 0x50000005, ROM debug error: 0x0
[   44.561159] sof-audio-pci-intel-mtl 0000:00:1f.3: ROM feature bit not enabled
[   44.561165] sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[   44.561180] sof-audio-pci-intel-mtl 0000:00:1f.3: failed to free pipeline widget pipeline.18
[   45.065031] sof-audio-pci-intel-mtl 0000:00:1f.3: ipc timed out for 0x12020000|0x0
[   45.065061] sof-audio-pci-intel-mtl 0000:00:1f.3: IPC timeout
[   45.065080] sof-audio-pci-intel-mtl 0000:00:1f.3: failed to free pipeline widget pipeline.19
[   45.065094] sof-audio-pci-intel-mtl 0000:00:1f.3: Failed to free connected widgets
[   45.065122] sof-audio-pci-intel-mtl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_hw_free on 0000:00:1f.3: -110

Adding some tracing of my own, what's happening is that the firmware is handling an IPC delete pipeline command:

ipc4_process_glb_message()
 ipc4_delete_pipeline()
  ipc_pipeline_free()
   ipc_pipeline_module_free()

And in ipc_pipeline_module_free(), it recognizes that a component is assigned to a different core, and calls ipc_comp_free_remote() (instead of ipc_comp_free()). And function is implemented in terms of idc_send_msg() with and IDC_BLOCKING flag. And it apparently spins forever waiting on a reply from the other core that doesn't arrive.

Note that the spinning seems to prevent the log thread from running, so logs won't in general flush with this bug. I had to use the acetool.py script from upstream Zephyr and CONFIG_LOG_PRINTK=n to force synchronous printk() trace output into the winstream buffer instead.

marcinszkudlinski commented 10 months ago

@andyross it looks like a driver problem fixed here https://github.com/thesofproject/linux/pull/4705 a core was turned off too soon and IDC got stuck do you have this fix in kernel?

andyross commented 10 months ago

That sounds promising. As mentioned, I'm travelling this week and can't test. @udaymb , can you validate against the 6.1 device kernel tree?

andyross commented 10 months ago

Digression: Though I worry a little about that architecture: the host will preemptively turn off any cores that aren't referenced by the currently open pipelines? What if the device has a workload on one of them that isn't pipeline-related? How does it NAK the poweroff? (This gets back to the PM architecture arguments, but it's worth pointing out that this exactly the kind of problem that lazy/idle-based PM is designed to solve)

udaymb commented 10 months ago

@andyross - I have verified with the latest 6.1 kernel and issue is not observed - it contains https://github.com/thesofproject/linux/pull/4705 patches that are merged - https://chromium.googlesource.com/chromiumos/third_party/kernel/+log/27a5b5b8a65b4502aa66bb6d1ef5c41bf332420e

lgirdwood commented 10 months ago

Digression: Though I worry a little about that architecture: the host will preemptively turn off any cores that aren't referenced by the currently open pipelines? What if the device has a workload on one of them that isn't pipeline-related? How does it NAK the poweroff? (This gets back to the PM architecture arguments, but it's worth pointing out that this exactly the kind of problem that lazy/idle-based PM is designed to solve)

Ack, the host actually can flip the PM bits IIUC, but this should be done with permission from the RTOS. Conversly the RTOS should be able to request a core power down from the host, @andyross how does this fit in with Zephyr workloads ? I would imagine Zephyr knows whether work is running, pending or done and can therefore work with the host re powering down ?

andyross commented 10 months ago

@andyross how does this fit in with Zephyr workloads ? I would imagine Zephyr knows whether work is running, pending or done and can therefore work with the host re powering down ?

Sort of. "There is work available" is just isomorphic to "the idle thread is not running". That's the basic idea behind Zephyr runtime PM: when we reach idle we call a function to see what to do (e.g. just WAITI, or call into platform suspend code, etc...)

There's more complexity in the case of non-runnable pinned threads though. You might have a thread that's merely sleeping, but pinned to a core that's about to shut down. The OS can in theory check that, but even then that might not be right. The waiting thread might be e.g. the DP scheduler worker, which is waiting on a semaphore that won't be given until the pipeline needs it. So there has to be some application-level intelligence here.

The way other OSes do this is generally with a ref-counted wakelock or somesuch: the app says "I need CPU2" while it has something it knows will need the core, and that bumps a count. The suspend code only does its thing when the count reaches zero.

The complication here is that the host is involved. So it needs to see a cooked version of that count, such that it can bring a core up if it knows a pipeline needs it (and if it isn't already running!), but only turn it off when the OS reports done. Maybe a bitfield near the FW_STATUS word of per-core bits that indicate "I have work, don't turn me off", and a quick poll in the ISR that checks for cores that can be shut down?

thesofproject / sof

[BUG] IDC hang stopping DP capture pipeline #8638