Open · andyross opened this issue 10 months ago
@andyross it looks like a driver problem, fixed here: https://github.com/thesofproject/linux/pull/4705. A core was turned off too soon and IDC got stuck. Do you have this fix in your kernel?
That sounds promising. As mentioned, I'm travelling this week and can't test. @udaymb , can you validate against the 6.1 device kernel tree?
Digression: though I worry a little about that architecture: the host will preemptively turn off any cores that aren't referenced by the currently open pipelines? What if the device has a workload on one of them that isn't pipeline-related? How does it NAK the poweroff? (This gets back to the PM architecture arguments, but it's worth pointing out that this is exactly the kind of problem that lazy/idle-based PM is designed to solve.)
@andyross - I have verified with the latest 6.1 kernel and the issue is not observed. It contains the merged https://github.com/thesofproject/linux/pull/4705 patches: https://chromium.googlesource.com/chromiumos/third_party/kernel/+log/27a5b5b8a65b4502aa66bb6d1ef5c41bf332420e
> Digression: though I worry a little about that architecture: the host will preemptively turn off any cores that aren't referenced by the currently open pipelines? What if the device has a workload on one of them that isn't pipeline-related? How does it NAK the poweroff? (This gets back to the PM architecture arguments, but it's worth pointing out that this is exactly the kind of problem that lazy/idle-based PM is designed to solve.)
Ack, the host actually can flip the PM bits IIUC, but this should be done with permission from the RTOS. Conversely, the RTOS should be able to request a core power-down from the host. @andyross, how does this fit in with Zephyr workloads? I would imagine Zephyr knows whether work is running, pending, or done and can therefore work with the host on powering down?
> @andyross, how does this fit in with Zephyr workloads? I would imagine Zephyr knows whether work is running, pending, or done and can therefore work with the host on powering down?
Sort of. "There is work available" is just isomorphic to "the idle thread is not running". That's the basic idea behind Zephyr runtime PM: when we reach idle we call a function to see what to do (e.g. just WAITI, or call into platform suspend code, etc...)
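To illustrate that idle-driven pattern, here's a minimal C sketch. The names (`pm_idle_decision`, `core_state`, the `pm_action` values) are hypothetical stand-ins for illustration, not the actual Zephyr PM API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-core state for an idle-driven PM decision.
 * In Zephyr the real entry points are the idle thread and the
 * PM subsystem; these names are illustrative only. */
enum pm_action { PM_WAITI, PM_CORE_SUSPEND };

struct core_state {
	bool runnable_work;   /* any thread ready to run? */
	bool host_allows_off; /* host has released this core */
};

/* Called from the idle loop: decide how deep to sleep.
 * Reaching here at all means there is no runnable work
 * ("work available" is isomorphic to "idle not running"). */
enum pm_action pm_idle_decision(const struct core_state *c)
{
	/* Only power the core fully down if the host agrees;
	 * otherwise do a light sleep and wake on interrupt. */
	if (c->host_allows_off)
		return PM_CORE_SUSPEND;
	return PM_WAITI;
}
```

The key property is that the decision is made lazily at the idle point, rather than by an external actor guessing whether the core is needed.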
There's more complexity in the case of non-runnable pinned threads though. You might have a thread that's merely sleeping, but pinned to a core that's about to shut down. The OS can in theory check that, but even then that might not be right. The waiting thread might be e.g. the DP scheduler worker, which is waiting on a semaphore that won't be given until the pipeline needs it. So there has to be some application-level intelligence here.
The way other OSes do this is generally with a ref-counted wakelock or somesuch: the app says "I need CPU2" while it has something it knows will need the core, and that bumps a count. The suspend code only does its thing when the count reaches zero.
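A minimal C sketch of that ref-counted wakelock idea (all names here are hypothetical, not an existing SOF/Zephyr API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical ref-counted wakelock: the app bumps the count
 * while it knows something will need the core; the suspend
 * path only proceeds when the count reaches zero. */
struct wakelock {
	atomic_int refs;
};

static void wakelock_get(struct wakelock *wl)
{
	atomic_fetch_add(&wl->refs, 1); /* "I need this core" */
}

static void wakelock_put(struct wakelock *wl)
{
	atomic_fetch_sub(&wl->refs, 1); /* done with the core */
}

/* Suspend code checks this before powering the core down. */
static bool wakelock_core_may_suspend(struct wakelock *wl)
{
	return atomic_load(&wl->refs) == 0;
}
```

This covers the pinned-but-sleeping case above: the DP scheduler worker would hold the lock while its semaphore wait is outstanding, even though it isn't runnable.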
The complication here is that the host is involved. So it needs to see a cooked version of that count, such that it can bring a core up if it knows a pipeline needs it (and if it isn't already running!), but only turn it off when the OS reports done. Maybe a bitfield near the FW_STATUS word of per-core bits that indicate "I have work, don't turn me off", and a quick poll in the ISR that checks for cores that can be shut down?
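As a rough sketch of that handshake, assuming a hypothetical shared word of per-core busy bits (the layout and names are invented for illustration, not the real FW_STATUS interface):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-core "I have work, don't turn me off"
 * bitfield, imagined living next to FW_STATUS where the host
 * can poll it.  Bit N set means core N still has work. */
static atomic_uint core_busy_bits;

/* Firmware side: each core publishes its cooked busy state. */
static void core_report_busy(unsigned int core, bool busy)
{
	if (busy)
		atomic_fetch_or(&core_busy_bits, 1u << core);
	else
		atomic_fetch_and(&core_busy_bits, ~(1u << core));
}

/* Host side (or a quick poll in the ISR): of the cores no
 * pipeline references anymore, which report no remaining
 * work and can therefore be shut down? */
static uint32_t cores_safe_to_power_off(uint32_t unreferenced_cores)
{
	return unreferenced_cores & ~atomic_load(&core_busy_bits);
}
```

The host can still bring a core up when a pipeline needs it, but it only powers one down once the OS has cleared that core's bit.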
Pipelines referencing multi-core components don't seem to work right. Pick a topology with a DP component (c.f. the `mtl-007-drop-stable` branch) and start a capture stream (e.g. `arecord -Dhw:0,27 -c4 -f s32_le -r 48000 test.wav`). It works fine. Now kill the arecord process. The kernel reports an IPC timeout and fails to recover (in fact the module is wedged and won't unload, requiring a system reboot):

Adding some tracing of my own, what's happening is that the firmware is handling an IPC delete pipeline command:

And in `ipc_pipeline_module_free()`, it recognizes that a component is assigned to a different core, and calls `ipc_comp_free_remote()` (instead of `ipc_comp_free()`). That function is implemented in terms of `idc_send_msg()` with an `IDC_BLOCKING` flag, and it apparently spins forever waiting on a reply from the other core that doesn't arrive.

Note that the spinning seems to prevent the log thread from running, so logs won't in general flush with this bug. I had to use the `acetool.py` script from upstream Zephyr with `CONFIG_LOG_PRINTK=n` to force synchronous printk() trace output into the winstream buffer instead.
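For contrast with the forever-spin described above, here's a toy sketch of a bounded blocking wait. `idc_send_msg_timeout`, the cycle budget, and the reply simulation are all hypothetical, not the real SOF `idc_send_msg()` implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical poll budget before giving up on the remote core. */
#define IDC_TIMEOUT_CYCLES 100000u

/* Toy model of a blocking IDC send with a bounded wait.
 * reply_after simulates how many polls pass before the remote
 * core answers; UINT32_MAX models a core that never replies
 * (e.g. because it was powered off under us, as in this bug).
 * Returns 0 on reply, -1 on timeout. */
static int idc_send_msg_timeout(uint32_t reply_after)
{
	for (uint32_t waited = 0; waited < IDC_TIMEOUT_CYCLES; waited++) {
		if (waited >= reply_after)
			return 0; /* remote core answered */
	}
	return -1; /* no reply: surface an error, don't wedge */
}
```

Returning an error after a bounded wait would at least let the firmware fail the IPC cleanly (and let the log thread run) instead of wedging the module until a reboot.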