Closed AreYouLoco closed 2 years ago
Cannot reproduce it. Closing for now.
EDIT: It happened again. Re-opening.
Log from jack when hang happens:
Thu Nov 11 22:09:27 2021: Jack: **** alsa_pcm: xrun of at least 0.026 msecs
Thu Nov 11 22:09:27 2021: Jack: Repreparing capture
Thu Nov 11 22:09:27 2021: Jack: Repreparing playback
Thu Nov 11 22:09:27 2021: Jack: ALSA XRun wait_status = 0
Thu Nov 11 22:09:27 2021: Jack: JackRequest::Notification
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:27 2021: Jack: JackClient::ClientNotify ref = 2 name = dbusapi notify = 3
Thu Nov 11 22:09:27 2021: Jack: JackClient::kXRunCallback
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:31 2021: ERROR: ALSA: poll time out, polled for 3998928 usecs, Retrying with a recovery, retry cnt = 1
Thu Nov 11 22:09:35 2021: ERROR: ALSA: poll time out, polled for 3998947 usecs, Retrying with a recovery, retry cnt = 2
Thu Nov 11 22:09:39 2021: ERROR: ALSA: poll time out, polled for 3998958 usecs, Retrying with a recovery, retry cnt = 3
Thu Nov 11 22:09:43 2021: ERROR: ALSA: poll time out, polled for 3998917 usecs, Retrying with a recovery, retry cnt = 4
Thu Nov 11 22:09:47 2021: ERROR: ALSA: poll time out, polled for 3998890 usecs, Retrying with a recovery, retry cnt = 5
Thu Nov 11 22:09:51 2021: ERROR: ALSA: poll time out, polled for 3998890 usecs, Reached max retry cnt = 5, Exiting
Thu Nov 11 22:09:51 2021: ERROR: JackAudioDriver::ProcessAsync: read error, stopping...
Thu Nov 11 22:09:51 2021: Jack: JackPosixThread::ThreadHandler : exit
Thanks for the log.
When you see the issue, is there anything in the system log?
No. I checked dmesg but there was nothing there.
EDIT: I will try to reproduce it and check again; maybe I missed something, but I don't think so.
Following https://www.alsa-project.org/main/index.php/XRUN_Debug, I have a kernel with these options compiled in:
areyouloco@studio:~$ grep -e "CONFIG_SND_PCM_XRUN_DEBUG" -e "CONFIG_SND_VERBOSE_PROCFS" -e "CONFIG_SND_DEBUG" /boot/config-`uname -r`
CONFIG_SND_VERBOSE_PROCFS=y
CONFIG_SND_DEBUG=y
CONFIG_SND_DEBUG_VERBOSE=y
How do I proceed to debug the XRUN? Reading this:
https://www.kernel.org/doc/html/v5.14-rc6/sound/designs/procfile.html#pcm-proc-files
I've set:
# echo 11 > /proc/asound/card0/pcm0p/xrun_debug
and
# echo 11 > /proc/asound/card0/pcm0c/xrun_debug
but for card2, according to /proc/asound/cards, and now I am hunting the bug again.
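The two echo commands above can be wrapped in a small helper that covers every playback (pcm*p) and capture (pcm*c) node of one card. A minimal sketch, assuming the usual /proc/asound layout and root privileges; the function name and the parameterized proc root are mine, for illustration only:

```shell
#!/bin/bash
# Sketch: enable ALSA XRUN debugging for every PCM substream of one card.
# PROC_ROOT is parameterized so the logic can be exercised without root;
# on a real system it would be /proc/asound.

enable_xrun_debug() {
    local proc_root="$1" card="$2" value="$3"
    local node
    # pcm*p = playback substreams, pcm*c = capture substreams
    for node in "$proc_root/$card"/pcm*[pc]; do
        [ -d "$node" ] || continue
        echo "$value" > "$node/xrun_debug"
        echo "set $node/xrun_debug = $value"
    done
}
```

For example, `enable_xrun_debug /proc/asound card2 11` (as root) would apply the value 11 used above to all of card2's substreams.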
I also tried setting a ridiculously low buffer/period size, and sometimes some XRUNs occur, but nothing JACK can't recover from. The specific XRUN I am talking about hasn't happened yet, though! From what I see now it's not related to buffer/period size; it happens randomly after long playback. It will be hard to diagnose.
And now when I try to hunt the bug it doesn't happen :/ But I am sure at some point it will blow up. I'll let it run and report when it happens.
EDIT: I've let it play overnight at 128 samples/3 periods and 0 XRUNs so far. I really don't know when it happens. Maybe it's related to the RT kernel.
I got it! Now the JACK server hung, and what I got in dmesg is:
[ 465.098803] snd_oxfw fw1.0: XRUN: pcmC2D0p:0
[ 465.098805] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G OE 5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[ 465.098806] Hardware name: [...]
[ 465.098807] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[ 465.098811] Call Trace:
[ 465.098812] dump_stack+0x76/0x94
[ 465.098815] __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
[ 465.098823] snd_pcm_update_state+0x107/0x110 [snd_pcm]
[ 465.098830] snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
[ 465.098838] ? __switch_to+0x84/0x440
[ 465.098840] snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[ 465.098847] process_one_work+0x1ce/0x440
[ 465.098850] ? process_one_work+0x440/0x440
[ 465.098853] worker_thread+0x50/0x3b0
[ 465.098856] ? process_one_work+0x440/0x440
[ 465.098858] kthread+0x186/0x1a0
[ 465.098860] ? __kthread_parkme+0xa0/0xa0
[ 465.098862] ret_from_fork+0x22/0x30
Sun Nov 14 11:58:02 2021: Jack: **** alsa_pcm: xrun of at least 0.098 msecs
Sun Nov 14 11:58:02 2021: Jack: Repreparing capture
Sun Nov 14 11:58:02 2021: Jack: Repreparing playback
Sun Nov 14 11:58:02 2021: Jack: ALSA XRun wait_status = 0
Sun Nov 14 11:58:02 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Sun Nov 14 11:58:02 2021: Jack: JackClient::ClientNotify ref = 2 name = dbusapi notify = 3
Sun Nov 14 11:58:02 2021: Jack: JackClient::kXRunCallback
Sun Nov 14 11:58:02 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Sun Nov 14 11:58:02 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Sun Nov 14 11:58:02 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Sun Nov 14 11:58:10 2021: ERROR: ALSA: poll time out, polled for 7999047 usecs, Retrying with a recovery, retry cnt = 1
Sun Nov 14 11:58:18 2021: ERROR: ALSA: poll time out, polled for 7999050 usecs, Retrying with a recovery, retry cnt = 2
Sun Nov 14 11:58:26 2021: ERROR: ALSA: poll time out, polled for 7999065 usecs, Retrying with a recovery, retry cnt = 3
Sun Nov 14 11:58:34 2021: ERROR: ALSA: poll time out, polled for 7999113 usecs, Retrying with a recovery, retry cnt = 4
Sun Nov 14 11:58:42 2021: ERROR: ALSA: poll time out, polled for 7999084 usecs, Retrying with a recovery, retry cnt = 5
Sun Nov 14 11:58:50 2021: ERROR: ALSA: poll time out, polled for 7999049 usecs, Reached max retry cnt = 5, Exiting
Sun Nov 14 11:58:50 2021: ERROR: JackAudioDriver::ProcessAsync: read error, stopping...
Sun Nov 14 11:58:50 2021: Jack: JackPosixThread::ThreadHandler : exit
I have to stop and start JACK to recover from it. Only the first line, saying snd_oxfw fw1.0: XRUN: pcmC2D0p:0, is understandable to me: an XRUN occurred in the playback stream handled by snd_firewire_lib. Does it tell you anything more? Is it a dump of functions and registers?
EDIT: In general the XRUN counter was at 3 XRUNs. Two of them were normal XRUNs due to the low buffer/period settings, and one was the one that makes the JACK process hang. The full log from dmesg is:
[ 370.275999] ALSA: PCM: [Q] Lost interrupts?: (stream=1, delta=408, new_hw_ptr=16326816, old_hw_ptr=16326408)
[ 370.276007] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G OE 5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[ 370.276011] Hardware name: [...]
[ 370.276013] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[ 370.276026] Call Trace:
[ 370.276031] dump_stack+0x76/0x94
[ 370.276040] snd_pcm_update_hw_ptr0.cold+0x78/0xe9 [snd_pcm]
[ 370.276058] snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[ 370.276071] process_one_work+0x1ce/0x440
[ 370.276079] ? process_one_work+0x440/0x440
[ 370.276084] worker_thread+0x50/0x3b0
[ 370.276089] ? process_one_work+0x440/0x440
[ 370.276093] kthread+0x186/0x1a0
[ 370.276096] ? __kthread_parkme+0xa0/0xa0
[ 370.276100] ret_from_fork+0x22/0x30
[ 465.098697] ALSA: PCM: [Q] Lost interrupts?: (stream=0, delta=768, new_hw_ptr=20878080, old_hw_ptr=20877312)
[ 465.098704] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G OE 5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[ 465.098707] Hardware name: [...]
[ 465.098708] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[ 465.098718] Call Trace:
[ 465.098722] dump_stack+0x76/0x94
[ 465.098729] snd_pcm_update_hw_ptr0.cold+0x78/0xe9 [snd_pcm]
[ 465.098740] ? _raw_spin_unlock_irq+0x1d/0x50
[ 465.098744] ? finish_task_switch.isra.0+0xd7/0x310
[ 465.098748] ? __switch_to+0x84/0x440
[ 465.098751] snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[ 465.098760] process_one_work+0x1ce/0x440
[ 465.098764] ? process_one_work+0x440/0x440
[ 465.098767] worker_thread+0x50/0x3b0
[ 465.098769] ? process_one_work+0x440/0x440
[ 465.098772] kthread+0x186/0x1a0
[ 465.098774] ? __kthread_parkme+0xa0/0xa0
[ 465.098776] ret_from_fork+0x22/0x30
[ 465.098803] snd_oxfw fw1.0: XRUN: pcmC2D0p:0
[ 465.098805] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G OE 5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[ 465.098806] Hardware name: [...]
[ 465.098807] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[ 465.098811] Call Trace:
[ 465.098812] dump_stack+0x76/0x94
[ 465.098815] __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
[ 465.098823] snd_pcm_update_state+0x107/0x110 [snd_pcm]
[ 465.098830] snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
[ 465.098838] ? __switch_to+0x84/0x440
[ 465.098840] snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[ 465.098847] process_one_work+0x1ce/0x440
[ 465.098850] ? process_one_work+0x440/0x440
[ 465.098853] worker_thread+0x50/0x3b0
[ 465.098856] ? process_one_work+0x440/0x440
[ 465.098858] kthread+0x186/0x1a0
[ 465.098860] ? __kthread_parkme+0xa0/0xa0
[ 465.098862] ret_from_fork+0x22/0x30
[ 825.780497] ALSA: PCM: [Q] Lost interrupts?: (stream=1, delta=512, new_hw_ptr=512, old_hw_ptr=0)
[ 825.780506] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G OE 5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[ 825.780511] Hardware name: [...]
[ 825.780514] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[ 825.780529] Call Trace:
[ 825.780535] dump_stack+0x76/0x94
[ 825.780546] snd_pcm_update_hw_ptr0.cold+0x78/0xe9 [snd_pcm]
[ 825.780565] ? __switch_to+0x84/0x440
[ 825.780572] snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[ 825.780588] process_one_work+0x1ce/0x440
[ 825.780595] ? process_one_work+0x440/0x440
[ 825.780600] worker_thread+0x50/0x3b0
[ 825.780606] ? process_one_work+0x440/0x440
[ 825.780611] kthread+0x186/0x1a0
[ 825.780615] ? __kthread_parkme+0xa0/0xa0
[ 825.780619] ret_from_fork+0x22/0x30
I've diffed the dumps and it shows the following:
areyouloco@studio:/tmp$ diff one two
6,8c6
< __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
< snd_pcm_update_state+0x107/0x110 [snd_pcm]
< snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
---
> snd_pcm_update_hw_ptr0.cold+0x78/0xe9 [snd_pcm]
So it happened again.
EDIT: And again. I can reliably reproduce it just by letting it play. Same thing in the JACK log and dmesg.
[ 3253.832489] snd_oxfw fw1.0: XRUN: pcmC2D0p:0
[ 3253.832490] CPU: 1 PID: 63 Comm: kworker/1:1H Tainted: G OE 5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[ 3253.832492] Hardware name: [...]
[ 3253.832492] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[ 3253.832497] Call Trace:
[ 3253.832497] dump_stack+0x76/0x94
[ 3253.832500] __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
[ 3253.832508] snd_pcm_update_state+0x107/0x110 [snd_pcm]
[ 3253.832516] snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
[ 3253.832523] ? __switch_to+0x84/0x440
[ 3253.832526] snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[ 3253.832533] process_one_work+0x1ce/0x440
[ 3253.832536] ? process_one_work+0x440/0x440
[ 3253.832538] worker_thread+0x50/0x3b0
[ 3253.832541] ? process_one_work+0x440/0x440
[ 3253.832543] kthread+0x186/0x1a0
[ 3253.832545] ? __kthread_parkme+0xa0/0xa0
[ 3253.832547] ret_from_fork+0x22/0x30
Great, thanks for the debug log.
As far as I can see, it is a usual underrun issue for a PCM playback substream.
Call Trace:
dump_stack+0x76/0x94
__snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
snd_pcm_update_state+0x107/0x110 [snd_pcm]
snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
...
The back trace tells us that the PCM playback substream's buffer has more available (i.e. free) PCM frames than the stop_threshold sw parameter:
int snd_pcm_update_state(struct snd_pcm_substream *substream,
                         struct snd_pcm_runtime *runtime)
{
        snd_pcm_uframes_t avail;

        avail = snd_pcm_avail(substream);
        ....
        if (runtime->status->state == SNDRV_PCM_STATE_DRAINING) {
                ...
        } else {
                if (avail >= runtime->stop_threshold) {
                        __snd_pcm_xrun(substream);
                        return -EPIPE;
                }
        }
        ....
        return 0;
}
We can guess that the jackd process doesn't wake up to put enough PCM frames into the buffer before the period elapsed event is handled. Usually the process should be awakened by the previous period elapsed event before that point.
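The check quoted above can be modeled with plain arithmetic: for a playback substream, avail is the free space in the ring buffer, and with the default stop_threshold equal to buffer_size, a buffer the application has completely failed to refill trips the XRUN path. A rough sketch (function and variable names are mine, not the kernel's):

```shell
#!/bin/bash
# Model of the playback-underrun check in snd_pcm_update_state():
# avail (writable space) = buffer_size - (appl_ptr - hw_ptr),
# and XRUN is declared once avail >= stop_threshold.

playback_state() {
    local buffer_size="$1" stop_threshold="$2" appl_ptr="$3" hw_ptr="$4"
    local avail=$(( buffer_size - (appl_ptr - hw_ptr) ))
    if [ "$avail" -ge "$stop_threshold" ]; then
        echo "XRUN"     # kernel calls __snd_pcm_xrun() and returns -EPIPE
    else
        echo "RUNNING"
    fi
}
```

With the 768-frame delta from the dmesg log above (appl_ptr comfortably ahead of hw_ptr) the model reports RUNNING; once hw_ptr catches up with appl_ptr, avail reaches buffer_size and the state flips to XRUN.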
In your case, to handle PCM frames in the isochronous transmit (IT) context, five contexts are expected to run in the order below:
processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context
hw IRQ context
->IRQ handler is invoked for the IT context
-> tasklet is scheduled for the IT context
soft IRQ context
->tasklet is invoked for the IT context
-> workqueue is scheduled for period elapsed event
workqueue context
->call of snd_pcm_period_elapsed()
->task for jackd is scheduled to wake up
process context
->jackd thread wakes up and fills the buffer with PCM frames
(for next IRQ)
If the order is not kept, an XRUN occurs. At this point, I remember a timing issue reported upstream:
As far as I know, with the -rt patchset, the tasklet context runs in another workqueue context. The above order then becomes:
processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context
hw IRQ context
-> workqueue is scheduled for hw IRQ
workqueue context (A)
->workqueue is invoked for the hw IRQ
-> workqueue is scheduled for the IT context
workqueue context (B)
->workqueue is invoked for the IT context
-> workqueue is scheduled for period elapsed event
workqueue context (C)
->call of snd_pcm_period_elapsed()
->task for jackd is scheduled to wake up
process context
->jackd thread wakes up and fills the buffer with PCM frames
(for next IRQ)
From the task scheduler's point of view, the priority of the three successive workqueue contexts (A, B, and C) is not deterministic, so it's quite possible that the above order is not kept. In the worst case, the next workqueue context (A) runs earlier than the previous workqueue context (C).
Hm. Anyway, I have never heard of a similar issue. May I ask you to run the same test with a non-rt kernel? I'd like to check whether the same issue is reproduced regardless of the -rt patches.
I note that my patch for the ALSA PCM core was merged to v5.14, so the above order is shortened in the stock kernel:
processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context
hw IRQ context
->IRQ handler is invoked for the IT context
-> tasklet is scheduled for the IT context
soft IRQ context
->tasklet is invoked for the IT context
-> call of snd_pcm_period_elapsed()
-> task for jackd is scheduled to wake up
process context
->jackd thread wakes up and fills the buffer with PCM frames
(for next IRQ)
But in the backported driver code, the above is disabled for backward compatibility.
Thanks
Right now I am doing the tests with the stock kernel (5.14.0-4-amd64) and the backported drivers. Next I will try the stock kernel with stock drivers and will post the result. The stock kernel is built without the XRUN debugging options, but I can still tell whether the JACKdbus process hangs or not.
Also, as I stated before, the debug log consists of more than one logged XRUN, and only one of them is the critical one that makes the JACK process hang.
Let me test it for a few days and I will be back with the results.
So I think I figured it out. It seems my script audio-tweaks.sh is somehow making JACK hang. The script is as follows:
#!/bin/bash
sudo modprobe -r ppdev # I don't have a parallel port
sudo modprobe -r lp # I don't use a printer when making music
sudo modprobe -r uvcvideo # I don't use a webcam when making music
sudo modprobe snd-hrtimer # Load the ALSA high res timer for my MIDI stuff
# "open up" the PCI bus by allowing fairly long bursts for all devices, increasing performance
sudo setpci -v -s "*:*.*" latency_timer=b0
# maximise latency for FireWire (IEEE 1394): Ricoh Co Ltd R5C832 PCIe IEEE 1394 Controller (rev 04)
# thus allowing more data per PCI transfer and minimizing xruns
sudo setpci -v -s 0d:00.3 latency_timer=ff
# ditto for the onboard Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Control>
sudo setpci -v -s 00:1b.0 latency_timer=ff
sudo cpufreq-set -g performance -c 0
sudo cpufreq-set -g performance -c 1
When I trigger the script by hand on kernel 5.14.0-4-amd64 with your backported driver, I get the same exact XRUN that hangs the JACKdbus process. Do you have any clue why it is happening? Or do I need the script at all?
EDIT: I am not sure, but I think on my RT kernel it didn't make JACK hang. I am building one more kernel to test and will post results.
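One way to pin down which line of audio-tweaks.sh is responsible is to apply the tweaks one at a time, logging each before running it, so the last logged entry is the prime suspect when the hang reappears. A hedged sketch (the helper name is mine; the commands passed in would be the lines of the script above):

```shell
#!/bin/bash
# Sketch: apply one tweak at a time and record which was last applied,
# so a later hang can be attributed to a single command. The commands
# here are placeholders for the lines of audio-tweaks.sh.

run_tweaks_stepwise() {
    local log="$1"; shift
    local cmd
    for cmd in "$@"; do
        echo "$cmd" >> "$log"   # last line of the log = prime suspect
        eval "$cmd"
    done
}
```

For example: `run_tweaks_stepwise /tmp/tweaks.log "sudo modprobe snd-hrtimer" "sudo setpci -v -s 0d:00.3 latency_timer=ff"`, testing playback between runs.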
I've built a new kernel which is low-latency but not RT (5.12.0-rc8-lowlatency-fw), and it seems to be the best kernel I've tested so far: no hangs in hours of playback. We'll see how it goes, but most likely the bug is only present on RT kernels.
EDIT: I was wrong ;( The exact same XRUN hang just happened on the kernel mentioned above. I omitted the RT patches, so it's a non-realtime kernel, but with threaded IRQs and a 1000 Hz timer. I've built it with the XRUN debug options; we'll see if it's the same XRUN debug log...
[11368.138248] snd_oxfw fw1.0: XRUN: pcmC2D0p:0
[11368.138250] CPU: 0 PID: 130 Comm: kworker/0:1H Tainted: G OE 5.12.0-rc8-lowlatency-fw+ #1
[11368.138253] Hardware name: [...]
[11368.138254] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[11368.138259] Call Trace:
[11368.138261] dump_stack+0x76/0x94
[11368.138264] __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
[11368.138273] snd_pcm_update_state+0x107/0x110 [snd_pcm]
[11368.138281] snd_pcm_update_hw_ptr0+0x213/0x770 [snd_pcm]
[11368.138289] ? __switch_to+0x84/0x480
[11368.138292] snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[11368.138300] process_one_work+0x1e3/0x3b0
[11368.138302] ? process_one_work+0x3b0/0x3b0
[11368.138304] worker_thread+0x50/0x3b0
[11368.138307] ? process_one_work+0x3b0/0x3b0
[11368.138309] kthread+0x142/0x160
[11368.138312] ? kthread_associate_blkcg+0xc0/0xc0
[11368.138315] ret_from_fork+0x22/0x30
So the theory that it happens only on RT kernels seems not to be true.
EDIT: The hang also happened on the stock Debian kernel with your driver. So I think it's specific to the back-ported driver rather than related to the kernel configuration. I tried 3 different kernels with your back-ported driver: stock, low-latency, and real-time; the hang takes a different amount of time to appear on each, but it still happens.
I note that my patch for the ALSA PCM core was merged to v5.14, so the above order is shortened in the stock kernel:
* [git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/commit/sound/core?id=47271b1b98c9](https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/commit/sound/core?id=47271b1b98c9)
processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context
hw IRQ context
->IRQ handler is invoked for the IT context
-> tasklet is scheduled for the IT context
soft IRQ context
->tasklet is invoked for the IT context
-> call of snd_pcm_period_elapsed()
-> task for jackd is scheduled to wake up
process context
->jackd thread wakes up and fills the buffer with PCM frames
(for next IRQ)
But in the backported driver code, the above is disabled for backward compatibility.
How could I enable this feature in the back-ported driver to test it out? Could you make a patch? I am trying the stock kernel module of kernel version 5.14.0-4-amd64, so it should be enabled there, right?
This patch is what you expected:
In Linux kernel 5.13 or before, the kernel symbol snd_pcm_period_elapsed_under_stream_lock() is not found, since it was introduced in 5.14.
For your information, relevant change of ALSA PCM core was done by the patch below:
So I tried the stock Debian kernel (5.14.0-4-amd64) with the stock kernel modules, and the hang happened as well. This kernel produces more XRUNs in general. I will try to reproduce it once more to be sure, and then try combinations of the mentioned commits. Would changing the quirks parameter of the firewire-ohci module make any difference?
EDIT: I reproduced the hang on the stock kernel with stock modules.
parm: quirks:Chip quirks (default = 0, nonatomic cycle timer = 0x1, reset packet generation = 0x2, AR/selfID endianness = 0x4, no 1394a enhancements = 0x8, disable MSI = 0x10, TI SLLZ059 erratum = 0x20, IR wake unreliable = 0x40) (int)
What values should I try? The Debian default is 0x50; I am trying value 0.
In my opinion, it is not a good approach to expand the test conditions to diagnose the situation.
When diagnosing a situation, I think three points are important as methodology:
It's a kind of methodology: unless the above items are fulfilled, the final evaluation is not reliable, and unless we get a reliable evaluation, nothing can be changed for the better.
In your case (and that of most jackd users), the methodology is almost always the same and rigid: removal of the XRUNs found in jackd, by varying the system and runtime configuration. However, this is far from the behaviour needed to fix the system and software runtime, since the current implementation of jackd is neither reference software for digital audio processing nor an ideal ALSA application for modern hardware.
Well, tentatively, I'll let you decide on a system for testing, without any modification, and then gather facts, e.g. logs, for the various demands of diagnosis. The fact should not be 'I still encounter XRUN in the jackd process'. The configuration of, and modifications to, the system and runtime should be made as public as possible, so that the situation can be diagnosed as deeply as possible. I showed the relevant contexts before; they are:
1. processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context
2. hw IRQ context
->IRQ handler is invoked for the IT context
-> tasklet is scheduled for the IT context
3. soft IRQ context
->tasklet is invoked for the IT context
-> workqueue is scheduled for period elapsed event
4. workqueue context
->call of snd_pcm_period_elapsed()
->task for jackd is scheduled to wake up
5. process context
->jackd thread wakes up and fills the buffer with PCM frames
(for next IRQ)
The 1st context is invisible to the system since it is part of the hardware protocol (PCI Express), thus we cannot get any log via the system.
The 2nd context is handled by the processor as a service for system software. The Linux 1394 OHCI controller driver (firewire-ohci kernel module) includes the IRQ handler invoked in this context. You can get a system log by configuring the module's debug parameter.
$ modinfo firewire-ohci
filename: /lib/modules/5.13.0-21-generic/kernel/drivers/firewire/firewire-ohci.ko
...
parm: debug:Verbose logging (default = 0, AT/AR events = 1, self-IDs = 2, IRQs = 4, busReset events = 8, or a combination, or all = -1) (int)
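On a live system this parameter can usually also be changed at runtime through sysfs instead of reloading the module. A minimal sketch, assuming the standard /sys/module layout (note the underscore in firewire_ohci) and root privileges; the sysfs root is parameterized here only so the logic can be exercised in a sandbox:

```shell
#!/bin/bash
# Sketch: set the firewire-ohci "debug" parameter at runtime via sysfs.
# Flags per the modinfo output: AT/AR events = 1, self-IDs = 2,
# IRQs = 4, busReset events = 8, all = -1.

set_ohci_debug() {
    local sys_root="$1" value="$2"
    local node="$sys_root/module/firewire_ohci/parameters/debug"
    [ -e "$node" ] || { echo "parameter node not found: $node" >&2; return 1; }
    echo "$value" > "$node"
}
```

For example, `set_ohci_debug /sys -1` (as root) enables all logging, matching the "all = -1" flag from modinfo.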
The 3rd context is handled by the system, typically at the end of IRQ handling. The ALSA IEC 61883-1/6 packet streaming engine (snd-firewire-lib kernel module) includes the tasklet code that processes isochronous packets. The state of packet processing can be made visible via the Linux Tracepoints framework. The tracepoint event is snd_firewire_lib:amdtp_packet.
The 4th context is handled by a kernel thread for the system high-priority workqueue. The 5th context is the user process. They can be analyzed with the other Tracepoints events.
In my opinion, a comparison between the 2nd and 3rd contexts gives us a perspective on system IRQ invocation at XRUN. A comparison between the 3rd, 4th, and 5th contexts gives us a perspective on task scheduling at XRUN.
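Capturing the snd_firewire_lib:amdtp_packet event can be scripted through the tracefs interface; a sketch assuming the standard tracefs mount (typically /sys/kernel/tracing, root required) and the coreutils timeout command. The helper name and the parameterized tracefs root are mine:

```shell
#!/bin/bash
# Sketch: enable the snd_firewire_lib:amdtp_packet tracepoint and
# capture events to a file for a fixed duration. TRACEFS is
# parameterized; on a real system it is typically /sys/kernel/tracing.

capture_amdtp() {
    local tracefs="$1" out="$2" seconds="$3"
    local ev="$tracefs/events/snd_firewire_lib/amdtp_packet/enable"
    echo 1 > "$ev"                                     # enable the event
    timeout "$seconds" cat "$tracefs/trace_pipe" > "$out"  # drain events
    echo 0 > "$ev"                                     # disable again
}
```

For example, `capture_amdtp /sys/kernel/tracing /tmp/amdtp.log 10`, then inspect the log around the time the XRUN hits.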
Ok. Thanks for the explanation, but I think I will need some more help to get into that methodology.
For now, from what I understood, I've set the debug parameter of firewire-ohci to -1, and I will try to catch the bug and post logs from dmesg. So far dmesg is just filled with:
firewire_ohci 0000:0d:00.3: IRQ 00000040 IT
EDIT: I've caught the bug again and there is nothing in dmesg. Where should I look now? Should I dump the packets when the hang happens? Would the dump tell you something interesting?
The tracepoint event is snd_firewire_lib:amdtp_packet.
Could you help me a bit with the other contexts? Especially the 3rd and 4th mentioned.
For the 3rd context, you already did that (tracepoints in the snd_firewire_lib module). For the 4th context as well (the XRUN debug log enabled via kernel configuration and the procfs node).
So what else could I do to help you debug this? The issue is critical in long sessions.
In my opinion, a good starting point is to test with axfer(1) from alsa-utils. It's a rework of aplay(1) and aims to be a sample program for all of the operations supported by the ALSA PCM interface.
Ok, so I do have it installed, but it requires a WAV-like file to play, and I have no WAV files long enough to capture the problem. I will try to download some mix and convert it to WAV. How do I use axfer transfer playback after all?
EDIT: Or maybe I could capture to a file, and the bug will show up at some point.
For now I use:
$ axfer transfer capture --verbose --rate=48 --file-type=wav --xfer-type=libasound --device=hw:CARD=O1640i,DEV=0 --fatal-errors > capture.wav
Hardware PCM card 2 'Onyx 1640i' device 0 subdevice 0
Its setup is:
stream : CAPTURE
access : RW_INTERLEAVED
format : S32_LE
subformat : STD
channels : 16
rate : 48000
exact rate : 48000 (48000/1)
msbits : 24
buffer_size : 4096
period_size : 1024
period_time : 21333
tstamp_mode : NONE
tstamp_type : MONOTONIC
period_step : 1
avail_min : 1024
period_event : 0
start_threshold : 1
stop_threshold : 4096
silence_threshold: 0
silence_size : 0
boundary : 4611686018427387904
appl_ptr : 0
hw_ptr : 0
Scheduling model:
irq
CAPTURE: Format 'Signed 32 bit Little Endian', Rate 48000 Hz, Channels 16
Fail to process frames: Broken pipe
CAPTURE: Expected 67108863frames, Actual 719904frames
How do I know that the bug occurred? This command requires JACK to be stopped. Do I need to sample all 16 channels, or is 1 enough?
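Since --fatal-errors makes axfer exit on the first XRUN (the "Broken pipe" above is the -EPIPE from the kernel's XRUN path), one way to know the bug occurred is to wrap the capture command and timestamp its exit, so it can be correlated with dmesg later. A sketch (the wrapper is my own helper, not an axfer feature):

```shell
#!/bin/bash
# Run a command (e.g. the axfer capture above with --fatal-errors),
# then record its exit code and the wall-clock time of termination so
# the abort can be matched against dmesg/XRUN debug timestamps.

watch_for_xrun() {
    local log="$1"; shift
    "$@"
    local rc=$?
    echo "command exited rc=$rc at $(date '+%F %T')" >> "$log"
    return "$rc"
}
```

For example: `watch_for_xrun /tmp/xrun.log axfer transfer capture --fatal-errors ...`; a non-zero rc in the log marks the abort.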
access : RW_INTERLEAVED
The mmap/interleaved operation is preferable to emulate the runtime of a practical application, so add the -M or --mmap option.
When the --verbose (-v) option is given several times, the axfer runtime dumps detailed information, so use -vv.
Anyway, I recommend studying the standard runtime of ALSA PCM applications by poring through the axfer-transfer manual several times, especially the SCHEDULING MODEL and DESIGN sections.
I've read the whole axfer manual two times.
$ axfer transfer capture --verbose --verbose --rate=48 --file-type=wav --xfer-type=libasound --device=hw:CARD=O1640i,DEV=0 --mmap capture.wav
attributes for mapped page frame:
sample number: 0
address: 0x7f07b59cd000
bits for offset: 0
bits/frame: 512
sample number: 1
address: 0x7f07b59cd000
bits for offset: 32
bits/frame: 512
sample number: 2
address: 0x7f07b59cd000
bits for offset: 64
bits/frame: 512
sample number: 3
address: 0x7f07b59cd000
bits for offset: 96
bits/frame: 512
sample number: 4
address: 0x7f07b59cd000
bits for offset: 128
bits/frame: 512
sample number: 5
address: 0x7f07b59cd000
bits for offset: 160
bits/frame: 512
sample number: 6
address: 0x7f07b59cd000
bits for offset: 192
bits/frame: 512
sample number: 7
address: 0x7f07b59cd000
bits for offset: 224
bits/frame: 512
sample number: 8
address: 0x7f07b59cd000
bits for offset: 256
bits/frame: 512
sample number: 9
address: 0x7f07b59cd000
bits for offset: 288
bits/frame: 512
sample number: 10
address: 0x7f07b59cd000
bits for offset: 320
bits/frame: 512
sample number: 11
address: 0x7f07b59cd000
bits for offset: 352
bits/frame: 512
sample number: 12
address: 0x7f07b59cd000
bits for offset: 384
bits/frame: 512
sample number: 13
address: 0x7f07b59cd000
bits for offset: 416
bits/frame: 512
sample number: 14
address: 0x7f07b59cd000
bits for offset: 448
bits/frame: 512
sample number: 15
address: 0x7f07b59cd000
bits for offset: 480
bits/frame: 512
Hardware PCM card 2 'Onyx 1640i' device 0 subdevice 0
Its setup is:
stream : CAPTURE
access : MMAP_INTERLEAVED
format : S32_LE
subformat : STD
channels : 16
rate : 48000
exact rate : 48000 (48000/1)
msbits : 24
buffer_size : 4096
period_size : 1024
period_time : 21333
tstamp_mode : NONE
tstamp_type : MONOTONIC
period_step : 1
avail_min : 1024
period_event : 0
start_threshold : 1
stop_threshold : 4096
silence_threshold: 0
silence_size : 0
boundary : 4611686018427387904
appl_ptr : 0
hw_ptr : 0
Scheduling model:
irq
Waiter type:
default
Transfer: libasound
access: MMAP_INTERLEAVED
sample format: S32_LE
bytes/sample: 4
samples/frame: 16
frames/second: 48000
frames/buffer: 4096
Container: builder
format: riff/wave
sample format: S32_LE
bytes/sample: 4
samples/frame: 16
frames/second: 48000
max frames: 67108863
Mapper: demuxer
target: single
access: MMAP_INTERLEAVED
bytes/sample: 4
samples/frame: 16
frames/buffer: 4096
CAPTURE: Format 'Signed 32 bit Little Endian', Rate 48000 Hz, Channels 16
Now it seems to run, and with --fatal-errors it will stop at an XRUN, right? Will the capture.wav be of any use? I cannot really see the advantage of using axfer (yet!).
EDIT: I could somehow make it play some online radio, and when it hangs I will see it, right?
I was not using the hardware for some time, and today it hung again. @takaswie, could you elaborate more on why I should use axfer to hunt this bug?
Hi @takaswie, I just tested for a long time on a somewhat different setup: a desktop computer with a better FireWire controller (chipset XIO2213B), and the hang doesn't happen at all. There are some XRUNs from time to time, but they are unnoticeable. I think we can close this issue and blame the faulty FireWire controller of the Thinkpad T420.
Good to hear it ;)
Hi,
I am suffering from another issue: when playing audio via the PulseAudio bridge, after around 45 min-1 h the JACK instance just hangs, producing exactly one XRUN. The only way to recover is to restart JACK. The issue was present with the stock kernel (5.12.0-rc3) module, and I thought that with your recent patches it was gone, but no: I am using your recent driver and it's affected as well.
First I thought it was a PulseAudio issue. But then I experimented a bit: I killed PulseAudio and played music with Mixxx directly to JACK via the ALSA backend, and the issue still persists.
I'll post logs when the issue pops up...
EDIT: OK, now when I try to debug it, it runs rock solid. But I am sure it will happen after some time.