Mackie Onyx 1640i: JACK instance hangs randomly after long playback.

AreYouLoco commented 2 years ago

Hi,

I am suffering from another issue that is playing audio via PulseAudio bridge and after around 45min-1h the JACK instance just hangs producing exactly once XRUN. The only way to recover is to restart JACK. The issue was present in the stock kernel (5.12.0-rc3) module and I thought with your recent patches that is was gone but no. I am using your recent driver and it's affected as well.

First I thought its PulseAudio issue. But then I experimented a bit and I killed PulseAudio and played music with Mixxx directly to JACK with ALSA backend and the issue still persist.

Posting logs when the issue will pop up...

EDIT: OK Now when I try to debug it runs rock solid. But I am sure it will happen after some time.

AreYouLoco commented 2 years ago

Cannot reproduce it. Closing for now.

EDIT: It happened again. Re-opening.

AreYouLoco commented 2 years ago

Log from jack when hang happens:

Thu Nov 11 22:09:27 2021: Jack: **** alsa_pcm: xrun of at least 0.026 msecs
Thu Nov 11 22:09:27 2021: Jack: Repreparing capture
Thu Nov 11 22:09:27 2021: Jack: Repreparing playback
Thu Nov 11 22:09:27 2021: Jack: ALSA XRun wait_status = 0
Thu Nov 11 22:09:27 2021: Jack: JackRequest::Notification
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:27 2021: Jack: JackClient::ClientNotify ref = 2 name = dbusapi notify = 3
Thu Nov 11 22:09:27 2021: Jack: JackClient::kXRunCallback
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:27 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Thu Nov 11 22:09:31 2021: ERROR: ALSA: poll time out, polled for 3998928 usecs, Retrying with a recovery, retry cnt = 1
Thu Nov 11 22:09:35 2021: ERROR: ALSA: poll time out, polled for 3998947 usecs, Retrying with a recovery, retry cnt = 2
Thu Nov 11 22:09:39 2021: ERROR: ALSA: poll time out, polled for 3998958 usecs, Retrying with a recovery, retry cnt = 3
Thu Nov 11 22:09:43 2021: ERROR: ALSA: poll time out, polled for 3998917 usecs, Retrying with a recovery, retry cnt = 4
Thu Nov 11 22:09:47 2021: ERROR: ALSA: poll time out, polled for 3998890 usecs, Retrying with a recovery, retry cnt = 5
Thu Nov 11 22:09:51 2021: ERROR: ALSA: poll time out, polled for 3998890 usecs, Reached max retry cnt = 5, Exiting
Thu Nov 11 22:09:51 2021: ERROR: JackAudioDriver::ProcessAsync: read error, stopping...
Thu Nov 11 22:09:51 2021: Jack: JackPosixThread::ThreadHandler : exit

takaswie commented 2 years ago

Thanks for the log.

When seeing the issue, you can see something in system log?

AreYouLoco commented 2 years ago

No. I checked dmesg but there was nothing there.

EDIT: Will try to reproduce it and check again maybe I missed something but I don't think so.

Following the: https://www.alsa-project.org/main/index.php/XRUN_Debug I have a kernel with these options compiled in.

areyouloco@studio:~$ grep -e" CONFIG_SND_PCM_XRUN_DEBUG" -e "CONFIG_SND_VERBOSE_PROCFS" -e "CONFIG_SND_DEBUG" /boot/config-`uname -r`
CONFIG_SND_VERBOSE_PROCFS=y
CONFIG_SND_DEBUG=y
CONFIG_SND_DEBUG_VERBOSE=y

~~How do I proceed to debug the XRUN?~~ Reading this: https://www.kernel.org/doc/html/v5.14-rc6/sound/designs/procfile.html#pcm-proc-files I've set: # echo 11 > /proc/asound/card0/pcm0p/xrun_debug and # echo 11 > /proc/asound/card0/pcm0c/xrun_debug but for card2 according to /proc/asound/cards and hunting the bug again

I also tried setting ridiculously low buffer/period size and sometimes some XRUNs occur but nothing a JACK can't recover from. But this exactly one XRUN I am talking about didn't happen, yet! From what I see now it's not related to buffer/period size. Happens randomly after long playback. It will be hard to diagnose.

AreYouLoco commented 2 years ago

And now when I try to hunt the bug it doesn't happen;/ But I am sure at some point it will blow up. I let it run and will report when it happens.

EDIT: I've let it play over night at 128samples/3periods and 0 XRUNS so far. I really don't know when does it happen. Maybe it's related to RT kernel.

AreYouLoco commented 2 years ago

I got it! Now the JACK server hanged and what I got in dmesg is:

[  465.098803] snd_oxfw fw1.0: XRUN: pcmC2D0p:0
[  465.098805] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G  OE  5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[  465.098806] Hardware name: [...]
[  465.098807] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[  465.098811] Call Trace:
[  465.098812]  dump_stack+0x76/0x94
[  465.098815]  __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
[  465.098823]  snd_pcm_update_state+0x107/0x110 [snd_pcm]
[  465.098830]  snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
[  465.098838]  ? __switch_to+0x84/0x440
[  465.098840]  snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[  465.098847]  process_one_work+0x1ce/0x440
[  465.098850]  ? process_one_work+0x440/0x440
[  465.098853]  worker_thread+0x50/0x3b0
[  465.098856]  ? process_one_work+0x440/0x440
[  465.098858]  kthread+0x186/0x1a0
[  465.098860]  ? __kthread_parkme+0xa0/0xa0
[  465.098862]  ret_from_fork+0x22/0x30

And again JACK log says the same as posted above:


Sun Nov 14 11:58:02 2021: Jack: **** alsa_pcm: xrun of at least 0.098 msecs
Sun Nov 14 11:58:02 2021: Jack: Repreparing capture
Sun Nov 14 11:58:02 2021: Jack: Repreparing playback
Sun Nov 14 11:58:02 2021: Jack: ALSA XRun wait_status = 0
Sun Nov 14 11:58:02 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Sun Nov 14 11:58:02 2021: Jack: JackClient::ClientNotify ref = 2 name = dbusapi notify = 3
Sun Nov 14 11:58:02 2021: Jack: JackClient::kXRunCallback
Sun Nov 14 11:58:02 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Sun Nov 14 11:58:02 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Sun Nov 14 11:58:02 2021: Jack: JackEngine::ClientNotify: no callback for notification = 3
Sun Nov 14 11:58:10 2021: ERROR: ALSA: poll time out, polled for 7999047 usecs, Retrying with a recovery, retry cnt = 1
Sun Nov 14 11:58:18 2021: ERROR: ALSA: poll time out, polled for 7999050 usecs, Retrying with a recovery, retry cnt = 2
Sun Nov 14 11:58:26 2021: ERROR: ALSA: poll time out, polled for 7999065 usecs, Retrying with a recovery, retry cnt = 3
Sun Nov 14 11:58:34 2021: ERROR: ALSA: poll time out, polled for 7999113 usecs, Retrying with a recovery, retry cnt = 4
Sun Nov 14 11:58:42 2021: ERROR: ALSA: poll time out, polled for 7999084 usecs, Retrying with a recovery, retry cnt = 5
Sun Nov 14 11:58:50 2021: ERROR: ALSA: poll time out, polled for 7999049 usecs, Reached max retry cnt = 5, Exiting
Sun Nov 14 11:58:50 2021: ERROR: JackAudioDriver::ProcessAsync: read error, stopping...
Sun Nov 14 11:58:50 2021: Jack: JackPosixThread::ThreadHandler : exit

I have to stop and start JACK to recover from it. Only the first line saying: snd_oxfw fw1.0: XRUN: pcmC2D0p:0 is understandable for me. XRUN occured in playback stream in snd_firewire_lib. Does it tell you anything more? Is it functions and registers dump?

EDIT: In general the XRUN counter was at 3 XRUNs. Two of them were normal XRUNs due to low buffer/periods and one was the one that is making the JACK process hang. Full log from dmesg is:

[  370.275999] ALSA: PCM: [Q] Lost interrupts?: (stream=1, delta=408, new_hw_ptr=16326816, old_hw_ptr=16326408)
[  370.276007] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G           OE     5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[  370.276011] Hardware name: [...]
[  370.276013] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[  370.276026] Call Trace:
[  370.276031]  dump_stack+0x76/0x94
[  370.276040]  snd_pcm_update_hw_ptr0.cold+0x78/0xe9 [snd_pcm]
[  370.276058]  snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[  370.276071]  process_one_work+0x1ce/0x440
[  370.276079]  ? process_one_work+0x440/0x440
[  370.276084]  worker_thread+0x50/0x3b0
[  370.276089]  ? process_one_work+0x440/0x440
[  370.276093]  kthread+0x186/0x1a0
[  370.276096]  ? __kthread_parkme+0xa0/0xa0
[  370.276100]  ret_from_fork+0x22/0x30
[  465.098697] ALSA: PCM: [Q] Lost interrupts?: (stream=0, delta=768, new_hw_ptr=20878080, old_hw_ptr=20877312)
[  465.098704] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G           OE     5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[  465.098707] Hardware name: [...]
[  465.098708] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[  465.098718] Call Trace:
[  465.098722]  dump_stack+0x76/0x94
[  465.098729]  snd_pcm_update_hw_ptr0.cold+0x78/0xe9 [snd_pcm]
[  465.098740]  ? _raw_spin_unlock_irq+0x1d/0x50
[  465.098744]  ? finish_task_switch.isra.0+0xd7/0x310
[  465.098748]  ? __switch_to+0x84/0x440
[  465.098751]  snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[  465.098760]  process_one_work+0x1ce/0x440
[  465.098764]  ? process_one_work+0x440/0x440
[  465.098767]  worker_thread+0x50/0x3b0
[  465.098769]  ? process_one_work+0x440/0x440
[  465.098772]  kthread+0x186/0x1a0
[  465.098774]  ? __kthread_parkme+0xa0/0xa0
[  465.098776]  ret_from_fork+0x22/0x30
[  465.098803] snd_oxfw fw1.0: XRUN: pcmC2D0p:0
[  465.098805] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G           OE     5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[  465.098806] Hardware name: [...]
[  465.098807] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[  465.098811] Call Trace:
[  465.098812]  dump_stack+0x76/0x94
[  465.098815]  __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
[  465.098823]  snd_pcm_update_state+0x107/0x110 [snd_pcm]
[  465.098830]  snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
[  465.098838]  ? __switch_to+0x84/0x440
[  465.098840]  snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[  465.098847]  process_one_work+0x1ce/0x440
[  465.098850]  ? process_one_work+0x440/0x440
[  465.098853]  worker_thread+0x50/0x3b0
[  465.098856]  ? process_one_work+0x440/0x440
[  465.098858]  kthread+0x186/0x1a0
[  465.098860]  ? __kthread_parkme+0xa0/0xa0
[  465.098862]  ret_from_fork+0x22/0x30
[  825.780497] ALSA: PCM: [Q] Lost interrupts?: (stream=1, delta=512, new_hw_ptr=512, old_hw_ptr=0)
[  825.780506] CPU: 0 PID: 70 Comm: kworker/0:1H Tainted: G           OE     5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[  825.780511] Hardware name: [...]
[  825.780514] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[  825.780529] Call Trace:
[  825.780535]  dump_stack+0x76/0x94
[  825.780546]  snd_pcm_update_hw_ptr0.cold+0x78/0xe9 [snd_pcm]
[  825.780565]  ? __switch_to+0x84/0x440
[  825.780572]  snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[  825.780588]  process_one_work+0x1ce/0x440
[  825.780595]  ? process_one_work+0x440/0x440
[  825.780600]  worker_thread+0x50/0x3b0
[  825.780606]  ? process_one_work+0x440/0x440
[  825.780611]  kthread+0x186/0x1a0
[  825.780615]  ? __kthread_parkme+0xa0/0xa0
[  825.780619]  ret_from_fork+0x22/0x30

I've diffed the dumps and it shows the following:

areyouloco@studio:/tmp$ diff one two
6,8c6
< __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
< snd_pcm_update_state+0x107/0x110 [snd_pcm]
< snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
---
> snd_pcm_update_hw_ptr0.cold+0x78/0xe9 [snd_pcm]

AreYouLoco commented 2 years ago

So it happen again:

EDIT: And again. I can totally reproduce it by letting it play. Same thing in the JACK/DMESG.

[ 3253.832489] snd_oxfw fw1.0: XRUN: pcmC2D0p:0
[ 3253.832490] CPU: 1 PID: 63 Comm: kworker/1:1H Tainted: G           OE     5.12.0-rc3-rt3-fw-00003-g608d0b4b4952 #1
[ 3253.832492] Hardware name: [...]
[ 3253.832492] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[ 3253.832497] Call Trace:
[ 3253.832497]  dump_stack+0x76/0x94
[ 3253.832500]  __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
[ 3253.832508]  snd_pcm_update_state+0x107/0x110 [snd_pcm]
[ 3253.832516]  snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
[ 3253.832523]  ? __switch_to+0x84/0x440
[ 3253.832526]  snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[ 3253.832533]  process_one_work+0x1ce/0x440
[ 3253.832536]  ? process_one_work+0x440/0x440
[ 3253.832538]  worker_thread+0x50/0x3b0
[ 3253.832541]  ? process_one_work+0x440/0x440
[ 3253.832543]  kthread+0x186/0x1a0
[ 3253.832545]  ? __kthread_parkme+0xa0/0xa0
[ 3253.832547]  ret_from_fork+0x22/0x30

takaswie commented 2 years ago

Great, thanks for the debug log.

As long as I can see, it is usual underrun issue for PCM playback substream.

Call Trace:
dump_stack+0x76/0x94
__snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
snd_pcm_update_state+0x107/0x110 [snd_pcm]
snd_pcm_update_hw_ptr0+0x21c/0x850 [snd_pcm]
...

The back trace tells us that the buffer of PCM playback substream has more PCM frames than stop_threshold sw parameter:

int snd_pcm_update_state(struct snd_pcm_substream *substream,
                         struct snd_pcm_runtime *runtime)
{
    snd_pcm_uframes_t avail;

    avail = snd_pcm_avail(substream);
    ....
    if (runtime->status->state == SNDRV_PCM_STATE_DRAINING) {
        ...
    } else {
        if (avail >= runtime->stop_threshold) {
            __snd_pcm_xrun(substream);
            return -EPIPE;
        }
    }
    ....
    return 0;
}

We can guess that the process of jackd don't wake up to put enough PCM frames into the buffer before handling the period elapsed event. Usually the process should awakened by the period elapsed event before the one.

In your case, to handle PCM frames in isochronous transmit (IT) context, 5 contexts run in below order expectedly:

processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context

hw IRQ context
->IRQ handler is invoked for the IT context
  -> tasklet is scheduled for the IT context

soft IRQ context
->tasklet is invoked for the IT context
  -> workqueue is scheduled for period elapsed event

workqueue context
->call of snd_pcm_period_elapsed()
  ->task for jackd is scheduled to wake up

process context
->jackd thread wakes up and fill the buffer with PCM frames

(for next IRQ)

If the order was not kept, XRUN occurs. In this point, I remember timing issue reported to upstream:

https://lore.kernel.org/linux-rt-users/YAwPoaUZ1gTD5y+k@hmbx/

As long as I know, with -rt patchset, the tasklet context run in the other workqueue context. The above order becomes below:

processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context

hw IRQ context
-> workqueue is scheduled for hw IRQ

workqueue context (A)
->workqueue is invoked for the hw IIRQ
  -> workqueue is scheduled for the IT context

workqueue context (B)
->workqueue is invoked for the IT context
  -> workqueue is scheduled for period elapsed event

workqueue context (C)
->call of snd_pcm_period_elapsed()
  ->task for jackd is scheduled to wake up

process context
->jackd thread wakes up and fill the buffer with PCM frames

(for next IRQ)

From task scheduler, the priority of successive three workqueue context (A, B and C) is not deterministic, thus it's probable not to keep the above order. In the worst case, next workqueue context (A) runs earlier than previous workqueue context (C).

Hm. Anyway, I have never heard the similar issue. Would I ask you to do the same test with non-rt kernel? I'd like to check the same issue is regenerated regardless of -rt patches.

I note that my patch for ALSA PCM core was merged to v5.14, then the above order is shorten in stock kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/commit/sound/core?id=47271b1b98c9

processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context

hw IRQ context
->IRQ handler is invoked for the IT context
  -> tasklet is scheduled for the IT context

soft IRQ context
->tasklet is invoked for the IT context
  -> call of snd_pcm_period_elapsed()
    -> task for jackd is scheduled to wake up

process context
->jackd thread wakes up and fill the buffer with PCM frames

(for next IRQ)

But in the backport driver codes, the above is disabled due to backward compatibility.

Thanks

AreYouLoco commented 2 years ago

Right now I am doing the tests with stock kernel (5.14.0-4-amd64) and backported drivers. Next I will try stock kernel and stock drivers and will post result. The stock kernel is build without XRUN debugging options but I can tell if the JACKdbus process will hang or not.

Also as I stated before. The debug log is consisting more than one xrun logged and only one of them is critical that makes JACK process hang.

Let me test it for a few days and I will be back with the results.

AreYouLoco commented 2 years ago

So I think I figured it out. Seems like my script audio-tweaks.sh is making JACK hang somehow. Script is following:

#!/bin/bash

sudo modprobe -r ppdev # I don't have a parallel port
sudo modprobe -r lp # I don't use a printer when making music
sudo modprobe -r uvcvideo # I don't use a webcam when making music
sudo modprobe snd-hrtimer # Load the ALSA high res timer for my MIDI stuff

# "open up" the PCI bus by allowing fairly long bursts for all devices, increasing performance
sudo setpci -v -s "*:*.*" latency_timer=b0
# maximise latency for FireWire (IEEE 1394): Ricoh Co Ltd R5C832 PCIe IEEE 1394 Controller (rev 04)
# thus allowing more data per PCI transfer and minimizing xruns
sudo setpci -v -s 0d:00.3 latency_timer=ff
# ditto for the onboard Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Control>
sudo setpci -v -s 00:1b.0 latency_timer=ff

sudo cpufreq-set -g performance -c 0
sudo cpufreq-set -g performance -c 1

When I trigger the script by hand on kernel 5.14.0-4-amd64 with your backported driver I have the same exact XRUN that hangs the JACKdbus proccess. Do you have any clue why is it happening? Or do I need the script at all?

EDIT: I am not sure but I think on my RT kernel it didn't make JACK hang. I am building one more kernel to test and will post results.

AreYouLoco commented 2 years ago

I've build a new kernel which is lowlatency but not RT (5.12.0-rc8-lowlatency-fw) and it seems like this is the best kernel so far I've tested. No hangs in hours of playback. Will see how it goes but most likely the bug is only present on RT kernels.

EDIT: I was wrong;( The exactly one XRUN hang just happened on mentioned above kernel. I omitted RT patches so it's non-realtime kernel but with threaded IRQ and 1000Hz timer. I've build it with XRUN debug options will see if it's the same XRUN debug log...

[11368.138248] snd_oxfw fw1.0: XRUN: pcmC2D0p:0
[11368.138250] CPU: 0 PID: 130 Comm: kworker/0:1H Tainted: G           OE     5.12.0-rc8-lowlatency-fw+ #1
[11368.138253] Hardware name: [...]
[11368.138254] Workqueue: events_highpri pcm_period_work [snd_firewire_lib]
[11368.138259] Call Trace:
[11368.138261]  dump_stack+0x76/0x94
[11368.138264]  __snd_pcm_xrun.cold+0x5/0xa [snd_pcm]
[11368.138273]  snd_pcm_update_state+0x107/0x110 [snd_pcm]
[11368.138281]  snd_pcm_update_hw_ptr0+0x213/0x770 [snd_pcm]
[11368.138289]  ? __switch_to+0x84/0x480
[11368.138292]  snd_pcm_period_elapsed+0x75/0xa0 [snd_pcm]
[11368.138300]  process_one_work+0x1e3/0x3b0
[11368.138302]  ? process_one_work+0x3b0/0x3b0
[11368.138304]  worker_thread+0x50/0x3b0
[11368.138307]  ? process_one_work+0x3b0/0x3b0
[11368.138309]  kthread+0x142/0x160
[11368.138312]  ? kthread_associate_blkcg+0xc0/0xc0
[11368.138315]  ret_from_fork+0x22/0x30

So theory that it happens only on RT kernels seems to be not true.

EDIT: So the hang happened also on stock Debian kernel and your driver. So I think it's more a back-ported driver specific not kernel configuration related. I tried 3 different kernels with your back-ported driver: stock, low-latency and real-time and on all of them hang happens in different time matter but still happens.

AreYouLoco commented 2 years ago

I note that my patch for ALSA PCM core was merged to v5.14, then the above order is shorten in stock kernel:

* [git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/commit/sound/core?id=47271b1b98c9](https://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git/commit/sound/core?id=47271b1b98c9)

processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context

hw IRQ context
->IRQ handler is invoked for the IT context
  -> tasklet is scheduled for the IT context

soft IRQ context
->tasklet is invoked for the IT context
  -> call of snd_pcm_period_elapsed()
    -> task for jackd is scheduled to wake up

process context
->jackd thread wakes up and fill the buffer with PCM frames

(for next IRQ)

But in the backport driver codes, the above is disabled due to backward compatibility.

How could I enable this feature to test it out in the back-ported driver? Could you make a patch? I am trying stock kernel module of kernel version 5.14.0-4-amd64 so it should be enabled there, right?

takaswie commented 2 years ago

How could I enable this feature to test it out in the back-ported driver? Could you make a patch? I am trying stock kernel module of kernel version 5.14.0-4-amd64 so it should be enabled there, right?

This patch is what you expected:

commit 7ba5ca32fe6e ("ALSA: firewire-lib: operate for period elapse event in process context")
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7ba5ca32fe6e

In Linux kernel 5.13 or before, the kernel symbol, snd_pcm_period_elapsed_under_stream_lock(), is not found since it was introduced in 5.14.

takaswie commented 2 years ago

For your information, relevant change of ALSA PCM core was done by the patch below:

commit 47271b1b98c9 ("ALSA: pcm: add snd_pcm_period_elapsed() variant without acquiring lock of PCM substream")
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=47271b1b98c9

AreYouLoco commented 2 years ago

So I tried stock Debian kernel (5.14.0-4-amd64) with stock kernel modules and the hang happened as well. This kernel produces more XRUNs in general. I will try to reproduce it once more to be sure. And then try combination of mentioned commits. Is changing the quirks parameter of firewire-ohci module going to make it something different?

EDIT: I reproduced the hang on stock kernel with stock modules.

parm: quirks:Chip quirks (default = 0, nonatomic cycle timer = 0x1, reset packet generation = 0x2, AR/selfID endianness = 0x4, no 1394a enhancements = 0x8, disable MSI = 0x10, TI SLLZ059 erratum = 0x20, IR wake unreliable = 0x40) (int)

What values should I try? Debian default is 0x50 I am trying value 0.

takaswie commented 2 years ago

What values should I try? Debian default is 0x50 I am trying value 0.

In my opinion, it is not a good approach to expand test conditions to diagnose situation.

When diagnosing situation, I think three points are important as methodology:

knowledge about mechanism
software model relevant to runtime
fact from experiments with various view

It's a kind of methodology. Unless fulfill the above items, final evaluation is not reliable.Unless getting reliable evaluation, nothing can be going to be changed for better direction.

In your case (and the most of jackd users), the methodology is almost the same and rigid; removal of XRUN in jackd which they found by change system and runtime configuration variously. However, it's far from the behaviour to fix system and software runtime, since current implementation of jackd is neither reference software for digital audio processing nor ideal ALSA application for modern hardware.

Well, tentatively, I let you decide your system for testing without any modification, then gather facts; e.g. logs for various demands to diagnosis. The fact should not be 'I still encounter XRUN in jackd process'. The configuration and modification to system and runtime should be public as much as possible so that the person can diagnose situation as deep as possible. I showed relevant context before, they are:

1. processor receives IRQ by 1394 OHCI controller for IT context
->hw IRQ is queued for the IT context

2. hw IRQ context
->IRQ handler is invoked for the IT context
  -> tasklet is scheduled for the IT context

3. soft IRQ context
->tasklet is invoked for the IT context
  -> workqueue is scheduled for period elapsed event

4. workqueue context
->call of snd_pcm_period_elapsed()
  ->task for jackd is scheduled to wake up

5. process context
->jackd thread wakes up and fill the buffer with PCM frames

(for next IRQ)

The 1st context is blind from system since it is the part of hardware protocol (PCI Express), thus we can not get any log via system.

The 2nd context is done by processor as service for system software. Linux 1394 OHCI controller (firewire-ohci kernel module) includes IRQ handler invoked in the context. You can get system log by configuring module debug parameter.

$ modinfo firewire-ohci
filename:       /lib/modules/5.13.0-21-generic/kernel/drivers/firewire/firewire-ohci.ko
...
parm:           debug:Verbose logging (default = 0, AT/AR events = 1, self-IDs = 2, IRQs = 4, busReset events = 8, or a combination, or all = -1) (int)

The 3rd context is done by system, typically in the end of IRQ handling. ALSA IEC 61883-1/6 packet streaming engine (snd-firewire-lib kernel module) includes tasklet code to process isochronous packet. The state of packet processing can be visible by Linux Tracepoints framework. The tracepoint event is snd_firewire_lib:amdtp_packet.

The 4th context is done by kernel thread for system high priority workqueue. The 5th context is for user process. They are analyzed by the other Tracepoints events.

In my opinion, comparison between 2nd and 3rd contexts gives us perspective for system IRQ invocation at XRUN. The comparison between 3rd, 4th, and 5th contexts gives us perspective for task scheduling at XRUN.

AreYouLoco commented 2 years ago

Ok. Thanks for explanation but I think I will need some more help to get into that methodology.

For now from what I understood I've set debug parameter of firewire-ohci to -1. And I will try to catch the bug and post logs from dmesg. So dmesg is just filled with: firewire_ohci 0000:0d:00.3: IRQ 00000040 IT

EDIT: I've catch the bug again and there is nothing in dmesg. Where should I look now? Should I dump the packets when the hang happens? Will the dump tell you something interesting?

The tracepoint event is snd_firewire_lib:amdtp_packet.

AreYouLoco commented 2 years ago

Could you help me a bit with the other contexts? Especially the 3rd and 4th mentioned.

takaswie commented 2 years ago

For 3rd context, you did already (tracepoints in snd_firewire_lib module). For 4th context as well (XRUN debug log enabled via kernel configuration and procfs node).

AreYouLoco commented 2 years ago

So what else could I do to help you debug that? The issue is critical in the long sessions.

takaswie commented 2 years ago

So what else could I do to help you debug that? The issue is critical in the long sessions.

In my opinion, it's a good start point to test with axfer(1) in alsa-utils. It's rework of aplay(1) and manage to be sample program for all of operations supported in ALSA PCM interface.

AreYouLoco commented 2 years ago

Ok so I do have it installed but it requires wav-a-like file to play and I have no wav that are long enough to capture a problem. I will try to download some mix and convert it to wav. ~~How do I use axfer transfer playback after all?~~

EDIT: Or maybe I could do capture to some file and the bug will be there at some point.

For now I use:

$ axfer transfer capture --verbose --rate=48 --file-type=wav --xfer-type=libasound --device=hw:CARD=O1640i,DEV=0 --fatal-errors > capture.wav
Hardware PCM card 2 'Onyx 1640i' device 0 subdevice 0
Its setup is:
  stream       : CAPTURE
  access       : RW_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 16
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 24
  buffer_size  : 4096
  period_size  : 1024
  period_time  : 21333
  tstamp_mode  : NONE
  tstamp_type  : MONOTONIC
  period_step  : 1
  avail_min    : 1024
  period_event : 0
  start_threshold  : 1
  stop_threshold   : 4096
  silence_threshold: 0
  silence_size : 0
  boundary     : 4611686018427387904
  appl_ptr     : 0
  hw_ptr       : 0
Scheduling model:
  irq
CAPTURE: Format 'Signed 32 bit Little Endian', Rate 48000 Hz, Channels 16
Fail to process frames: Broken pipe
CAPTURE: Expected 67108863frames, Actual 719904frames

How do I know that the bug occurred? This command requires jack to be stopped. Do I need to sample all 16 channels or 1 is just enough?

takaswie commented 2 years ago

access : RW_INTERLEAVED

The mmap/interleaved operation is preferable to emulate the runtime of practical application, thus add -M or --mmap option.

When using --verbose (-v) option several times, the runtime of axfer dumps detail information, thus add -vv.

Anyway I recommend to study the standard runtime of ALSA PCM applications by poring through manual of axfer-transfer several times, especially for SCHEDULING MODEL and DESIGN sections.

AreYouLoco commented 2 years ago

I've read the whole axfer manual two times.

$ axfer transfer capture --verbose --verbose --rate=48 --file-type=wav --xfer-type=libasound --device=hw:CARD=O1640i,DEV=0 --mmap capture.wav
attributes for mapped page frame:
  sample number: 0
    address: 0x7f07b59cd000
    bits for offset: 0
    bits/frame: 512
  sample number: 1
    address: 0x7f07b59cd000
    bits for offset: 32
    bits/frame: 512
  sample number: 2
    address: 0x7f07b59cd000
    bits for offset: 64
    bits/frame: 512
  sample number: 3
    address: 0x7f07b59cd000
    bits for offset: 96
    bits/frame: 512
  sample number: 4
    address: 0x7f07b59cd000
    bits for offset: 128
    bits/frame: 512
  sample number: 5
    address: 0x7f07b59cd000
    bits for offset: 160
    bits/frame: 512
  sample number: 6
    address: 0x7f07b59cd000
    bits for offset: 192
    bits/frame: 512
  sample number: 7
    address: 0x7f07b59cd000
    bits for offset: 224
    bits/frame: 512
  sample number: 8
    address: 0x7f07b59cd000
    bits for offset: 256
    bits/frame: 512
  sample number: 9
    address: 0x7f07b59cd000
    bits for offset: 288
    bits/frame: 512
  sample number: 10
    address: 0x7f07b59cd000
    bits for offset: 320
    bits/frame: 512
  sample number: 11
    address: 0x7f07b59cd000
    bits for offset: 352
    bits/frame: 512
  sample number: 12
    address: 0x7f07b59cd000
    bits for offset: 384
    bits/frame: 512
  sample number: 13
    address: 0x7f07b59cd000
    bits for offset: 416
    bits/frame: 512
  sample number: 14
    address: 0x7f07b59cd000
    bits for offset: 448
    bits/frame: 512
  sample number: 15
    address: 0x7f07b59cd000
    bits for offset: 480
    bits/frame: 512

Hardware PCM card 2 'Onyx 1640i' device 0 subdevice 0
Its setup is:
  stream       : CAPTURE
  access       : MMAP_INTERLEAVED
  format       : S32_LE
  subformat    : STD
  channels     : 16
  rate         : 48000
  exact rate   : 48000 (48000/1)
  msbits       : 24
  buffer_size  : 4096
  period_size  : 1024
  period_time  : 21333
  tstamp_mode  : NONE
  tstamp_type  : MONOTONIC
  period_step  : 1
  avail_min    : 1024
  period_event : 0
  start_threshold  : 1
  stop_threshold   : 4096
  silence_threshold: 0
  silence_size : 0
  boundary     : 4611686018427387904
  appl_ptr     : 0
  hw_ptr       : 0
Scheduling model:
  irq
Waiter type:
  default
Transfer: libasound
  access: MMAP_INTERLEAVED
  sample format: S32_LE
  bytes/sample: 4
  samples/frame: 16
  frames/second: 48000
  frames/buffer: 4096
Container: builder
  format: riff/wave
  sample format: S32_LE
  bytes/sample: 4
  samples/frame: 16
  frames/second: 48000
  max frames: 67108863
Mapper: demuxer
  target: single
  access: MMAP_INTERLEAVED
  bytes/sample: 4
  samples/frame: 16
  frames/buffer: 4096
CAPTURE: Format 'Signed 32 bit Little Endian', Rate 48000 Hz, Channels 16

AreYouLoco commented 2 years ago

Now it seems to run and with --fatal-errors it will stop at XRUN right? Will the capture.wav be any useful? I cannot really see the advantage of using axfer (yet!).

EDIT: I could somehow make it play some online radio and when it hangs I will see it right?

AreYouLoco commented 2 years ago

I was not using hardware for some time and today again it hanged. @takaswie could you elaborate more on why would I use axfer to hunt that bug?

AreYouLoco commented 1 year ago

Hi @takaswie I just checked long time on a bit different setup. I tried on desktop computer with better FireWire controller (chipset XIO2213B) and the hang doesnt happen at all. There are some xruns time to time but unnoticable. I think we can close this issue and blame faulty FireWire controller on Thinkpad T420.

takaswie commented 1 year ago

I just checked long time on a bit different setup. I tried on desktop computer with better FireWire controller (chipset XIO2213B) and the hang doesnt happen at all.

Good to hear it ;)

takaswie / linux-firewire-dkms

Mackie Onyx 1640i: JACK instance hangs randomly after long playback. #40