Ringbuffer underflow/overflow

fotisdr commented 5 years ago

I decided to make an issue here to summarize my issues/requests with rtmixer, perhaps for future fixes/updates. In general, what I am trying to do is to execute an audio processing algorithm in real-time using rtmixer. I've already implemented this with jack-client, but, since rtmixer is better for many reasons, I'm trying to do the same with rtmixer.

As discussed previously in the jack-client repo, my main issue is that the audio stops whenever my processing algorithm is too slow. The cause is that the ringbuffer runs empty (underflow) and rtmixer then removes the playback action: https://github.com/spatialaudio/python-rtmixer/blob/d7095f228f58d08fa69916d3657e4996526144cd/src/rtmixer.c#L388-L393

The thing is that this happens even in cases where it shouldn't. For a blocksize of 1024 samples and a sampling rate of 16 kHz, for instance, the processing algorithm should have about 64 ms (1024/16000 s) available in the worst case. However, with a processing algorithm that takes about 40-50 ms I always get a ringbuffer underflow (I also put a printf inside that if block in rtmixer.c to make sure this was the cause). In fact, I measured the processing time inside Python and the worst case is 51 ms. Still, the audio playback always stops (the ringbuffer underflows), and the 'weird' thing is that in my script it always happens after 4 frames (maybe it's not weird, I just haven't understood the reason yet). You can find the framework of my script at the end of this issue.

The problem goes away if I double the pre-filling of the queue, but that increases latency, so for me it is a last resort. I also tried increasing the latency from 'low' to 'high' and setting the MixerAndRecorder blocksize to 0, but neither helped (in fact, setting blocksize=0 immediately gives a ringbuffer underflow and I still haven't found the cause). What's troubling me is whether the buffer underflow makes sense for a processing algorithm that is at least 10 ms faster than the limit. From various timings that I did, there seems to be some delay each time the input queue is filled with the first frame, which may be causing this buffer underflow (I haven't found the exact reason yet).

However, if I increase the pre-filling (also for the input queue) and set blocksize=0, it works. Of course, if the processing of a frame is slow, it stops (because of a buffer underflow), but that makes sense, as we said. Correct me if I got this wrong but, by setting blocksize=0, no matter how much I pre-fill the queues I would assume that after a while the playback catches up to the case where no pre-fill existed (the latency difference is lost).

Another thing that would be really good is an API feature that allows the playback to continue after ringbuffer overflows/underflows. My C programming skills are not that great, so I wasn't able to detect the buffer underflows and implement this from scratch, but I guess it would make sense to have a flag in the Python API that allows ringbuffer over/underflows during rtmixer playback.
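
The per-block time budget discussed above follows directly from the block size and the sampling rate. A minimal sketch of that arithmetic, using the values from this issue (the helper name block_budget_ms is made up for illustration):

```python
def block_budget_ms(blocksize, samplerate):
    """Duration of one audio block in milliseconds."""
    return 1000.0 * blocksize / samplerate


budget = block_budget_ms(1024, 16000)  # block size and sampling rate from this issue
worst_case_ms = 51.0                   # worst-case processing time reported above

print(f"budget per block: {budget:.1f} ms")
print(f"headroom:         {budget - worst_case_ms:.1f} ms")
```

This only gives the average budget. It says nothing about jitter in the callback timing, which eats into the headroom.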

#!/usr/bin/env python

from __future__ import division, print_function
from time import time,sleep
import rtmixer
import sys
import numpy as np

blocksize = 1024

latency = 'low'
samplerate = 16000
channels = 1
qin_size = 4
q_size = 4*qin_size

stream = rtmixer.MixerAndRecorder(
    channels=channels, blocksize=blocksize, #blocksize=0?
    latency=latency, samplerate=samplerate)
with stream:
    print('  input latency:', stream.latency[0])
    print(' output latency:', stream.latency[1])
    print('            sum:', sum(stream.latency))

    samplesize = 4
    assert {samplesize} == set(stream.samplesize)

    qin = rtmixer.RingBuffer(samplesize * channels, qin_size * blocksize)
    record_action = stream.record_ringbuffer(qin)

    q = rtmixer.RingBuffer(samplesize * channels, q_size * blocksize)
    buffer = np.zeros((blocksize,1),dtype='float32') # or q_size*blocksize?
    q.write(buffer)
    play_action = stream.play_ringbuffer(q)

    try:
        while True:
            while qin.read_available < blocksize:
                if record_action not in stream.actions:
                    break
                sleep(0.001)
            if record_action not in stream.actions:
                break
            read, buf1, buf2 = qin.get_read_buffers(blocksize)
            t = time()
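            # view the recorded block as a (1, blocksize, 1) array
            noisy = np.frombuffer(buf1, dtype='float32').reshape(1, blocksize, 1)
            # processing of 'noisy' is performed here and 'clean' is computed;
            # as a placeholder, the input is copied through unchanged
            clean = noisy.copy()
            buffer = clean.ravel() #.astype('float32')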
            qin.advance_read_index(blocksize)
            while q.write_available < blocksize:
                if play_action not in stream.actions:
                    print('Ringbuffer underflow')
                    break
                sleep(0.001)
            if play_action not in stream.actions:
                print('Ringbuffer underflow')
                break
            q.write(buffer)
            print(time()-t) # measure processing time
    except KeyboardInterrupt:
        print('\nInterrupted by User')

mgeier commented 5 years ago

Thanks for creating this issue!

For reference, this is the previous discussion: https://github.com/spatialaudio/jackclient-python/issues/59.

I'd definitely like to have an option to allow empty/full ringbuffers without quitting.

The important thing is to also create a way to communicate this situation to the user, because this shouldn't go unnoticed.
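
To make the idea concrete, here is a toy model in pure Python of a playback callback that keeps going when its queue is empty and counts how often that happened. It is only a sketch of the behaviour under discussion: the class TolerantPlayback and its attributes are hypothetical and are not part of rtmixer's API.

```python
from collections import deque


class TolerantPlayback:
    """Toy model of a playback callback that survives an empty queue."""

    def __init__(self):
        self.queue = deque()   # stands in for the output ring buffer
        self.underflows = 0    # counter that can be reported to the user

    def write(self, block):
        self.queue.append(block)

    def callback(self, frames):
        if self.queue:
            return self.queue.popleft()
        # Queue is empty: count the event and play silence
        # instead of cancelling the playback.
        self.underflows += 1
        return [0.0] * frames


player = TolerantPlayback()
player.write([0.5] * 4)
player.callback(4)   # returns the queued block
player.callback(4)   # queue is empty, returns silence
print('underflows:', player.underflows)
```

The counter is what keeps the situation from going unnoticed: the user can check it after the fact, or while the stream is still running.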

I hope I can implement this soon.

As a first step, I've created an example script based on your script above, see #10.

Any suggestions for improvements?

fotisdr commented 5 years ago

As a first step, I've created an example script based on your script above, see #10.

Any suggestions for improvements?

Great, the example seems pretty good! I will test this with my algorithm as well to see if it's working better. I am still not quite sure but what I feel is that the input queue pre-filling is always necessary when setting the blocksize=0. I think that if you don't pre-fill there is always a callback with a ringbuffer underflow in the beginning and that's probably the reason why it didn't work in the first place for me. On the other hand, when having a defined blocksize it worked without pre-filling the input queue. Also, I am wondering whether my previous hypothesis is correct:

Correct me if I got this wrong but, by setting blocksize=0, no matter how much I pre-fill the queues I would assume that after a while the playback catches up to the case where no pre-fill existed (the latency difference is lost).

mgeier commented 5 years ago

what I feel is that the input queue pre-filling is always necessary when setting the blocksize=0

I'm not sure. I think it's not strictly necessary but it makes sense to do it.

If the input queue is empty initially, the main loop simply waits until enough data is available. During this time, audio data (all zeros) is taken from the (pre-filled) output queue and played back. Therefore, there must be more pre-filling available in the output queue.

If the input queue is pre-filled, the DSP algorithm can start running immediately, but it has only zeros to work with. While the DSP algorithm is running, data is still taken from the (pre-filled) output queue, but as soon as the DSP algorithm is finished the first time, new data becomes available to fill the output queue. Therefore, less (output) pre-filling will be necessary.

I don't see what blocksize=0 changes here. AFAICT, it doesn't make a difference.

I think that if you don't pre-fill there is always a callback with a ringbuffer underflow in the beginning

If you don't pre-fill the input ringbuffer, you'll have to do more pre-filling in the output ringbuffer to avoid this initial (output) ringbuffer underflow.
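
The trade-off between input and output pre-filling can be explored with a small simulation. This is a simplified model, not a description of what rtmixer does internally. It assumes that the callback moves 64 frames at a time, that callbacks are perfectly evenly spaced, and that the DSP step handles 1024 frames and always takes 51 ms. All function names are made up for illustration.

```python
import math


def underflows(out_prefill, in_prefill, dsp_block=1024, cb_block=64,
               samplerate=16000, dsp_ms=51.0, n_callbacks=2000):
    """Return True if the output queue runs empty in this simplified model."""
    tick_ms = 1000.0 * cb_block / samplerate   # time between two callbacks
    dsp_ticks = math.ceil(dsp_ms / tick_ms)    # DSP time in whole callbacks
    q_in, q_out = in_prefill, out_prefill      # fill levels in frames
    done_at = None                             # tick at which the DSP finishes

    for tick in range(n_callbacks):
        # DSP thread: deliver a finished block, then start the next one
        # as soon as a full block of input is waiting.
        if done_at is not None and tick >= done_at:
            q_out += dsp_block
            done_at = None
        if done_at is None and q_in >= dsp_block:
            q_in -= dsp_block
            done_at = tick + dsp_ticks
        # Audio callback: play cb_block frames, record cb_block frames.
        if q_out < cb_block:
            return True
        q_out -= cb_block
        q_in += cb_block
    return False


def min_output_prefill(in_prefill, cb_block=64, limit=16384):
    """Smallest output pre-fill (in frames) that avoids underflow."""
    for out_prefill in range(0, limit + 1, cb_block):
        if not underflows(out_prefill, in_prefill, cb_block=cb_block):
            return out_prefill
    return None


without = min_output_prefill(in_prefill=0)
with_input = min_output_prefill(in_prefill=1024)
print('needed without input pre-fill:  ', without, 'frames')
print('needed with 1024 frames of input:', with_input, 'frames')
```

In this model the two cases differ by exactly one DSP block, so the total number of pre-filled frames is the same either way. Real callbacks are not evenly spaced, so actual requirements will be higher.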

Another (theoretical) option would be to call stream.play_ringbuffer() at a later time (or at the same time but with a given start value that's appropriately far in the future). But I'm not sure if that would actually work in practice.

But now that I'm thinking about it ... probably it wouldn't hurt to add a delay of dsp_size frames? Anyway, I think it doesn't really matter, because pre-filling the same amount of frames should have the same overall effect (I guess?).

when having a defined blocksize it worked without pre-filling the input queue

That's interesting, but it may not have been the original reason why it worked.

Did you check the actual block sizes when using blocksize=0?

Currently those block sizes are not reported, but I was thinking about storing the minimum and maximum block size (and probably the mean value) to be able to reason about that.

Correct me if I got this wrong but, by setting blocksize=0, no matter how much I pre-fill the queues I would assume that after a while the playback catches up to the case where no pre-fill existed (the latency difference is lost).

That's an interesting observation. I would have said it's wrong, but recently I've seen a similar effect happening, though I don't know exactly what was going on. That's part of the reason why I made the new example script. I still have to do some experimenting ...

But again, I don't know what difference blocksize=0 is supposed to be making?

Contemplating this purely theoretically, the audio data you are pre-filling shouldn't vanish in the long run. The number of input and output frames should always be the same (assuming the same physical device with the same clock for input and output, the PortAudio API doesn't even allow different sizes!), and each audio frame passes through both ring buffers eventually. Since the amount of frames added is always the same as the amount of frames removed, where should the pre-filling go?

Over time, the amount of frames can shift between the two ring buffers, depending on how fast the DSP algorithm is running. If DSP is quick, the input ringbuffer never gets very full, if it is slow, the content of the input ringbuffer grows while the content of the output ringbuffer shrinks (until at some point you get underflow, or DSP gets quicker again).

But still, the total amount of frames doesn't change, nor does the latency, right?
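
The conservation argument can be checked with a toy simulation. The sketch below is pure Python with made-up names. It assumes each callback moves exactly one block in and one block out, and lets a randomly fast or slow DSP loop move blocks from the input queue to the output queue:

```python
import random


def simulate(n_callbacks=1000, block=1024, prefill_in=1024,
             prefill_out=4096, seed=0):
    """Return the (input, output) fill levels in frames after each callback."""
    rng = random.Random(seed)
    q_in, q_out = prefill_in, prefill_out
    history = []
    for _ in range(n_callbacks):
        if q_out < block:   # output underflow: stop the simulation
            break
        # Audio callback: one block played, one block recorded.
        q_out -= block
        q_in += block
        # DSP loop: handles 0, 1 or 2 blocks before the next callback,
        # standing in for slow and fast processing.
        for _ in range(rng.choice([0, 1, 2])):
            if q_in >= block:
                q_in -= block
                q_out += block
        history.append((q_in, q_out))
    return history


history = simulate()
totals = {q_in + q_out for q_in, q_out in history}
print('distinct totals:', totals)
print('input fill level ranges from',
      min(q_in for q_in, _ in history), 'to',
      max(q_in for q_in, _ in history), 'frames')
```

The fill levels of the two queues move around, but their sum stays at the pre-filled amount for as long as the simulation runs.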

fotisdr commented 5 years ago

But now that I'm thinking about it ... probably it wouldn't hurt to add a delay of dsp_size frames? Anyway, I think it doesn't really matter, because pre-filling the same amount of frames should have the same overall effect (I guess?).

I get your points. I guess that the delay should have exactly the same effect as the pre-filling does, it shouldn't make a difference.

Did you check the actual block sizes when using blocksize=0?

Currently those block sizes are not reported, but I was thinking about storing the minimum and maximum block size (and probably the mean value) to be able to reason about that.

No I actually didn't, how can you check the actual block sizes?

Over time, the amount of frames can shift between the two ring buffers, depending on how fast the DSP algorithm is running. If DSP is quick, the input ringbuffer never gets very full, if it is slow, the content of the input ringbuffer grows while the content of the output ringbuffer shrinks (until at some point you get underflow, or DSP gets quicker again).

But still, the total amount of frames doesn't change, nor does the latency, right?

Actually yes, now that I think about it clearly, it shouldn't change. I was thinking that the input queue shrinks when the DSP algorithm is fast and that this would lead to no latency, but this is not true (the input queue just stays empty for a longer time), so theoretically we shouldn't get a decrease in latency. I will test this again to clarify everything.

EDIT: So, I used your script but the problem where the playback stops unexpectedly still exists. Specifically, with your sleep command the 'processing' takes about 51.27 ms and never stops. However, when I put my algorithm inside the function, although it takes about 26 ms to complete, the playback always stops after a while (sometimes it processes just 4 frames and stops, other times it continues for a while). The weird thing is that, with your script, the script doesn't stop until I interrupt it, although the processing function is no longer called after this point. In my case, the script gets stuck in the first loop while (q_in.read_available < blocksize and record_action in stream.actions): where the q_in.read_available returns constantly 64 without increasing (for a blocksize of 1024 and a sampling rate of 16 kHz at least). As soon as I interrupt with Ctrl+C, the print functions at the bottom report that no underflows/overflows happened, although the playback had stopped at some point. I still can't understand the reason for this; there must be a bug somewhere. Might it be something with the processing of the buffer? All I do inside the dsp function is this:

    # save buffer of shape (blocksize,1) into a 3d array of shape (1,blocksize,1)
    noisy[0,:,:]=buffer
    # compute 3d array 'clean' from 'noisy' 
    clean = processing_algorithm(noisy)
    # save clean 3d array (1,blocksize,1) into the buffer (blocksize,1)
    buffer=clean[0,:,:]

mgeier commented 5 years ago

how can you check the actual block sizes?

You'll have to modify the callback function to get that information. I think I will add this information to the stats structure at some point.
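
The bookkeeping itself is simple. Here is a sketch in Python of what such statistics could track (in rtmixer this would have to live in the C callback; the class BlockSizeStats is hypothetical and only illustrates the idea):

```python
class BlockSizeStats:
    """Running minimum, maximum and mean of the callback block size."""

    def __init__(self):
        self.count = 0
        self.total = 0
        self.min = None
        self.max = None

    def update(self, frames):
        """Record the number of frames of one callback invocation."""
        self.count += 1
        self.total += frames
        self.min = frames if self.min is None else min(self.min, frames)
        self.max = frames if self.max is None else max(self.max, frames)

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


stats = BlockSizeStats()
for frames in (64, 64, 128, 32):   # made-up block sizes for illustration
    stats.update(frames)
print(stats.min, stats.max, stats.mean)
```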

Specifically, with your sleep command the 'processing' takes about 51.27 ms and never stops. However, when I put my algorithm inside the function, although it takes about 26 ms to complete, the playback always stops after a while

This is strange. How are you measuring the duration?

If your algorithm allocates additional memory, this may take a different amount of time each time. The same applies if some OS calls are involved.

But there may also be things unrelated to your algorithm that may "steal" some time.
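
One way to make such fluctuations visible is to record the duration of every block and look at the worst case, not just single values. A minimal sketch, assuming a hypothetical process function standing in for the algorithm and using time.perf_counter() as the clock:

```python
from time import perf_counter, sleep


def time_blocks(process, blocks):
    """Run process(block) for each block; return the durations in ms."""
    durations = []
    for block in blocks:
        t0 = perf_counter()
        process(block)
        durations.append(1000.0 * (perf_counter() - t0))
    return durations


def summarize(durations_ms, budget_ms):
    """Mean, worst case and number of blocks that exceeded the budget."""
    return {
        'mean': sum(durations_ms) / len(durations_ms),
        'max': max(durations_ms),
        'over_budget': sum(d > budget_ms for d in durations_ms),
    }


def process(block):
    """Hypothetical stand-in for the real algorithm."""
    sleep(0.001)


durations = time_blocks(process, range(20))
print(summarize(durations, budget_ms=64.0))
```

A single block over budget is enough to empty the output queue if there is no spare pre-fill, even when the mean is comfortably low.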

the script gets stuck in the first loop while (q_in.read_available < blocksize and record_action in stream.actions): where the q_in.read_available returns constantly 64 without increasing

Yeah, this looks like a bug somewhere. As long as record_action is active, q_in.read_available should keep increasing.
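
Until the cause is found, the waiting loop could at least detect the stall instead of spinning forever. A sketch, assuming only that the ring buffer object has a read_available attribute (as used in the loop quoted above). The helper wait_for_frames and its parameters are made up for illustration.

```python
from time import monotonic, sleep


def wait_for_frames(ring, frames, is_active, stall_timeout=1.0, poll=0.001):
    """Wait until ring.read_available >= frames.

    Returns True when enough frames are available and False when
    is_active() becomes false.  Raises TimeoutError if read_available
    has not changed for stall_timeout seconds.
    """
    last = ring.read_available
    last_change = monotonic()
    while ring.read_available < frames:
        if not is_active():
            return False
        now = monotonic()
        current = ring.read_available
        if current != last:
            last, last_change = current, now
        elif now - last_change > stall_timeout:
            raise TimeoutError(f'read_available stuck at {current} frames')
        sleep(poll)
    return True
```

With the names from the example script, a call might look like wait_for_frames(q_in, blocksize, lambda: record_action in stream.actions).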

fotisdr commented 5 years ago

This is strange. How are you measuring the duration?

I'm just measuring using time.time():

t=time()
clean = processing_algorithm(noisy)
print(time()-t)

Yeah, this looks like a bug somewhere. As long as record_action is active, q_in.read_available should keep increasing.

So, that was the cause of this audio playback failure that I was describing in the first place. What I've actually noticed is that the loop always gets stuck at this point, where the q_in.read_available stops increasing, and that's how the audio playback stops, without actually producing a ringbuffer overflow/underflow.

mgeier commented 5 years ago

I have no idea yet what could be going wrong there. Probably it's just a silly bug in the callback function. How can I reproduce the problem?

BTW, I've just implemented a way to get information about the minimum and maximum block sizes: #13.

fotisdr commented 5 years ago

I have no idea yet what could be going wrong there. Probably it's just a silly bug in the callback function. How can I reproduce the problem?

That's the thing, I am not sure how exactly you can reproduce the bug. I am running some machine-learning algorithms in the processing functions and this happens sometimes, maybe a computationally expensive processing algorithm could do the trick. I can try to send you a stripped version of my algorithm on Monday, if you want.

BTW, I've just implemented a way to get information about the minimum and maximum block sizes: #13.

Perfect, I'll check this out next week :)

mgeier commented 5 years ago

I can try to send you a stripped version of my algorithm on Monday, if you want.

If you can strip it down far enough, I'd like to have a look. But I don't want to spend too much time installing a bunch of complicated libraries, if possible.

I thought about your use case a bit more, and I think rtmixer might not be the right tool in all cases:

If the parameters of your algorithm don't change (and the processing time doesn't fluctuate too much), you should be fine with just implementing a callback function in Python and using it with the sounddevice module. You might get a lower latency by doing that (but I'm not sure). You just have to set blocksize to the block size of your algorithm, and latency='low' or latency=0 (I don't know which works better).

If you want to avoid Python's GIL and its GC, you could try implementing your callback function with Cython (with the nogil feature). Please note that I've never actually tried that.

But if you want to dynamically change the parameters of your algorithm via Python code, I think using rtmixer with ring buffers could still be the right choice.
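
For the first option, a minimal sketch of a Python callback with the sounddevice module could look like this. The process function is a hypothetical placeholder for the algorithm, and the block size and sampling rate are the values from this issue. I haven't measured what latency this gives.

```python
def process(block):
    """Hypothetical stand-in for the DSP algorithm: pass audio through."""
    return block


def callback(indata, outdata, frames, time, status):
    """Called by sounddevice for every block of audio."""
    if status:
        print(status)   # reports input overflow / output underflow
    outdata[:] = process(indata)


def run(blocksize=1024, samplerate=16000, channels=1):
    """Open a full-duplex stream and process audio until Return is pressed."""
    # Imported here so that the functions above can be tried out
    # without audio hardware.
    import sounddevice as sd

    with sd.Stream(samplerate=samplerate, blocksize=blocksize,
                   channels=channels, dtype='float32', latency='low',
                   callback=callback):
        input('Press Return to stop\n')
```

Calling run() starts the stream. If the callback takes longer than one block, the status flags printed in the callback show the resulting underflow, and the stream keeps running.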

fotisdr commented 5 years ago

If you can strip it down far enough, I'd like to have a look. But I don't want to spend too much time installing a bunch of complicated libraries, if possible.

That's the reason why I avoided this, I guess it's difficult to send you a very simplified version of my code but I'll have a look.

If you want to avoid Python's GIL and its GC, you could try implementing your callback function with Cython (with the nogil feature). Please note that I've never actually tried that.

Yes, I wanted to try this with jack-client at some point but I never got around to it. Maybe I'll try again. In general, do you think that sounddevice alone would give a lower latency than jackd?

mgeier commented 5 years ago

In general, do you think that sounddevice alone would give a lower latency than jackd?

I would expect more or less the same minimal latency, but with JACK you have to specify a fixed block size explicitly, while with PortAudio you can use blocksize=0. But if you choose a fixed block size for your DSP algorithm, this doesn't really matter.

I guess with JACK the latency is somewhat more transparent and PortAudio has some non-obvious interplay of latency and blocksize, depending on the host API.

In the end, if you really care about latency, you should measure it.

spatialaudio / python-rtmixer

Ringbuffer underflow/overflow #9