spatialaudio / jackclient-python

🂻 JACK Audio Connection Kit (JACK) Client for Python :snake:
https://jackclient-python.readthedocs.io/
MIT License

Real-time audio signal processing in python #59

Closed ghost closed 5 years ago

ghost commented 5 years ago

Hello,

I've been trying to make jackclient work (if possible) for a real-time audio processing application in Python. You can see the code below:

import jack
import numpy as np

jclient = jack.Client("jclient")

jclient.register_port("in", jack.IsInput)
jclient.register_port("out", jack.IsOutput)
jclient.activate()

jclient.connect("system:capture_1","jclient:in")
for n_playback_port in (1, 2):
    jclient.connect("jclient:out", "system:playback_{}".format(n_playback_port))
#sr = jclient.get_sample_rate()
buffer_size = jclient.get_buffer_size()

jcapture = np.zeros((1,buffer_size), 'f')    
jinput = np.zeros((1,buffer_size), 'f')
joutput = np.zeros((1,buffer_size), 'f')

try:
    while True:
        print("Capturing audio")
        try:
            jclient.process(joutput, jcapture)
        except jack.InputSyncError:
            print("InputSyncError")
        except jack.OutputSyncError:
            print("OutputSyncError")

        #process jcapture here

        print("Playing back\n")
        try:
            jclient.process(jcapture, jinput)
        except jack.InputSyncError:
            print("InputSyncError")
        except jack.OutputSyncError:
            print("OutputSyncError")            
except KeyboardInterrupt:
    pass

The problem is that even if I don't process the signal at all (i.e. use the code above) and just play back the input, I hear the input but with interruptions between the consecutive frames (buffers). Am I doing something wrong, or is there a better way to do this? I'm using a 16 kHz sample rate and a 1024-sample buffer size for the jackd server.

mgeier commented 5 years ago

It looks like you are using a different module that happens to be called "jack" as well. Probably https://pypi.org/project/py-jack/?

If you use the jack module from this repository, your code will not work at all because most of the functions you are using don't exist.

Please have a look at the documentation: https://jackclient-python.readthedocs.io/

And the example programs: https://github.com/spatialaudio/jackclient-python/tree/master/examples

ghost commented 5 years ago

Thanks for the prompt reply!

I was testing both pyjack and your module, so I guess I mixed up the functions a bit, sorry for the inconvenience.

So, I see that thru_client.py does what I wanted to do above, so I'll read the documentation to find out how I can extract each buffer, process it and then send it to the output (are there any examples for this?)

Thanks a lot.

mgeier commented 5 years ago

Yes, thru_client.py is an example for this, but it uses plain Python "buffer" objects instead of NumPy arrays.

If you want to get the input buffer as NumPy array, use get_array() instead of get_buffer().

If you want to send stuff to the output, you have to get the output array (also with get_array()) and simply write to it.

Note that if you don't have output data available at the moment, you have to fill the output array with zeros. If your data is incomplete, you have to fill the remaining part of the array with zeros.

And note that all this has to be done in a "process callback" function, otherwise it won't work as expected.
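
A minimal sketch of that pattern (untested; the port names are just examples):

import jack

client = jack.Client('thru_client')
client.inports.register('in_1')
client.outports.register('out_1')

def process(frames):
    # runs in the JACK thread once per block
    data = client.inports[0].get_array()   # input block as NumPy array
    out = client.outports[0].get_array()   # output block as NumPy array
    out[:] = data  # pass-through; do the actual processing here
    # if only n valid output samples were available: out[n:] = 0

client.set_process_callback(process)

with client:
    input()  # keep the client running until Enter is pressed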

ghost commented 5 years ago

I get it now and it seems to be working, thank you so much :)

fotisdr commented 5 years ago

Hello again,

So, to be more specific, I am trying to implement a real-time noise denoiser. I combined the thru_client.py and play_file.py scripts; you can see my final script below. It is working, but I get a lot of dropouts (runs where the queue is empty). If you have any time to spare, could you please check my code and tell me if there is any way to further improve it?

Thank you in advance.

Mortis19



"""Real-time single-channel denoising

"""
from __future__ import division
from __future__ import print_function
import argparse
#import os
import jack
try:
    import queue  # Python 3.x
except ImportError:
    import Queue as queue  # Python 2.x
import sys
import threading
import numpy as np

# jack options
parser = argparse.ArgumentParser(description=__doc__)
#parser.add_argument('filename', help='audio file to be played back')
parser.add_argument(
    '-b', '--buffersize', type=int, default=1,
    help='number of blocks used for buffering (default: %(default)s)')
#parser.add_argument('-c', '--clientname', default='file player',
#                    help='JACK client name')
parser.add_argument('-m', '--manual', action='store_true',
                    help="don't connect to output ports automatically")
args = parser.parse_args()
if args.buffersize < 1:
    parser.error('buffersize must be at least 1')

#argv = iter(sys.argv)
#defaultclientname = os.path.splitext(os.path.basename(next(argv)))[0]
#clientname = next(argv, defaultclientname)
#servername = next(argv, None)

#buffersize = 1024

q = queue.Queue(maxsize=args.buffersize)
#print(args.buffersize)
event = threading.Event()

def print_error(*args):
    print(*args, file=sys.stderr)

def xrun(delay):
    print_error("An xrun occured, increase JACK's period size?")

def shutdown(status, reason):
    print_error('JACK shutdown!')
    print_error('status:', status)
    print_error('reason:', reason)
    event.set()

def stop_callback(msg=''):
    if msg:
        print_error(msg)
    for port in client.outports:
        port.get_array().fill(0)
    event.set()
    #raise jack.CallbackExit

def process(frames):
    if frames != blocksize:
        stop_callback('blocksize must not be changed, I quit!')
        return
    try:
        data = q.get_nowait()
    except queue.Empty:
        stop_callback('Buffer is empty: increase buffersize?')
        return  # no data available this cycle
    if data is None:
        stop_callback()  # Playback is finished
    else:
        client.outports[0].get_array()[:] = data.T

try:
    #import jack
    #import soundfile as sf

    client = jack.Client("thru_client")
    blocksize = client.blocksize
    samplerate = client.samplerate
    client.set_xrun_callback(xrun)
    client.set_shutdown_callback(shutdown)
    client.set_process_callback(process)

    client.inports.register('in_{0}'.format(1))
    client.outports.register('out_{0}'.format(1))
    i=client.inports[0]
    data=i.get_array()
    q.put_nowait(data)  # Pre-fill queue
    with client:
        if not args.manual:
            capture = client.get_ports(is_physical=True, is_output=True)
            client.inports[0].connect(capture[0])
            #print(capture[0])
            target_ports = client.get_ports(is_physical=True, is_input=True, is_audio=True)
            if len(client.outports) == 1 and len(target_ports) > 1:
                # Connect mono file to stereo output
                client.outports[0].connect(target_ports[0])
                client.outports[0].connect(target_ports[1])
            else:
                for source, target in zip(client.outports, target_ports):
                    source.connect(target)
        timeout = blocksize * args.buffersize / samplerate
        while True:
            noisy = client.inports[0].get_array()
            #t=time.time()
            # processing input
            noisy.shape = (1, blocksize)
            #clean=np.zeros(frames,)
            noisy = np.expand_dims(noisy, axis=2)
            clean = G_loaded.predict(noisy)
            clean = np.reshape(clean, (1, blocksize))
            clean = np.reshape(clean, (-1,))  # make it 1d by dropping the extra dimension
            #print(time.time()-t)
            q.put(clean, timeout=timeout)
        #q.put(None, timeout=timeout)  # Signal end of file
        event.wait()  # Wait until playback is finished
except KeyboardInterrupt:
    parser.exit('\nInterrupted by user')
except queue.Full:
    # A timeout occurred, i.e. there was an error in the callback
    parser.exit(1)
except Exception as e:
    parser.exit(type(e).__name__ + ': ' + str(e))

mgeier commented 5 years ago

I think the main problem here is that you use get_array() outside the process callback function. That is not allowed, please don't do that.

There is already one queue to move the audio data to the process callback after processing, but I guess you'll need a second queue to move the input data from the process callback to the main thread for processing. Or, if the denoising is quick enough (given the block size you want to use), you can do everything in the process callback itself.

Other than that, you should probably simplify your example to only use one input and one output port. If you need more ports later, you can add them once the first draft works as expected.
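
A rough sketch of the two-queue idea (untested; the queue names and sizes are illustrative):

import queue

qin = queue.Queue(maxsize=4)   # process callback -> main thread (raw input)
qout = queue.Queue(maxsize=4)  # main thread -> process callback (results)

def process(frames):
    # JACK thread: only the non-blocking queue methods may be used here
    try:
        # copy() is needed because JACK reuses the input buffer
        # after the callback returns
        qin.put_nowait(client.inports[0].get_array().copy())
    except queue.Full:
        pass  # the processing thread is too slow; drop this input block
    try:
        client.outports[0].get_array()[:] = qout.get_nowait()
    except queue.Empty:
        client.outports[0].get_array().fill(0)  # nothing ready: play silence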

fotisdr commented 5 years ago

Thanks a lot for your assistance.

Regarding the input/output ports, I'm using one input port and one output port for the created client; I just connect the client's output to both speakers in the end.

I've managed to do the processing inside the process callback (although I didn't notice any difference, to be honest), but I've had no success so far using two queues (I used a second queue to pass the data as you suggested, but I'm not sure about the put/get commands and the corresponding timeout to use, and I get full/empty queue or xrun errors).

mgeier commented 5 years ago

If you show your updated code, I can have a look.

In the process callback, you should definitely not use a timeout, you should only use the non-blocking variants of put() and get(). I'm not sure if a timeout on the other side of the queues makes sense ...

You should also read through https://github.com/spatialaudio/python-sounddevice/issues/148. It is not about JACK, but the mentioned usage of queues should be more or less the same.

fotisdr commented 5 years ago

Thanks for the prompt reply!

I managed to make it work after all using queues in the process callback. Specifically, I used another queue for passing the data to the main thread as you suggested by using the get_nowait and put commands for the two queues both in the main thread and in the process callback (without using any timeout).

Another question that I have in mind is whether I should transpose the data arrays that I put to and get from the queues (see lines 96 and 148 in my code below). Is there a reason for doing this? (I just copied it from the play_file example). https://raw.githubusercontent.com/fotisdr/pi-gan/353ea63912754a76d0bd25d54ff92dcd05b26701/denoise_overlap.py

Thanks again for your help.

mgeier commented 5 years ago

I'm glad to hear that you made good progress!

I'm a bit surprised that you are using a buffersize of 1 ... if you only need a queue with one element, you might as well get rid of the queue and do everything directly in the process callback. Or am I missing something?

About transposing the data: The array provided by the soundfile module contains the audio channels as columns. In other words, the shape of the array is (B, C), where B is the block size (in frames) and C is the number of channels.

If I iterated over such an array, I would get a sequence of frames, i.e. arrays with shape (C,). However, in the play_file.py example I wanted a sequence of channels (i.e. arrays with shape (B,)), because each JACK port only takes a single channel. In order to get this, I just transpose the array first and then iterate over it.

for channel, port in zip(data.T, client.outports):
    port.get_array()[:] = channel

For people not used to Python's iterator concept and zip() and all those good things, it might be more straightforward to do something like that:

for c in range(data.shape[1]):
    client.outports[c].get_array()[:] = data[:, c]

I didn't test this, but I think this should do the same thing. I personally prefer the first, more Pythonic, way of doing it.

But anyway, if you are only dealing with a mono signal in the first place, you don't have to transpose it. You can simply assign it to your one and only output port.

BTW, the register() function returns the new port (see https://jackclient-python.readthedocs.io/en/0.4.5/#jack.Ports.register), so you could directly assign it:

i = client.inports.register('in_1')

fotisdr commented 5 years ago

Thanks a lot for your valuable information.

Regarding the queues, I used them in the first place because I wanted to implement overlap between frames (i.e. a JACK server frame size of half the processing frame size), so I thought it would not be possible inside the process callback. However, now that I think about it, I guess I could do the same in the process callback as well, although I haven't tried it yet. The buffersize of the queues was set to 1 because for higher sizes I didn't see any difference other than latency added to the signal. So, my question is: is there any reason to use queues in a real-time application?

Thanks again for answering all of my questions...

EDIT: So, I tried doing the same inside the process callback, without using queues, but I get a lot of xrun errors for some reason ('thru_client' was not finished). You can find the code here: https://raw.githubusercontent.com/fotisdr/pi-gan/master/denoise_overlap_noq.py It should be the same code as before, so I don't know why so many xrun errors are introduced now. Any ideas?

mgeier commented 5 years ago

That's surprising to me, I would assume that if it works with a queue size of 1, it should work the same with everything done in the process callback.

I don't know why you get xruns in one case and not in the other. But you shouldn't worry about it, because in general it is a good idea to use queues anyway.

is there any reason to use queues in a real-time application?

Yes, definitely. It's the default way to do it.

The important limiting factor is that there is a limited amount of time for each audio block to be delivered to the sound card. If that deadline is missed, the output will be either silent or filled with garbage values (probably, but not reliably, the values of the previous output block). Either way, in most cases this leads to audible artifacts.

The problem is that there are multiple variables involved, e.g. the JACK block size and how long your processing takes for each block.

If you use queues, you can to some extent handle those variables. For example, if a user selects a small JACK block size (there might be other JACK applications running that need a certain block size), you can choose a larger queue size to compensate. Of course there are limits, it's not a magic wand.

If you are doing file or network IO, you should definitely use queues, because those processes might cause arbitrary delays, which can be taken care of by big enough queues.

Please also note that Python is in general not really suitable for real-time audio.

As I've written in the docs (https://jackclient-python.readthedocs.io/en/0.4.5/#jack.Client.set_process_callback):

Most Python interpreters use a global interpreter lock (GIL), which violates the above real-time requirement. Furthermore, Python’s garbage collector might become active at an inconvenient time and block the process callback for some time.

Because of this, Python is not really suitable for real-time processing. If you want to implement a reliable real-time audio/MIDI application, you should use a different programming language, such as C or C++.

One way to avoid the GIL and GC in the audio callback, is implementing the process function in C. If you are interested in this, you can have a look at my other project: https://github.com/spatialaudio/python-rtmixer.

BTW, the try/except in your process callback will not work, because the audio callback is called (by JACK) from a different thread, and pressing Ctrl+C only affects the main Python thread: the main thread will exit the context manager, your JACK client will be deactivated and the callback will no longer be called (but the except clause in the callback will never be reached). In your case I think this will be fine, because the worst thing that can happen is that there is still some data in the queue in the end, which shouldn't really be a problem.

fotisdr commented 5 years ago

Okay, once again thank you very much for the lengthy and really helpful response :)

First of all, the truth is that, even with the use of queues in my previous example code (https://raw.githubusercontent.com/fotisdr/pi-gan/master/denoise_overlap.py), I am getting a somewhat distorted output at the moment, although I am still trying to find the exact cause (whether it's introduced by the algorithm or by Python). The processing time needed right now for my code is barely lower than the available time so I might be getting these artifacts that you mentioned in certain callbacks.

I also just tried to increase the buffersize of the queues in my previous code a bit, but I get these queue.Empty errors after a while and the code stops. However, I tried this out in a rush, so I need to look into this more thoroughly.

I am aware of the limitations of Python but my processing algorithm is implemented in Python so I started doing everything in Python in the first place. However, your project seems really interesting and I think I should try this out, if it's possible to still use the python denoising algorithm.

About the try/except commands, I had already assumed that they are probably unnecessary in the process callback, but I just left them unchanged (thanks for pointing this out).

Thanks a lot!

EDIT: I just changed the data = qin.get_nowait() command (line 148) to data = qin.get() and now it works for higher buffersizes as well, although I don't really see a difference.

~~EDIT 2: I'm currently trying out your python-rtmixer (it looks very promising), but I'm having trouble finding a way to convert a numpy array to a buffer in order to play it back with the ringbuffer (i.e. I'm trying to do something equivalent to lines 36-37 of your play_file.py example, but for a numpy array: https://github.com/spatialaudio/python-rtmixer/blob/d7095f228f58d08fa69916d3657e4996526144cd/examples/play_file.py#L36-L37). What I'm trying to do is record the input, get it as a numpy array and then send the numpy array to the playback. You can see my code so far if you want, of course, but I'm still trying to figure out how the rtmixer works, so don't expect much! https://raw.githubusercontent.com/fotisdr/pi-gan/master/test.py~~

EDIT 3 (and final): I actually managed to implement what I mentioned above; it was simpler than I thought at first. You can find my code here: https://github.com/fotisdr/pi-gan/blob/master/thru.py The only thing that I am not certain about at the moment is the reason why the numpy array that I get from the buffer has twice the size of the blocksize that I define in the beginning (blocksize=512 returns a 1024-sized array). I just have to experiment a bit now to understand the time restrictions. Great work with the python-rtmixer, thanks a lot!!!

mgeier commented 5 years ago

The processing time needed right now for my code is barely lower than the available time so I might be getting these artifacts that you mentioned in certain callbacks.

Are you getting xruns at the same time? If not, the artifacts may have a different cause.

I get these queue.Empty errors after a while

Maybe you are not pre-filling enough?

your project seems really interesting and I think I should try this out, if it's possible to still use the python denoising algorithm.

Yes, because you would also use queues to transport the audio data to a Python thread. But of course you would need to add some latency for that.

If you want very low latency, you might have to re-implement the algorithm in some compiled language like C. If you manage to avoid the GIL (see https://cython.readthedocs.io/en/latest/src/userguide/external_C_code.html#declaring-a-function-as-callable-without-the-gil), it might even work with Cython, but I've never tried that myself.

In jackclient-python it is not (yet) possible to use a callback like that (unlike the sounddevice module), but if you need it, I can implement it.

I just changed the data = qin.get_nowait() command (line 148) to data = qin.get()

Yes, that totally makes sense, because in the denoising thread you actually want to wait for the input data to appear. There is nothing else you could do before that, right?

It's just important that you don't block the audio thread waiting for something.
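
For illustration, the main-thread loop can then look roughly like this (untested sketch; denoise() stands for the actual algorithm, qin/q are the two queues):

while True:
    data = qin.get()       # block until the next input block arrives
    clean = denoise(data)  # placeholder for the actual processing
    q.put(clean)           # hand the result back to the process callback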

I actually managed to implement what I mentioned above; it was simpler than I thought at first

Cool!

If you take some extra care with your NumPy arrays, you don't even need .tobytes(). If they are C-contiguous and use the 'float32' data type, you can use them just like that, see:

https://github.com/spatialaudio/python-rtmixer/blob/d7095f228f58d08fa69916d3657e4996526144cd/examples/bleeperoo.py#L67-L69

You would need to do something like this:

np.zeros((blocksize, channels), dtype='float32')

Please note that play_buffer() and record_buffer() are not made to be called in rapid succession for "streaming" behavior. As you have implemented it, there might be gaps (empty blocks) in the output and if you are unlucky, you might even write two different audio blocks to the same output block. You should definitely use the ring buffer functions for your use case.

the reason why the numpy array that I get from the buffer has twice the size of the blocksize that I define in the beginning

I guess that's because np.zeros() by default uses dtype='float64', therefore you get twice the number of bytes you actually want.
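
A quick way to see this (assuming a block size of 512 frames, mono):

import numpy as np

a = np.zeros(512)                   # default dtype is float64
b = np.zeros(512, dtype='float32')
print(a.nbytes, b.nbytes)           # 4096 vs. 2048: twice the bytes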

fotisdr commented 5 years ago

Thanks again for the long comments, and thanks for solving the blocksize issue as well; I don't know why I forgot the dtype='float32' argument in np.zeros!

First of all, concerning jackclient-python: I'm usually not getting xrun errors when I run the code (only rarely), so I think it's the algorithm that is introducing the artifacts. The code is working properly now, I think (https://github.com/fotisdr/pi-gan/blob/master/denoise_overlap.py), but I'm still not sure what difference it makes to use a higher than 1 buffersize for the queues.

Regarding the python-rtmixer, you can take a look at my code so far here: https://github.com/fotisdr/pi-gan/blob/master/rtmixer_cnn.py Right now I'm using the data.data command to get the buffer to bytes (line 94). The code seems to be working properly with the use of play_buffer() and record_buffer() commands, but I haven't fully understood the ringbuffer functions yet so I haven't succeeded in using them properly. How could I implement the same code with the use of the ringbuffer commands instead of the play_buffer() and record_buffer() commands?

Right now both ways seem to function more or less the same, although I think that the python-rtmixer is a bit slower than the jackclient-python, probably because of the sounddevice module.

mgeier commented 5 years ago

I'm still not sure what difference it makes to use a higher than 1 buffersize for the queues

Did you try using very small JACK block sizes? Like e.g. 32, 16?

Often the problems only appear with small-ish block sizes.

Right now I'm using the data.data command to get the buffer to bytes (line 94)

There is no need to do that. You can simply pass the NumPy array to the rtmixer function.

But please do something like this immediately before:

assert data.dtype == 'float32'
assert data.flags.c_contiguous

This is not checked by the rtmixer functions. And if you don't check that, it can lead to very nasty bugs.

The code seems to be working properly with the use of play_buffer() and record_buffer() commands

It may work if the system isn't under load, or if you are just lucky.

But there will be situations where it doesn't work, and there won't be any warnings or anything.

How could I implement the same code with the use of the ringbuffer commands

Very similar to how you used the queue.Queue objects. Except that you don't have to worry about the process callback, because that's handled automatically.

You just have to create two ring buffers, make sure the one going to the callback has enough data in it and then start recording + playback. Then you just have to repeatedly check if data is coming from the callback, and if yes, process it and send the result back in the other ring buffer. If no data is there, just wait a little while and then try again.

There are some examples, but you'll have to combine a few things to get what you need.

https://github.com/spatialaudio/python-rtmixer/blob/master/examples/delay.py

In this example, the same ring buffer is used for both recording and playback, that means it is not accessed from the Python code at all. You'll need two ring buffers instead.

https://github.com/spatialaudio/python-rtmixer/blob/master/examples/play_file.py

In this example, there is only playback and therefore only one ring buffer. Note the pre-filling.

https://github.com/spatialaudio/python-rtmixer/blob/master/examples/plot_input.py

In this example, there is only recording and therefore only one ring buffer. The update_plot() function is called repeatedly (waiting interval milliseconds in between). The ring buffer may be empty or it may have one audio block or multiple audio blocks in it. The while loop is taking care of this.

Note that the ring buffer just holds a sequence of bytes, you'll have to take care of converting the data.

If you use the write() function, you can simply pass a NumPy Array (with the properties mentioned above), for reading you will probably need frombuffer().

fotisdr commented 5 years ago

Hello again (and happy new year!),

Thanks again for the long comments. I've corrected the mistakes in my previous code, but I'm still having some difficulties implementing the same thing using ringbuffers. I think I got the basic idea, but I haven't found out yet how to pass the data to the buffer each time. You mentioned the write() function for this purpose, and from the examples and the documentation I can see how it works for a sound file, but I'm still trying to find out how to convert the data from a numpy array to the buffer.

I have created a draft script to try and make this work, if you want you can look at it here (e.g. line 77 & 93): https://github.com/fotisdr/pi-gan/blob/master/denoise_rtmixer.py

Warm regards, Fotis

mgeier commented 5 years ago

The important thing to understand is Python's "buffer protocol". This provides a very nice and efficient way to interface with plain low-level memory from your Python code.

So-called "buffer" objects have a very minimalistic interface on the Python side, you can only do buf[x:y] (to get some bytes) and len(buf) (to get the total number of bytes) (if I didn't forget anything), but there is a larger interface from the C side (which you, luckily, don't have to know in order to use "buffer" objects).

NumPy works very well with "buffer" objects, and probably the only thing you'll ever need is this: https://docs.scipy.org/doc/numpy/reference/generated/numpy.frombuffer.html.

The fun thing is that you can use this in both directions, i.e. for reading from and writing to a "buffer" object.

When you do something like

myarray = np.frombuffer(buf, dtype='float32')

... you'll get a readable and writable NumPy array. Note that the actual memory is not copied anywhere, the new NumPy array is created from the existing memory!

With this information, you should be able to use everything that returns a buffer (or two buffers), like get_read_buffers(), get_write_buffers() and read().

There are, however, two more functions that don't return a buffer but they take a buffer object as argument: readinto() and write(). The former will write into the buffer object, the latter will read from it.

Now the question is: How do you get a "buffer" object from a NumPy array?

And that's easy: A NumPy array is a "buffer" object!

You just pass it to the function and you are done. But beware, you should make sure that the NumPy array uses dtype='float32' and that it is C-contiguous, otherwise you will get unexpected data that probably looks like garbage.

That's why I mentioned above that you should check it, e.g.:

assert data.dtype == 'float32'
assert data.flags.c_contiguous
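
Putting these pieces together, reading from and writing to the ring buffers might look roughly like this (untested sketch; q_in, q_out and blocksize are illustrative names):

import numpy as np

# reading: read() returns a buffer object; view it as float32 samples
noisy = np.frombuffer(q_in.read(blocksize), dtype='float32')

clean = noisy  # placeholder for the actual denoising step

# writing: a C-contiguous float32 NumPy array *is* a buffer object
assert clean.dtype == 'float32'
assert clean.flags.c_contiguous
q_out.write(clean)
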
fotisdr commented 5 years ago

Thanks a lot again, I got it working now! https://github.com/fotisdr/pi-gan/blob/master/denoise_rtmixer.py I hadn't fully understood the write and read functions and how to pass the buffer contents to the ringbuffer, but your comment clarified everything. It was easy after all...

Thank you for the long discussion. I am now trying several things to accelerate my model, but I will also check whether the rtmixer wrapper really makes a difference compared to jackd.

mgeier commented 5 years ago

I'm glad I could help!

Please note that there is a bug in your code if the first of the two inner while loops (starting in line 82) runs more than once. In this case, the buffer variable from the previous iteration(s) is overwritten.

Also, if the second inner while loop runs more than once, it will write the same buffer repeatedly.

Also the q.get_write_buffers() call in the second loop is superfluous since you don't use its result.

Finally, please note that the rtmixer module is work in progress. Some of the API should probably be changed at some point and some features are not implemented yet. In the source code and in the examples there are still quite a lot of "TODO" comments. If you find something that is missing or not working correctly, please open an issue at https://github.com/spatialaudio/python-rtmixer/issues!

fotisdr commented 5 years ago

Thanks again, but I'm not sure how to correct this bug. I tried solving it by substituting the while with an if: https://github.com/fotisdr/pi-gan/blob/master/denoise_ncs2.py but I'm not sure if that's the best way. To avoid the first while loop running more than once and overwriting the buffer, should I use a queue instead of an array, or should I write the buffer immediately?

mgeier commented 5 years ago

If I understood correctly, you want to write each incoming buffer to the output queue. I would assume that it is easiest to "sleep" until both queues are ready for reading/writing one block and then read/write exactly one block.

Something like this (untested!):

while True:
    while qin.read_available < blocksize or q.write_available < blocksize:
        sleep(0.01)
    # read one block
    # do stuff to it
    # write one block

BTW, your sleep time in sleep(.0001) seems extremely short. With your sampling rate and block size, one block takes about 0.064 seconds.

OTOH, do you actually need to specify the blocksize of the stream? One great thing about the rtmixer.RingBuffer is that it is not limited to a fixed block size. I think you could leave the default blocksize=0 and just don't care about which block size is used for writing to the input queue. When you are reading from the queue, you can choose whatever block size you need, it doesn't have to be the same as the one used for writing!

fotisdr commented 5 years ago

The thing is that the processing algorithm that will be in between the read and write has a fixed input/output blocksize, so I need to wait until I can get an input block of given blocksize (e.g. 1024), process it and then write it to the output. Also, I suppose that it is better to separate the two conditions for the while loop (qin.read_available and q.write_available) in order to have some extra time available for processing until the write queue becomes available (or do you think that both qin and q get the same blocksize available at the same time? - I haven't fully understood this).

mgeier commented 5 years ago

the processing algorithm that will be in between the read and write has a fixed input/output blocksize

Yes, that was the point I was trying to make, I'm sorry I've been unclear. I meant that you can use rtmixer.MixerAndRecorder(blocksize=0) (or just don't mention blocksize because it's 0 by default), which allows the block size that PortAudio uses to be arbitrary (possibly changing during the lifetime of the stream). This also means that the end of the ringbuffer that the rtmixer module handles may use arbitrary block sizes.

But that doesn't keep you from using a fixed block size on your end of the ring buffers!

I suppose that it is better to separate the two conditions for the while loop (qin.read_available and q.write_available) in order to have some extra time available for processing until the write queue becomes available

That's a good point! If you do that, you can probably reduce the initially pre-filled size of the output ringbuffer (and therefore the latency) a little bit.

However, I've just gotten another idea: you could choose the output ringbuffer to be significantly longer than the input ringbuffer. In that case, it would never happen that there is no space in the output ringbuffer. And you wouldn't need separate checks.

I'm not sure which strategy is better, though ... it might be a matter of taste.

or do you think that both qin and q get the same blocksize available at the same time?

No. The time difference might be quite small, but it's definitely not the same time. I think this even depends on whether you call record_ringbuffer() or play_ringbuffer() first. Depending on that, the callback will either first write to the input queue or first read from the output queue. But of course this also depends on whether the output queue is pre-filled (which it should be!). By pre-filling you effectively reduce the size of the output queue.

Which would be a reason to use a longer output queue in the first place ...
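
For illustration, the pre-filling could look roughly like this (untested sketch; stream, q_out, q_in, prefill_blocks and channels are assumed names):

import numpy as np

# pre-fill the output ring buffer with silence so the callback finds
# data right away; the pre-filled amount also determines the added latency
q_out.write(np.zeros((prefill_blocks * blocksize, channels), dtype='float32'))

play_action = stream.play_ringbuffer(q_out)
record_action = stream.record_ringbuffer(q_in)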

Now that I'm looking at your code, I also see that you are using

rtmixer.RingBuffer(samplesize * channels, blocksize)

... which is actually a very short ringbuffer! It can only hold a single block. This is the same situation as we discussed above, where buffersize was 1. This kinda defeats the purpose of using a ring buffer in the first place.

You should use a ringbuffer size of several blocks.

If you decide to use blocksize=0, you can provide the length of the ring buffers in seconds, because "number of blocks" doesn't really mean anything in that case.
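
For example (illustrative sizes; note that the ring buffer size must be a power of two, which 8 * blocksize is, as long as blocksize itself is one, as is typical):

q = rtmixer.RingBuffer(samplesize * channels, 8 * blocksize)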

Please note how I've chosen the ringbuffer lengths in those examples: https://github.com/spatialaudio/python-rtmixer/blob/master/examples/sampler.py https://github.com/spatialaudio/python-rtmixer/blob/master/examples/plot_input.py https://github.com/spatialaudio/python-rtmixer/blob/master/examples/delay.py https://github.com/spatialaudio/python-rtmixer/blob/master/examples/play_file.py

BTW, if you find an error in any of the examples, please tell me! Those examples are actually not very well tested ...

fotisdr commented 5 years ago

I meant that you can use rtmixer.MixerAndRecorder(blocksize=0) (or just don't mention blocksize because it's 0 by default), which allows the block size that PortAudio uses to be arbitrary (possibly changing during the lifetime of the stream).

Okay, now it's clear

However, I've just gotten another idea: you could choose the output ringbuffer to be significantly longer than the input ringbuffer. In that case, it would never happen that there is no space in the output ringbuffer. And you wouldn't need separate checks.

Sounds like a good idea, I'll try this.

Now that I'm looking at your code, I also see that you are using

rtmixer.RingBuffer(samplesize * channels, blocksize)

... which is actually a very short ringbuffer! It can only hold a single block. This is the same situation as we discussed above, where buffersize was 1. This kinda defeats the purpose of using a ring buffer in the first place.

Sorry, I had already fixed that but forgot to update the code I attached before. I'm actually using a different buffersize (usually 2 or 8 times the blocksize) instead of the blocksize variable. However, I was using the same buffersize for both queues; now I'll try your idea of using different sizes for the two ring buffers.

BTW, if you find an error in any of the examples, please tell me! Those examples are actually not very well tested ...

I haven't found any errors so far. The only problem that I had with my current code was that, in the case where my processing was slower than the frame time (e.g. 64 ms in my case of 1024 samples and 16 kHz sampling rate), the algorithm stopped producing output (it got stuck in my second while loop). For the jack-client on the other hand, in the same case I got a lot of xrun errors (and those ugly garbage-filled buffers) but at least I knew that it was working. However, after fixing the issues mentioned I guess this may be fixed. In general I feel that, after using both the rtmixer and the jack-client for real-time processing, the jack-client is slightly faster than the rtmixer, although I need to fix these issues to be certain.

EDIT: I tried what we discussed but, if I use the same while loop for both the input and the output queues, I can never get an output (the condition is never fulfilled, and when it is fulfilled, qin holds more than the blocksize). Also, what I've noticed is that my script stops producing output after a while (it gets stuck on the while q.write_available < blocksize loop after a point - the same as when the processing algorithm is not fast enough, as I mentioned above). Here you can find the "almost" working code so far: https://github.com/fotisdr/pi-gan/blob/master/rtmixer_thru.py I still think it's a good idea to keep the two conditions separated, but I'm not sure what the correct way to do this is.

mgeier commented 5 years ago

For the jack-client on the other hand, in the same case I got a lot of xrun errors

Yes, those xrun errors are a good thing!

In the rtmixer module, you'll have to explicitly check if there were any xruns, it will not be reported automatically!

You can check for this after an action has been completed with action.stats (see https://github.com/spatialaudio/python-rtmixer/blob/master/examples/plot_recording.py) or after the stream has been deactivated, with stream.stats (see https://github.com/spatialaudio/python-rtmixer/blob/master/examples/delay.py).

If you want xrun information while the action you are interested in is still running, you can use stream.fetch_and_reset_stats(), see https://github.com/spatialaudio/python-rtmixer/blob/master/examples/fetch_stats.py.

You should always check for xruns!

the jack-client is slightly faster than the rtmixer

This should really not be the case!

If rtmixer is still slower after you fixed the problems, please tell me! Because then there is something wrong!

and when it is fulfilled, qin holds more than the blocksize

That shouldn't be a problem.

What about doing it like this (again, untested):

while True:
    while qin.read_available < blocksize:
        sleep(0.001)
    # read one block
    # do stuff to it
    while q.write_available < blocksize:
        sleep(0.001)
    # write one block

it gets stuck on the while q.write_available < blocksize loop after a point

As mentioned somewhere above, this should never happen if the output queue is large enough.

fotisdr commented 5 years ago

So, it seems to be working now but I still have some questions. Below you can find my updated code: https://github.com/fotisdr/pi-gan/blob/master/rtmixer_thru.py

First of all, as you can see, I used one queue of size 2*blocksize for the input and one of size 8*blocksize for the output. Since I'm trying to get the processed output with as little latency as possible, I would like the queue prefill to be as small as possible (preferably one blocksize). As far as I have understood, the size of the queue does not affect latency, just the size of the pre-filling of the output queue (this is correct, right?).

Also, what I've noticed so far (with the pre-filling being just one blocksize) is that sometimes the output works for a while and then it permanently stops (as I said before), or sometimes it doesn't work at all. I think the cause of this is that, in some cases, the processing algorithm takes longer to process a particular frame (i.e. doesn't make it in the available time - this can be fixed by increasing the pre-filling size, but I don't want to increase latency). If we consider the 1024 blocksize case for example (which corresponds to about 64 ms of available processing time at a 16 kHz sampling rate), my processing algorithm takes about 40-45 ms to complete for each frame, which should be enough time for each block. However, it can happen that in some frames the algorithm doesn't produce an output in the available time period and that's when the audio stops permanently (I think). On the other hand, when this happens with pyjack, it throws an xrun error and continues to the next frame. Is there a way to do this on rtmixer as well?

Comparing the two of them now it seems that the rtmixer is faster than pyjack (about 0.1 ms in a 64 ms buffer - twice as fast compared to pyjack!), so I would be more than happy to switch to rtmixer if I can fix these issues :)

mgeier commented 5 years ago

the size of the queue does not affect latency, just the size of the pre-filling of the output queue

Exactly!

That's a very important piece of information for understanding how this whole thing works.

And you should try to actually measure the latency and check if your (and my) assumptions are true!

Another thing is that the blocksize and latency parameters are not really independent. For lowest latency, you should probably use blocksize=0 (and I mean the stream's blocksize, not the block size of your processing algorithm!).

The setting latency='low' is normally a good default, but it might not be perfect for a given hardware and driver and host API. You can definitely try to experiment with lower values than 'low'.

sometimes the output works for a while and then it permanently stops (as I said before), or sometimes it doesn't work at all.

I see. TBH, I've forgotten how this works, but now that you are mentioning it, I'm remembering ... The rtmixer module is meant to be very strict and not at all error tolerant. That's why, if the output queue is empty, playback is stopped:

https://github.com/spatialaudio/python-rtmixer/blob/d7095f228f58d08fa69916d3657e4996526144cd/src/rtmixer.c#L388-L393

It's probably not obvious how to detect such a failure. Probably there should be more information added to the play/record ringbuffer "action", see this comment: https://github.com/spatialaudio/python-rtmixer/blob/d7095f228f58d08fa69916d3657e4996526144cd/src/rtmixer.h#L47

For now, you can try to regularly check something like this:

if play_action not in stream.actions:
    # raise error, ringbuffer underflow

You can see this technique used in https://github.com/spatialaudio/python-rtmixer/blob/master/examples/play_file.py.

this can be fixed by increasing the pre-filling size, but I don't want to increase latency

That's exactly the trade-off you'll have to find. You'll have to live with a non-zero latency ...

it can happen that in some frames the algorithm doesn't produce an output in the available time period

Yes, that's exactly the reason why you use a flexible buffer in the first place. And if you want to be on the safe side, you'll have to sacrifice a bit of latency.

That's the nature of audio processing on general purpose computers with general purpose operating systems, and it is unavoidable.

when this happens with pyjack, it throws an xrun error and continues to the next frame. Is there a way to do this on rtmixer as well?

Currently, no.

But this is exactly the reason why I'm very interested in this discussion.

As I said, the rtmixer module is work in progress, and this is one of the features that are still missing.

If you want to give it a shot, you can make a pull request to add this feature.

If not, it would already be very helpful if you could suggest what the API for such a feature should look like. There should probably be a flag similar to allow_belated to select whether ringbuffer under/overflow should be allowed, and there should be some way to check if that happened or not.

Please create an issue or a PR on https://github.com/spatialaudio/python-rtmixer!

it seems that the rtmixer is faster than pyjack

That's good to know, I'm relieved.

It might become even slightly faster if you use blocksize=0 (or not, you'll see).

fotisdr commented 5 years ago

Another thing is that the blocksize and latency parameters are not really independent. For lowest latency, you should probably use blocksize=0 (and I mean the stream's blocksize, not the block size of your processing algorithm!).

Right now, my script doesn't work by setting blocksize=0, but I remember that you mentioned somewhere that I have to define times and not frame sizes (I haven't tested this yet).

The setting latency='low' is normally a good default, but it might not be perfect for a given hardware and driver and host API. You can definitely try to experiment with lower values than 'low'.

So, what exactly is the 'low' latency value? Also, the lowest possible latency value is by defining it to None or 0?

Thanks again for the feedback... Right now I am a bit busy with other things, but as soon as I have some time to spare I will take a look at this.

mgeier commented 5 years ago

my script doesn't work by setting blocksize=0

I don't see why not (but I didn't actually try). You can keep your blocksize variable as it is, just remove it from rtmixer.MixerAndRecorder() (or set blocksize=0 there).

I have to define times and not frame sizes

You don't really have to, but often it is easier to understand if you use values in seconds instead of multiples of block sizes.

But since your processing algorithm seems to have a fixed block size (right?), it's absolutely fine to use this for defining ringbuffer sizes etc.

So, what exactly is the 'low' latency value?

That's provided by PortAudio, if you want to know more details, you'll have to ask them.

You can find the default low and high latencies (as determined by PortAudio) for a given device with sounddevice.query_devices().

the lowest possible latency value is by defining it to None or 0?

There is no such thing as the "lowest possible latency". There is just a latency value where the likelihood of an xrun using a given piece of code on a given system at a given time is reasonably low.

You can use latency=0, but then you'll get xruns all the time (or the audio backend ignores your value and silently uses some latency > 0).

If you use latency=None, the default latency is used, which is 'high', which is probably not what you want!

But please note that for certain devices, there is no difference between 'low' and 'high'.

as soon as I have some time to spare I will take a look at this.

Sure, let me know when there are any news.

fotisdr commented 5 years ago

Hello again,

Unfortunately I have been working on other things and didn't have time so far to look into the rtmixer. However, I have been using jackd for my signal processing in Python and I have a question. I have noticed that the jackd server always runs with a buffering of at least 2 periods, specified via the -n (--nperiods) argument. I never really tweaked this because I wanted the lowest possible latency, so I kept the minimum value of 2. Since I sometimes buffer previous frames in my Python script, I was wondering whether jackd can perform this buffering and let me access the buffered frames somehow. Is there a way to do this? (maybe it will save some processing time as well)

mgeier commented 5 years ago

AFAIK, the JACK API doesn't contain any way to find out how many frames are used for buffering. Much less is it possible to access data from buffers other than the one you are supposed to read from or write to.

When you use -n 2, this basically means that at any given time, your Python application has exclusive access to one buffer, and jackd has exclusive access to the other one. And for each new audio block, those two buffers are switched. It would be highly unsafe if both your application and jackd would access the same buffer at the same time.

If you use -n 3, this means that your Python application still gets only one buffer at a time, but jackd gets access to two buffers at the same time. Again, it would be unsafe if one of those three buffers would be accessed by your application and jackd at the same time.

Long story short, the --nperiods argument can be used to give more buffers to jackd, but not to your application.

If you need buffering in your application, you'll have to do that yourself, for example with a queue.Queue or by using a jack.RingBuffer.

maybe it will save some processing time as well

I don't see how this would save processing time, do you mean that it could potentially reduce latency?

If you mean latency: no, I don't think you can reduce latency with this. The latency is given by the block size multiplied by the "number of periods" (plus probably some constant value). For example, 1024-sample blocks with 2 periods at 16 kHz give 1024 * 2 / 16000 = 128 ms.

fotisdr commented 5 years ago

Thanks a lot once again, I now understand the concept of the --nperiods argument. Regarding saving processing time, I meant avoiding queues and directly accessing the buffers from jackd, but now I see that this is not possible (or at least not safe).

fotisdr commented 5 years ago

I actually have another question, if you don't mind. What I'm trying to do now (using jack-client) is to run the jackd server with a small blocksize (e.g. 32 at a 16 kHz sampling rate) to get the lowest latency, but do the processing in larger frames (e.g. 256-sample frames). Do you think this is possible? I was thinking of using two large queues, emptying the input one every 256 samples, doing the processing and then filling the output one with the processed 256 samples. Then, 32 samples would be popped out of the output queue each time and played back by the jackd server. I was reading this example of yours, which does something similar in a way: https://github.com/spatialaudio/jackclient-python/blob/master/examples/play_file.py But I haven't found a proper way to do this yet. I guess the problem is that for each call of the process callback the whole queue is emptied. How can I do this in smaller steps? Do I need to use timeouts for the queues?

Sorry for bothering you again!

EDIT: I'm not sure that this is the correct way to do it, but I guess I could put the 256 samples into 256/32 = 8 chunks in the queue, although I don't know how to do this properly. I tried something like the following, but it produces a lot of xrun errors (and I'm not sure it's the best or fastest way):

blocksize = 128
model_blocksize = 2*blocksize
queuesize = 10

q = queue.Queue(maxsize=queuesize)
qin = queue.Queue(maxsize=queuesize)
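
# Assumed context, not shown in this excerpt: the imports and the
# print_error/xrun/shutdown/process definitions from the earlier script,
# plus: from time import time, sleep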

try:
    # initialise jackd client
    client = jack.Client("thru_client")
    blocksize = client.blocksize
    samplerate = client.samplerate
    client.set_xrun_callback(xrun)
    client.set_shutdown_callback(shutdown)
    client.set_process_callback(process)

    client.inports.register('in_{0}'.format(1))
    client.outports.register('out_{0}'.format(1))
    i=client.inports[0]

    capture = client.get_ports(is_physical=True, is_output=True)
    playback = client.get_ports(is_physical=True, is_input=True, is_audio=True)
    o=client.outports[0]
    timeout = model_blocksize / samplerate #16 ms

    # Pre-fill queues
    noisy=np.zeros(model_blocksize,)
    #qin.put_nowait(noisy)
    for j in range(0,int(model_blocksize/blocksize)):
        q.put_nowait(noisy[j*blocksize:(j+1)*blocksize])

    with client:
        i.connect(capture[0])
        # Connect mono file to stereo output
        o.connect(playback[0])
        o.connect(playback[1])

        while True:
            t=time()
            while time()-t<timeout:
                sleep(0.001) # do the processing here
            for j in range(0,int(model_blocksize/blocksize)):
                noisy[j*blocksize:(j+1)*blocksize]=qin.get()
            print(noisy.shape)
            for j in range(0,int(model_blocksize/blocksize)):
                q.put(noisy[j*blocksize:(j+1)*blocksize])

The way I imagined this, I need to have a queue or ringbuffer (I am not sure about this yet) which I would fill up with processed data according to my processing rate and from which the process callback would extract data according to the playback rate. The thing here is that I want these two tasks to be separated, executing in different time periods (e.g. filling and emptying the queue at different times or rates, with the processing using more frames than the playback).

mgeier commented 5 years ago

Yes, it's definitely possible to use a small JACK block size and use a longer block size for processing.

But sadly, it will not reduce latency. In your example with a block size of 32 frames you would have to wait while collecting 8 input blocks until you can start processing your first 256 frames. This will cause the same latency as you had before (with a JACK block size of 256).

If you still want to use different block sizes (for other reasons than for reducing latency), you could try using the jack.RingBuffer. This way you don't have to write and read data in multiples of some given block size. You can just write a bunch of frames to the ringbuffer and read a bunch of frames from the ringbuffer, but those two sizes don't have to be a multiple of each other (you just have to check each time if enough data/space is available for reading/writing, respectively).

I don't really know what's the problem with your code snippet above, but those loops in the end sure look confusing. If you want me to have a closer look, please provide a full runnable example.

But I think it would make more sense in this case to use the jack.RingBuffer.

One important difference between queue.Queue and jack.RingBuffer is that in the former, you can use blocking functions to wait for data to be available for reading (or space available for writing). The jack.RingBuffer doesn't have blocking functions (because it is lock-free), so you'll have to check in a loop, sleeping for some time between the checks.
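
A rough sketch of such a check-and-sleep loop with jack.RingBuffer (untested; rb_in/rb_out are illustrative names, sizes are in bytes, assuming mono float32 data):

import time
import numpy as np

bytes_per_block = model_blocksize * 4  # 4 bytes per float32 sample

while True:
    while rb_in.read_space < bytes_per_block:
        time.sleep(0.001)
    noisy = np.frombuffer(rb_in.read(bytes_per_block), dtype='float32')
    clean = noisy  # placeholder for the actual processing
    while rb_out.write_space < bytes_per_block:
        time.sleep(0.001)
    rb_out.write(clean)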

Incidentally, there's a very similar example over at the rtmixer project: https://github.com/spatialaudio/python-rtmixer/blob/master/examples/play_file.py

It of course doesn't use jack.RingBuffer, but the rtmixer.RingBuffer is very similar. It should hopefully be straightforward to port this example.

If you need help, please let me know. If you port the whole example, feel free to make a PR here, because it would be great to have an example that uses the jack.RingBuffer!

fotisdr commented 5 years ago

But sadly, it will not reduce latency. In your example with a block size of 32 frames you would have to wait while collecting 8 input blocks until you can start processing your first 256 frames. This will cause the same latency as you had before (with a JACK block size of 256).

So, what I realized after your previous post is that the jackd server uses this buffering (with the -n parameter), so the resulting total delay for a given sampling rate and frame size will be frame / sampling rate * 2 (if using the smallest possible buffering). As a result, if I use for example 1024 samples for framing and a 16 kHz sampling rate there will be a delay of 2*64 = 128 ms to listen back to the processed input. However, if I use a smaller sampling rate for the jackd there will be the same delay added by my processing but a smaller delay added by this buffering (e.g. for 32 samples an additional 2ms delay instead of 64 ms). Is my conclusion correct or am I mistaken here?

I don't really know what's the problem with your code snippet above, but those loops in the end sure look confusing. If you want me to have a closer look, please provide a full runnable example.

These for loops are used to get the data from the input queue and then pass it to the output queue in chunks. Since each frame is added as one item in the queue, I wasn't sure how to access fractions of this frame once it's in the queue. Is there a better/faster way to do this with queues? The code snippet worked, but I got a lot of xrun errors in general. The thing is that I can't do the same processing in the same time as before using this code (I guess it's much slower, which produces all these xrun errors). In case you are willing to take a look, I will provide a more detailed example tomorrow (unfortunately not sooner).

But I think it would make more sense in this case to use the jack.RingBuffer.

One important difference between queue.Queue and jack.RingBuffer is that in the former, you can use blocking functions to wait for data to be available for reading (or space available for writing). The jack.RingBuffer doesn't have blocking functions (because it is lock-free), so you'll have to check in a loop, sleeping for some time between the checks.

I was looking at the RingBuffer today and I'll give it a try tomorrow, but in general I don't like the idea of sleeping between checks (I think it creates unnecessary delays in the whole processing, but I might be wrong).

Incidentally, there's a very similar example over at the rtmixer project: https://github.com/spatialaudio/python-rtmixer/blob/master/examples/play_file.py

It of course doesn't use jack.RingBuffer, but the rtmixer.RingBuffer is very similar. It should hopefully be straightforward to port this example.

Yeah, actually that's how I got the idea to try this whole thing: I remembered this example from rtmixer and wanted to apply the same logic here. I would still like to implement everything with rtmixer, now that I understand the basic concept, but these timing restrictions are still a problem for my applications. I hope I'll find some time next month to take a look at this again :)

If you port the whole example, feel free to make a PR here, because it would be great to have an example that uses the jack.RingBuffer!

Of course, if I make it with the RingBuffer I'll make a PR with this. Thanks a lot :)

EDIT: Here is the working example: https://github.com/fotisdr/pi-gan/blob/master/denoise_new.py I must have done something wrong because it works if I set timeout = blocksize / samplerate but it doesn't if I do timeout = model_blocksize / samplerate. The point is to do the processing every model_blocksize frames and fill/empty the queues at these times.

I also did an implementation with the jack.RingBuffer, modelled on my rtmixer.RingBuffer version; you can take a look if you want: https://github.com/fotisdr/aecnn-rpi/blob/old/denoise_ringbuffer.py I didn't have time to test it yet, so I don't know if it actually works. Is it complete, or am I missing something? I am also not sure whether I should advance the pointers (the commented-out commands) or not.

mgeier commented 5 years ago

However, if I use a smaller sampling rate for the jackd there will be the same delay added by my processing but a smaller delay added by this buffering (e.g. for 32 samples an additional 2 ms delay instead of 64 ms).

Sorry, I don't understand this statement.

The jack.RingBuffer doesn't have blocking functions (because it is lock-free), so you'll have to check in a loop, sleeping for some time between the checks.

I was looking at the RingBuffer today and I'll give it a try tomorrow, but in general I don't like the idea of sleeping between checks (I think it creates unnecessary delays in the whole processing, but I might be wrong).

I think my statement was a bit misleading, let me clarify: For applications like reading data from an audio file, it's necessary to regularly check whether space for writing is available and sleep for some time if not.

However, for applications where you have a "live" input signal, and you want to do some processing and then provide the resulting signal to the same sound device (or at least one that's synced with the input, e.g. by wordclock), you don't need to sleep at all! You just have to choose the sizes of the ringbuffers appropriately.
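Roughly like this (untested sketch; rb_in and rb_out are hypothetical jack.RingBuffer instances, each big enough for your processing block plus one JACK block):

import numpy as np

def process(frames):
    # runs in the JACK thread: only non-blocking ringbuffer calls here
    rb_in.write(client.inports[0].get_array().tobytes())
    out = client.outports[0].get_array()
    if rb_out.read_space >= out.nbytes:
        out[:] = np.frombuffer(rb_out.read(out.nbytes), dtype='float32')
    else:
        out.fill(0)  # not enough processed data yet

A worker thread then takes model-sized chunks out of rb_in, processes them and writes the results to rb_out; if rb_out is pre-filled with enough zeros, the callback never runs dry.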

But let's first look at the version with the queue.Queue:

https://github.com/fotisdr/pi-gan/blob/master/denoise_new.py

The variable args is not defined.

You have to make a copy of the input block before putting it into the queue.

If you use `queue.Queue`, you don't need to `sleep()` at all. You can just use the blocking `get()` and `put()` functions (except in the process callback, where you should use the non-blocking functions!). If you remove the `while` loop that contains `sleep()`, it should work. You don't need the `timeout` variable at all.

I don't think you need `.ravel()`, right?

If your processing takes a long time (e.g. `(fraction - 1)` blocks), you will get a "Buffer is empty" message. Therefore, you should either increase the amount of pre-filling the output queue, or pre-fill the input queue as well. I think which one you choose doesn't really matter.
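Spelled out, the pattern looks roughly like this (untested sketch; qin, qout and process_model() are made-up names):

import queue

qin = queue.Queue()
qout = queue.Queue()

def process(frames):
    # JACK thread: never block here
    try:
        client.outports[0].get_array()[:] = qout.get_nowait()
    except queue.Empty:
        client.outports[0].get_array().fill(0)
    qin.put_nowait(client.inports[0].get_array().copy())

# in a worker thread, blocking calls are fine and no sleep() is needed:
while True:
    block = qin.get()  # waits until a block arrives
    qout.put(process_model(block))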

you can take a look if you want: https://github.com/fotisdr/pi-gan/blob/master/denoise_ringbuffer.py

I'll review this (or a later version of it) at a later time.

fotisdr commented 5 years ago

Sorry, I don't understand this statement.

I think I was a bit confused by this -nperiods parameter. Initially, I thought that the total latency to record something, process it and hear the processed input would be equal to blocksize/sampling rate, but now I see that the latency is actually double. If I got this right now, for example if we have frames with 1024 samples and 16 kHz sampling rate, the jackd server will need 64 ms to create the recorded frame and the processing will be performed in the next 64 ms, so eventually the processed frame is played back after 128 ms. I don't know how I thought this in the first place but, as you said, even if I use a small blocksize for the jackd, the resulting latency will always be the same.

However, for applications where you have a "live" input signal, and you want to do some processing and then provide the resulting signal to the same sound device (or at least one that's synced with the input, e.g. by wordclock), you don't need to sleep at all! You just have to choose the sizes of the ringbuffers appropriately.

That sounds great, I'll need to try this out and see!

But let's first look at the version with the queue.Queue: The variable args is not defined.

Sorry, the script I provided is part of a bigger script which includes all the processing that I do, so I just copy-pasted the necessary parts. I use the variable args to pass some arguments via the ArgumentParser.

 You have to make a copy of the input block before putting it into the queue.

You mean that I should use commands like q.put_nowait(noisy[0,j*blocksize:(j+1)*blocksize,0].ravel())? Why is that wrong?

 If you use `queue.Queue`, you don't need to `sleep()` at all. You can just use the blocking `get()` and `put()` functions (except in the process callback, where you should use the non-blocking functions!). If you remove the `while` loop that contains `sleep()`, it should work. You don't need the `timeout` variable at all.

I see, again that sounds great, I'll try this.

 I don't think you need `.ravel()`, right?

I use .ravel() because the arrays that I use are 3D, so I need to convert them to 1D before passing them to the queues (I think I got a size error when I didn't use .ravel(), but I'll check this again).

 If your processing takes a long time (e.g. `(fraction - 1)` blocks), you will get a "Buffer is empty" message. Therefore, you should either increase the amount of pre-filling the output queue, or pre-fill the input queue as well. I think which one you choose doesn't really matter.

I think in these cases I just get a lot of xrun errors but I don't get a "Buffer is empty" message, maybe the processing catches up in the next frame.

Thanks a lot once again, I'll try your advice and I'll also try to make everything run with the ring buffer as well.

mgeier commented 5 years ago

If I got this right now, for example if we have frames with 1024 samples and 16 kHz sampling rate, the jackd server will need 64 ms to create the recorded frame and the processing will be performed in the next 64 ms, so eventually the processed frame is played back after 128 ms.

Yes, that's right. But the latency you are experiencing will be higher. On the one hand, the JACK back-end might do its own buffering in some cases. On the other hand, there will be additional latency caused by the analog-to-digital and digital-to-analog converters of your sound card.

If you want to know the actual latency, you should measure it.

You mean that I should use commands like q.put_nowait(noisy[0,j*blocksize:(j+1)*blocksize,0].ravel())? Why is that wrong?

No, I was talking about the process callback (where you don't have the noisy variable).

I meant something like:

datain = client.inports[0].get_array()
qin.put(datain.copy())

To test this (and to provoke failure when you don't have the copy), you'll have to use longer processing times, to get multiple blocks into the queue at the same time. For example, try reaching a delay of 1 second. Then you will hear the problem.

If you don't make a copy, several audio blocks in the queue will refer to the same block of memory and JACK will keep overwriting this memory. It's a mess.

You don't have this problem if you use the jack.RingBuffer, because there you'll have to copy the audio data anyway.

The queue.Queue just stores references to NumPy arrays, which themselves are just references to memory areas!
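You can see the effect even without JACK (runnable sketch):

import queue
import numpy as np

buf = np.zeros(4, dtype='float32')  # stands in for the port's memory
q = queue.Queue()
q.put(buf)       # stores only a reference to the array ...
buf[:] = 1       # ... so later writes show up in the "queued" data
print(q.get())   # prints [1. 1. 1. 1.], not the zeros that were "sent"

With q.put(buf.copy()) instead, the queued block stays untouched.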

I think in these cases I just get a lot of xrun errors but I don't get a "Buffer is empty" message, maybe the processing catches up in the next frame.

You should check all possible error cases by choosing extreme settings and/or temporarily introducing sleep() calls.

Whenever you add some error check, you should also test whether this error can actually happen.
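For example, to check that the "queue is empty" branch can really be reached, you could temporarily slow down the processing (sketch; process_model() is a made-up name):

import time

def process_model(block):
    time.sleep(0.5)  # artificial load: forces the output queue to run dry
    return block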

HaHeho commented 5 years ago

datain = client.inports[0].get_array()
qin.put(datain.copy())

I have a question regarding that. So should it be a general rule of thumb then to always take a copy from the array received? And does that also apply to the step of delivering the data back to JACK?

e.g. this example (not using RingBuffer; the actual processing step happens in between, of course):

def _process_receive(self):
    # TODO: or port.get_array().copy() ?
    input_td = np.vstack([port.get_array() for port in self._client.inports])   
    return input_td

def _process_deliver(self, output_td):
    for data, port in zip(output_td, self._client.outports):
        port.get_array()[:] = data  # TODO: or data.copy() ?

mgeier commented 5 years ago

So should it be a general rule of thumb then to always take a copy from the array received?

No, only when you want to keep the array around for longer than the runtime of the process callback function.

For use within the callback function a copy is not necessary.

Either way, you don't need to explicitly copy the array in _process_receive(), because np.vstack() already makes a copy internally.

And you should never use .copy() on the output array, because then you would write your data to the copy of the array and the memory where JACK expects your data will be unchanged (probably filled with garbage values)!

See also the JACK docs (http://jackaudio.org/files/docs/html/group__PortFunctions.html#ga209880b64774dd039c703ea8e3b9ca63), which state:

"Do not cache the returned address across process() callbacks. Port buffers have to be retrieved in each callback for proper functionning."

I've also copied this sentence to the docs: https://jackclient-python.readthedocs.io/en/latest/#jack.OwnPort.get_buffer

It might not be clear enough, though. PRs for documentation improvements are welcome!

HaHeho commented 5 years ago

So should it be a general rule of thumb then to always take a copy from the array received?

No, only when you want to keep the array around for longer than the runtime of the process callback function.

For use within the callback function a copy is not necessary.

Thanks for the answer, that clears things up.

Either way, you don't need to explicitly copy the array in _process_receive(), because np.vstack() already makes a copy internally.

Yes, that's also what I suspected (and I would've asked for confirmation otherwise). Does the code with np.vstack() seem appropriate, or can you think of a more efficient way of gathering the input channels into a matrix?

And you should never use .copy() on the output array, because then you would write your data to the copy of the array and the memory where JACK expects your data will be unchanged (probably filled with garbage values)!

I think we were referring to different things here. I of course didn't mean to buffer or store the array references, but rather to put a hard copy of the data into the buffers, like so:

for data, port in zip(output_td, self._client.outports):
    port.get_array()[:] = data.copy()

But that should not be necessary?

I've also copied this sentence to the docs: https://jackclient-python.readthedocs.io/en/latest/#jack.OwnPort.get_buffer

It might not be clear enough, though. PRs for documentation improvements are welcome!

I will try to contribute to this package if applicable, since I'm using and depending on it. But all the parts I need right now are already well implemented and working perfectly, I believe. So thank you very much for YOUR contributions all around the scene and your engagement in the community. 👍

You might know from Jens what I am working on. Anyhow, the internal processing realizes partitioned overlap-save convolution, partly in the spherical harmonics domain. There, internal buffering over multiple frames is of course necessary: even in the most basic case, the time-domain input block gets buffered and shifted together with its predecessor. Considering this, a block basically always has to survive for at least two frames (possibly being altered and shifted along the way). Does that change anything in your answer regarding receiving or delivering the arrays to JACK?

HaHeho commented 5 years ago

@mgeier do you have a quick answer to the two questions from above?

Does the code with np.vstack() seem appropriate, or can you think of a more efficient way of gathering the input channels into a matrix?

I think we were referring to different things here. I of course didn't mean to buffer or store the array references, but rather to put a hard copy of the data into the buffers. But that should not be necessary?

mgeier commented 5 years ago

@HaHeho sorry for the late response!

Does the code with np.vstack() seem appropriate, or can you think of a more efficient way of gathering the input channels into a matrix?

I think it's appropriate.

I normally prefer np.column_stack() and np.row_stack(), because it's easier for me to understand what they are doing. IMHO it's easier than figuring out what "horizontal" and "vertical" mean in this case.

Anyway, if profiling shows that this is a bottleneck, you could try to avoid all those dynamic allocations. You could allocate a sufficiently large amount of buffers and then just re-use them and never allocate new buffers. OTOH, this may not be worth the effort, because NumPy might do something clever already to avoid re-allocations. Either way, you should measure this stuff before and after changing it.
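Such pre-allocation could look like this (untested sketch, assuming the port list is fixed after activation):

import numpy as np

# allocate once, outside the callback
input_td = np.empty((len(client.inports), client.blocksize), dtype='float32')

def process(frames):
    for i, port in enumerate(client.inports):
        input_td[i, :] = port.get_array()  # copy into the reused buffer
    # ... process input_td without further allocations ...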

Another thing you could try (but I don't know if it's worth it!), is to use the jack.RingBuffer as discussed further up in this issue. The difference would be that you choose the size of the ring buffer up front and all memory is allocated at once. And you would be forced to copy data into and out of the ring buffer, so no additional copies are necessary.

I think we were referring to different things here. I of course didn't mean to buffer or store the array references, but rather to put a hard copy of the data into the buffers. But that should not be necessary?

I think I misunderstood you.

I thought you were talking about something like this:

port.get_array().copy()[:] = data

This would be wrong and you would send garbage values to the sound card.

But now I think you mean this:

port.get_array()[:] = data.copy()

This would send the right data to the sound card, but the copy here would be superfluous.

When you assign to an array slice, the data has to be copied anyway. Using data.copy() would be an additional and unnecessary copy.

HaHeho commented 5 years ago

@mgeier thank you very much for the clear and helpful response.

Anyway, if profiling shows that this is a bottleneck, you could try to avoid all those dynamic allocations. You could allocate a sufficiently large amount of buffers and then just re-use them and never allocate new buffers. OTOH, this may not be worth the effort, because NumPy might do something clever already to avoid re-allocations. Either way, you should measure this stuff before and after changing it.

I was thinking exactly the same, without being able to guess whether it would be worth the effort. Intuition says that dynamic allocations in the main audio processing loop are a bad idea. On the other hand, the processing runs smoothly up to a certain complexity. I was hoping to find somebody who had already tried it at some point and could give a "reliable" prediction. Dynamic allocations in the audio processing thread actually happen a lot in my current implementation. Doing something like print(id(input_td)) reveals that different "identities" are used each frame (even though IDs seem to be reused in some following frames, which makes sense, since the memory was available again). But that would mean that dynamic allocation (even with possible numpy magic) seems like a bad idea, right?

Another thing you could try (but I don't know if it's worth it!), is to use the jack.RingBuffer as discussed further up in this issue. The difference would be that you choose the size of the ring buffer up front and all memory is allocated at once. And you would be forced to copy data into and out of the ring buffer, so no additional copies are necessary.

Internally, my audio frames are buffered and shifted (e.g. for partitioned convolution) by means of numpy.ndarray. File playback, on the other hand, is realized with multiprocessing.Queue, based on the example in this repo. I don't know about jack.RingBuffer, but I assume its purpose is to be a more performant alternative to the queue implementations? Or would there also be use cases and benefits with complex data in frequency domain (being altered during the processing)?

And one last question regarding "profiling". So far I have very little (and, as I remember, rather inconvenient) experience with profiling in Python. Instead, I have built some "benchmarking" around timeit, which was also awful to implement, even though it functions well now. I guess that is the way to go for isolated performance tests, e.g. different FFT implementations and such. Can you recommend an approach or package for real-life profiling of the entire audio processing chain? Is there an example in one of your repos?

mgeier commented 4 years ago

But that would mean that dynamic allocation (even with possible numpy magic) seems like a bad idea, right?

It depends on what you want to achieve.

If you want very low latency and a very low probability of xruns, you should simply not use Python. Use some compiled and low-overhead language like C, C++ or Rust. And don't make dynamic allocations from the callback function.

If you don't care about latency and the occasional audible artifact, you should feel free to continue using Python. Just use very big block sizes and try not to max out your CPUs; then you should be fine, even with many dynamic allocations.

You can of course try something in between, but I'm not sure it's worth investing too much time.

As I've already mentioned above, I've created the project https://github.com/spatialaudio/python-rtmixer to experiment with using a C callback function but doing the rest in Python. A similar thing could also be done with JACK (we would just have to create a way to pass a CFFI function pointer as process callback).

Avoiding dynamic allocations is often a good thing, but sometimes it isn't worth making the code more complicated because of this.

I don't know about jack.RingBuffer, but I assume its purpose is to be a more performant alternative to the queue implementations?

It depends on what you mean by "performant".

The main difference is that the queue module uses locks, while the ring buffer doesn't. This is mostly interesting if the callback function is implemented in a compiled language. If you use Python in your callback functions, you'll use locks anyway, because of the GIL.

A secondary difference is that the ring buffer reserves all its memory up front, and no further allocations are made.

Or would there also be use cases and benefits with complex data in frequency domain (being altered during the processing)?

I'm not sure what you mean by this, but in general you can send arbitrary data through the ring buffer. It doesn't have to be time-domain audio samples.
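For example, complex frequency-domain blocks go through just as well, because the ring buffer only sees bytes (sketch; rb and block are placeholders):

import numpy as np

spectrum = np.fft.rfft(block).astype('complex64')
rb.write(spectrum.tobytes())  # bytes in ...
if rb.read_space >= spectrum.nbytes:
    back = np.frombuffer(rb.read(spectrum.nbytes), dtype='complex64')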

Can you recommend an approach or package for real-life profiling of the entire audio processing chain? Is there an example in one of your repos?

Sorry, not really.

I've collected a few links here: https://nbviewer.jupyter.org/github/mgeier/python-audio/blob/master/optimization/index.ipynb

I've used SnakeViz before, which worked quite nicely.

I've also used the line_profiler, which is probably useful to analyze the process callback function line-by-line.
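For a first overview, something like this should be enough to feed SnakeViz (sketch, assuming the processing runs for a while inside a main() function):

import cProfile

cProfile.run('main()', 'audio.prof')
# then, in a terminal: snakeviz audio.prof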