spatialaudio / python-sounddevice

:sound: Play and Record Sound with Python :snake:
https://python-sounddevice.readthedocs.io/
MIT License

Using stream with callback for play and record at same time #470

Open aaronchantrill opened 1 year ago

aaronchantrill commented 1 year ago

I'm trying to open a stream and play some audio to it while at the same time capturing audio.

My hope is to allow users to select two different devices for input and output and implement some sort of AEC (acoustic echo cancellation) filter between them.

I am running into some problems just getting the basic functionality working. You can see my code at https://github.com/aaronchantrill/sounddevice_vad/blob/main/aectest.py which is just a proof of concept and about as simple as I can manage.
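For reference, the approach described above (one duplex stream that records while playing queued blocks) might be sketched roughly as follows. This is a minimal reconstruction under assumptions, not the actual aectest.py: the sample rate, block size, and variable names are all placeholders.

```python
# Hedged sketch of a duplex play-while-record stream.
# All names and parameter values here are assumptions.
import queue

import numpy as np

SAMPLERATE = 16_000   # assumed rate
BLOCKSIZE = 480       # 0.03 s at 16 kHz (assumed)

playback_q = queue.Queue()   # blocks waiting to be played
recorded = []                # input blocks captured during playback


def callback(indata, outdata, frames, time, status):
    """Duplex callback: capture input, play a queued block or silence."""
    if status:
        print(status)                 # surface under/overflow flags
    recorded.append(indata.copy())    # copy: indata is reused by PortAudio
    try:
        outdata[:] = playback_q.get_nowait()
    except queue.Empty:
        outdata.fill(0)               # silence when nothing is queued


def main():
    # Requires real audio hardware; kept in a function so the callback
    # logic above can be exercised without a device.
    import sounddevice as sd
    with sd.Stream(samplerate=SAMPLERATE, blocksize=BLOCKSIZE,
                   channels=1, dtype="float32", callback=callback):
        sd.sleep(1500)  # run long enough for a ~1 s file
```

Calling main() needs working hardware; the callback itself is a plain function that can be fed synthetic blocks for testing.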

I am sometimes getting odd PortAudio level error messages:

Expression 'err' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 3355
Expression 'ContinuePoll( self, StreamDirection_In, &pollTimeout, &pollCapture )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 3907
Expression 'PaAlsaStream_WaitForFrames( stream, &framesAvail, &xrun )' failed in 'src/hostapi/alsa/pa_linux_alsa.c', line: 4285

Other times, the program runs fine.

I have no idea what these messages mean, but when they appear the stream stops operating: the file does not play, and I don't end up with any data in the recording buffer. The errors don't appear to bubble up to the Python level where I would be able to handle them.

I don't know if this is a problem with my hardware. I haven't had an issue using sd.rec(), sd.play() and sd.playrec() so far.

The problem appears to happen after the sd.Stream() initialization or between callbacks. Sometimes one callback gets processed, other times none do, and other times everything works fine.

I'd love any suggestions about how I could make this more stable or otherwise improve it.

Thank you, Aaron

mgeier commented 1 year ago

What is the chunksize (which you are using as blocksize when creating the Stream)? Maybe the value is so big that PortAudio/ALSA cannot properly handle it?

If you are not relying on a certain block size, you can also just use the default value blocksize=0.

I guess your code example is not exactly what you want to do in the end, but if you want to handle large files (i.e. long recordings), it's probably not a good idea to use an unbounded queue on the playback side. Instead, what's typically done is to provide some "backpressure" by using a bounded queue. An example can be found here: https://github.com/spatialaudio/python-sounddevice/blob/master/examples/play_long_file.py
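The "backpressure" idea from the linked play_long_file.py can be sketched roughly like this (the buffer size here is an arbitrary assumption, not a recommendation):

```python
# Sketch of a bounded playback queue providing backpressure.
# BUFFERSIZE is an assumed value; tune it for the target hardware.
import queue

BUFFERSIZE = 20                      # max blocks buffered ahead of playback
q = queue.Queue(maxsize=BUFFERSIZE)  # bounded: put() blocks when full


def producer(blocks):
    """File-reading side: whenever the audio callback falls behind, put()
    blocks here, so memory use stays bounded no matter how long the file is."""
    for block in blocks:
        q.put(block, timeout=5)      # backpressure happens on this put()
```

The callback then consumes with q.get_nowait() as usual; only the producer side changes.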

For the writing side, see https://github.com/spatialaudio/python-sounddevice/blob/master/examples/rec_unlimited.py (here, no backpressure is needed).
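The recording side in the linked rec_unlimited.py follows the same callback-plus-queue pattern; a rough sketch, with the file-writing details left abstract (the drain helper and sink are hypothetical, not sounddevice API):

```python
# Sketch of the recording pattern: the callback only enqueues copies,
# and an ordinary loop outside the callback writes them out.
import queue

rec_q = queue.Queue()  # unbounded is fine here: disk keeps up with audio


def rec_callback(indata, frames, time, status):
    """Input-only callback: never block or do I/O here, just enqueue."""
    rec_q.put(indata.copy())


def drain(n_blocks, sink):
    """Hypothetical writer loop: move n_blocks from the queue to sink
    (in the real example, sink would be a soundfile.SoundFile)."""
    for _ in range(n_blocks):
        sink.append(rec_q.get())
```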

aaronchantrill commented 1 year ago

@mgeier Thank you for looking at this. The blocksize is calculated to be 0.03 seconds because WebRTCVad requires frames of 0.01, 0.02, or 0.03 seconds. In my case that is 480 frames (16,000 Hz × 1 channel × 0.03 s). I'm using VAD as a quick first indicator of whether the microphone is picking up anything it should pay attention to.
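The 480-frame figure works out as simple arithmetic (WebRTC VAD accepts only 10, 20, or 30 ms frames):

```python
# Blocksize for a 30 ms WebRTC VAD frame at 16 kHz mono.
SAMPLERATE = 16_000   # Hz
FRAME_MS = 30         # WebRTC VAD accepts 10, 20 or 30 ms

blocksize = SAMPLERATE * FRAME_MS // 1000  # frames per block
```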

The code example is pretty close to what I'm trying to do. What I am trying to get working right here is to play a sound while simultaneously recording the output from the speakers. I start by playing the Hello_there.wav audio file and end by generating the Hello_there_echo.wav file, which contains the audio picked up by the microphone while the file was playing. When it works, it works fine.

Ultimately, what I am trying to do is implement some sort of AEC between an arbitrary speaker and microphone. I have been able to use an AEC filter between sink and source in PulseAudio/PipeWire, but that requires setting the default source/sink to the filtered versions, since PortAudio does not appear to be able to access PulseAudio virtual devices directly; the best I have been able to do is use the "pulse" virtual device (which just uses the PulseAudio "default" devices), and I'd like to avoid rerouting the user's entire audio system just for my application. Additionally, since this approach requires either PulseAudio or PipeWire, it would make Windows compatibility more difficult, and Windows compatibility is a big selling point for PortAudio/PyAudio.

In my case, the offset in the Hello_there_echo.wav file pretty much matches the expected offset (output 0.09 s + input 0.06 s = 0.15 s of delay), so when it works I get exactly what I am hoping for: I can match up the audio I am playing with my recording of that audio playing. That's why it makes sense to me to use a single input/output stream here rather than separate input and output streams: it's easier to keep the output and input synced. The issue is that my sample code only runs successfully about 30% of the time. The rest of the time, I get the C error messages from the original post, nothing plays through the speakers, and no samples get recorded. This seems to occur when first opening the stream or before the first callback (nothing plays and indata is empty), so it does not seem to be related to the length of the audio file. The example file (Hello_there.wav) is also pretty short, just over one second.

I have discovered that sometimes opening and closing the output stream causes "popping", which I have been trying to avoid by keeping the stream open and just writing zeros to it when I don't actually have any audio to play. This may not be a good idea, but it seems to work when I'm using PyAudio directly. Another benefit of using a queue is that it serializes the output, so audio clips play sequentially regardless of whether something is already playing when they are added.

I'm trying to port this over to sounddevice because I'm hoping a better-supported project will simplify some of the maintenance. Since this is my first real attempt to use sounddevice in a project, I'm hoping these issues are due to something odd I'm overlooking.

Thank you, Aaron

aaronchantrill commented 1 year ago

I have tested this code on an old x86_64 Surface laptop running Ubuntu 22.04, an old x86_64 Dell laptop running up-to-date Arch, an aarch64 Libre ROC-RK3328-CC Renegade SBC running Armbian Bullseye, and Debian Bullseye under WSL on a Windows 11 laptop. All of them exhibit the same behavior: they work fine about one-third of the time, and the other two-thirds of the time they generate the error messages and neither play audio nor capture input. The line numbers in the error messages vary slightly, but the details are the same. This really makes me think it's something I'm doing, since otherwise everyone implementing an input/output stream with a callback would be hitting the same issue.

It seems likely that my blocksize is too small at only 0.03 seconds/480 frames. In your examples, you are using 2048 frames, or about 0.128 seconds of data between callbacks. Does it help to use blocksizes that are powers of 2? I would not be running the VAD directly on the indata anyway, but on the echo-cancelled indata, so I should be able to break that into any size blocks I want safely outside the stream loop.
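That re-blocking outside the callback could be sketched with a small helper like the following (a hypothetical class, not part of sounddevice): it accumulates whatever block size the stream delivers and emits fixed VAD-sized chunks.

```python
# Hypothetical re-blocking helper: stream blocks in, 480-frame chunks out.
import numpy as np

VAD_FRAMES = 480  # 30 ms at 16 kHz


class Reblocker:
    def __init__(self, frames=VAD_FRAMES):
        self.frames = frames
        self._buf = np.empty((0, 1), dtype="float32")  # leftover samples

    def push(self, block):
        """Append one stream block; return all complete VAD-sized chunks,
        keeping any remainder buffered for the next call."""
        self._buf = np.concatenate([self._buf, block])
        chunks = []
        while len(self._buf) >= self.frames:
            chunks.append(self._buf[:self.frames])
            self._buf = self._buf[self.frames:]
        return chunks
```

With 2400-frame stream blocks this yields exactly five 480-frame chunks per callback, and odd sizes just carry a remainder forward.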

aaronchantrill commented 1 year ago

Since the combined latency of my output (0.09 s) and input (0.06 s) streams is 0.15 seconds, I am trying to work with 0.15-second blocks (2400 frames), just to make it easy to match up the input and output streams (basically I only have to hold one output block in memory, which should match the input block of the next callback).

Is there any real reason to prefer powers of 2 when setting the blocksize?

If so, I should be able to set the latency a little higher and use blocks of 4096 instead of 2048 or 2400.

Thank you.

mgeier commented 10 months ago

Sorry for my late response.

If you are still getting those PortAudio error messages you should ask at the PortAudio project, maybe the folks there can help.

It seems likely that my blocksize is too small at only .03 seconds/480 frames.

I wouldn't say it's too small in general; this very much depends on the hardware and the host API.

With good hardware and drivers you can use 64 or 32 frames, maybe even less.

But if your code is supposed to run on a wide range of (cheap) hardware, maybe something like 1024 or 2048 is better.

As an example, Audacity uses the equivalent of latency=0.1 (and blocksize=0), which is quite conservative but should work nicely on most hardware.
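In sounddevice terms, those conservative settings correspond to real parameters of the Stream constructor; a sketch of how they would be passed (the surrounding callback is assumed):

```python
# Conservative stream settings in the spirit of the Audacity comparison:
# latency and blocksize are real sounddevice/PortAudio parameters.
CONSERVATIVE = dict(
    latency=0.1,   # ~100 ms suggested latency; PortAudio may adjust it
    blocksize=0,   # 0 = let PortAudio choose (possibly varying) block sizes
)
# usage (requires hardware):
#   sd.Stream(callback=my_callback, **CONSERVATIVE)
```

With blocksize=0 the callback may receive varying frame counts, so code that needs fixed-size chunks has to re-block outside the callback.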

In your examples, you are using 2048 frames or about .128 seconds of data between callbacks. Does it help to use blocksizes that are powers of 2?

I don't think it makes a big difference, but again this depends on the host API.

I guess most hardware supports power-of-2 block sizes, but "re-blocking" to other block sizes shouldn't cause any problems.

Since the combined latency of my output (.09) and input (.06) streams is 0.15 seconds,

I didn't fully understand whether you are using multiple streams now. In your example code you are using a single stream, but your comment suggests a separate input and output stream?

If you are using multiple streams, you shouldn't assume that the sample rate is synchronized, you should allow for drifting sample clocks.

aaronchantrill commented 9 months ago

Hi, thank you for getting back to me. I'm really not sure what you consider "good" hardware, or how that determination would be made. I'm using the microphone/speaker that came bundled with my Dell laptop, and also a USB conference phone. I have no idea how to evaluate the hardware from a good/bad perspective, other than that recordings of my voice sound good with no extra hissing or popping. I am looking to make a solution that works on a range of hardware.

I am attempting to use one input/output stream which may be connected to different hardware on the input and output sides. I am hoping that will help to keep things synchronized as much as possible. I guess I got the terminology confusingly wrong there because I think of the input and output as separate things even though they are both part of the same stream object.

mgeier commented 8 months ago

I'm really not sure what you consider "good" hardware, or how that determination would be made.

To a large extent that correlates with price. If your sound card costs $7, I wouldn't expect ultra-low latency, but if it costs $700, I would expect it to work reliably with a block size of 32 or even 16.

You mentioned a block size of 480 before, which I think should be possible on any hardware, but who knows.