spatialaudio / python-sounddevice

:sound: Play and Record Sound with Python :snake:
https://python-sounddevice.readthedocs.io/
MIT License

Speaker/Headphone Output Mixed with Mic Input Stream #359

Open saif-mahmud opened 2 years ago

saif-mahmud commented 2 years ago

If I read the audio stream from the microphone using sd.InputStream while other audio is playing on the machine, it records input from both the mic and the speaker, so the recording contains the playback audio as noise alongside the mic input. I had assumed it would record only the mic input, which is what my use case requires. Is there any workaround? My code is given below:

stream_in = sd.InputStream(
    device=device_in,
    samplerate=args.sample_rate,
    channels=1)
stream_in.start()
frame, overflow = stream_in.read(length)

Here, I have passed the default pulse input device (from the list below) as device_in and sample_rate is set to 16000.

$ python -m sounddevice
  0 HDA Intel PCH: ALC3234 Alt Analog (hw:0,2), ALSA (2 in, 0 out)
  1 HDA Intel PCH: HDMI 0 (hw:0,3), ALSA (0 in, 8 out)
  2 HDA Intel PCH: HDMI 1 (hw:0,7), ALSA (0 in, 8 out)
  3 HDA Intel PCH: HDMI 2 (hw:0,8), ALSA (0 in, 8 out)
  4 HDA Intel PCH: HDMI 3 (hw:0,9), ALSA (0 in, 8 out)
  5 HDA Intel PCH: HDMI 4 (hw:0,10), ALSA (0 in, 8 out)
  6 hdmi, ALSA (0 in, 8 out)
  7 pulse, ALSA (32 in, 32 out)
* 8 default, ALSA (32 in, 32 out)
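
A possible diagnostic, sketched under assumptions: on a typical Ubuntu setup the `pulse` and `default` ALSA devices are routed through the PulseAudio server, so recording from a raw hardware device (index 0 in the list above) might help show whether the sound server is involved in the mixing. The helper name `hardware_inputs` and the "(hw:" name heuristic are illustrative and not part of sounddevice; only `sd.query_devices()` and the `name` / `max_input_channels` keys come from the library.

```python
def hardware_inputs(devices):
    """Return (index, name) for input-capable devices that look like raw ALSA hardware.

    `devices` is a sequence of dicts shaped like the entries returned by
    sd.query_devices(). Matching "(hw:" in the name is only a heuristic.
    """
    return [
        (index, dev["name"])
        for index, dev in enumerate(devices)
        if dev["max_input_channels"] > 0 and "(hw:" in dev["name"]
    ]


def print_hardware_inputs():
    """Hypothetical convenience wrapper; needs sounddevice and PortAudio installed."""
    import sounddevice as sd  # imported lazily so the helper above has no audio dependency

    for index, name in hardware_inputs(sd.query_devices()):
        print(index, name)
```

An index found this way could be passed as `device=` to sd.InputStream. A raw hardware device may reject a sample rate of 16000 or a single channel, so the stream parameters might need to follow what the device reports.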

HaHeho commented 2 years ago

Do you mean that the microphone also picks up sounds other than your voice, e.g. being played over the speakers from the machine (or any other source in the environment)?

saif-mahmud commented 2 years ago

I mean that the recording contains the sound sent to the speaker/headphone in addition to my voice, not sound from the surrounding environment.

Suppose I'm streaming my voice in a web meeting: the input captures my voice as well as the sound from the other end that I hear in my headphone/speaker.

HaHeho commented 2 years ago

This is difficult to discuss when not using the proper vocabulary. From what I understand, this is what is happening: your microphone is picking up the acoustic "feedback" from the loudspeakers as part of the environment you're recording in, as every microphone would (independent of this software API). Or did I misunderstand, and you're confident that this is not why you're recording more than your voice?

saif-mahmud commented 2 years ago

Please excuse me for not using the proper vocabulary. Let me explain the issue again. Suppose I am playing audio and listening to it on my headphones. If I capture my voice from the microphone through sd.InputStream, it records both my voice and the audio being played on the headphones.

I understand that it is natural for the microphone to pick up acoustic "feedback" from the loudspeakers as part of the surrounding environment. But if I disconnect the microphone, there should be no input source (I have no embedded microphone on my desktop and I'm using a regular wired headphone with mic). At this moment sd.InputStream is supposed to record nothing. However, if I play the audio while the microphone is disconnected, sd.InputStream captures the audio being played although not coming from any input source.

I have added the code snippet I'm using for denoising input audio stream below for your reference:

import os
from datetime import datetime

import numpy as np
import sounddevice as sd
import torch
import torchaudio

def parse_audio_device(device):
    if device is None:
        return device
    try:
        return int(device)
    except ValueError:
        return device

def process():
    # input stream
    device_in = parse_audio_device(args.in_)
    stream_in = sd.InputStream(
        device=device_in,
        samplerate=args.sample_rate,
        channels=1)

    # denoised output stream
    device_out = parse_audio_device(args.out)
    stream_out = sd.OutputStream(
        device=device_out,
        samplerate=args.sample_rate,
        channels=1)

    stream_in.start()
    stream_out.start()

    length = 16000

    input_frames = np.zeros((1, length), dtype=np.float32)
    output_frames = np.zeros((1, length), dtype=np.float32)

    while True:
        try:
            # Reading audio frames for input to denoiser
            frame, overflow = stream_in.read(length)
            frame = torch.from_numpy(frame).to(args.device)
            frame = frame.view(1, length)

            # keep the raw input frames to verify what the input stream actually captured
            input_frames = np.append(input_frames, frame.cpu().numpy(), axis=1)
            print('noisy frame written ---> input buffer')

            with torch.no_grad():
                # denoiser inference
                out = enhance_model.enhance_batch(frame, lengths=torch.tensor([1.]))

                # writing denoised frame to output stream
                output_frames = np.append(output_frames, out.cpu().numpy(), axis=1)
                print('enhanced frame written ---> out buffer')

                out = out.view(length)

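            # sounddevice expects a NumPy array; a tensor on a GPU cannot be converted implicitly
            underflow = stream_out.write(out.cpu().numpy())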

        except KeyboardInterrupt:
            print("Stopping")

            # Saving files
            _inp = torch.from_numpy(input_frames)
            _out = torch.from_numpy(output_frames)

            date_time = datetime.now().strftime("%d%m%Y_%H%M%S")
            in_file = 'in_mgan_frame_sliced_' + date_time + '.wav'
            out_file = 'out_mgan_frame_sliced_' + date_time + '.wav'

            torchaudio.save(os.path.join('results', in_file), _inp, 16000)
            torchaudio.save(os.path.join('results', out_file), _out, 16000)

            break

    stream_out.stop()
    stream_in.stop()

saif-mahmud commented 2 years ago

I am attaching sample files (input records using sound-device input stream) for your reference: https://drive.google.com/drive/folders/1UWZFqpFkOnMSOfVBx43oIrQ3lgrZhJdm?usp=sharing

HaHeho commented 2 years ago

Thanks for the audio examples. :)

You indeed described accurately what you were encountering. However, it sounded very unlikely, so I wanted to make sure that we are not talking about acoustic feedback.

The behavior is of course unintended ... and I have no idea why it happens. However, I would say it is very unlikely to be caused by sounddevice. It is rather due to your code ... most likely what is happening within torch. I have no experience with that at all, but I could imagine it overwrites the data inside frame during the processing? So maybe the data in input_frames also gets altered, since it was given a reference and not a deep copy.

But then again, I have no idea what exactly is happening in frame = torch.from_numpy(frame).to(args.device) (in particular the last part). Maybe an explicit deep-copy with frame = torch.from_numpy(frame.copy()).to(args.device) already does the trick?

In any case, what you should check first is that the behavior is still encountered without any torch processing. If not, then you know that you will have to look there. ;)
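
For reference, torch.from_numpy is documented to return a tensor that shares memory with the source array, so an in-place change to a CPU tensor would also show up in the array. The sketch below illustrates the same reference-versus-copy behavior with NumPy only (an assumption made so that it runs without torch); `frame` is a stand-in for a block returned by stream_in.read().

```python
import numpy as np

# Stand-in for one block returned by stream_in.read(): shape (frames, channels)
frame = np.zeros((4, 1), dtype=np.float32)

view = frame.reshape(1, 4)   # reshaping a contiguous array gives a view on the same memory
copied = frame.copy()        # independent buffer, unaffected by later writes

view[0, 0] = 1.0             # an in-place write through the view changes `frame` too

# np.append always builds a new array, so the accumulated buffer does not alias `frame`
accumulated = np.append(np.zeros((0, 1), dtype=np.float32), frame, axis=0)

print(np.shares_memory(frame, view), np.shares_memory(frame, accumulated))
```

Since np.append returns a new array on every call, the accumulated buffer in the posted code would probably not alias `frame` either way, which makes the test without torch the more informative check.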

HaHeho commented 2 years ago

Another comment regarding the code: all the data is gathered in memory before being written to file. This is of course not good when you intend to do longer recordings. In that case you want to have the output written continuously. This should be buffered by a queue in between (like for example in rec_unlimited.py), especially since you are also doing processing on the data.

But you (we) can figure that out after resolving the initial issue.
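
The queue-based idea could look roughly like the sketch below. It is not a copy of rec_unlimited.py: it assumes 16-bit mono audio at 16000 Hz and uses the standard-library wave module instead of a dedicated audio file library. The names `write_blocks` and `record_to_file` are hypothetical.

```python
import queue
import threading
import wave

SAMPLE_RATE = 16000   # rate used elsewhere in this thread
CHANNELS = 1
SAMPLE_WIDTH = 2      # bytes per sample for 16-bit PCM


def write_blocks(path, blocks, stop_marker=None):
    """Drain `blocks` (a queue of raw int16 byte strings) into a WAV file.

    Writing stops when `stop_marker` is taken from the queue.
    """
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        while True:
            block = blocks.get()
            if block is stop_marker:
                break
            wav.writeframes(block)


def record_to_file(path, seconds):
    """Hypothetical wiring: the audio callback only enqueues, a worker thread writes."""
    import sounddevice as sd  # needs PortAudio and an input device

    blocks = queue.Queue()
    writer = threading.Thread(target=write_blocks, args=(path, blocks))
    writer.start()

    def callback(indata, frames, time, status):
        # Copy the data out of the buffer before the callback returns
        blocks.put(bytes(indata))

    with sd.RawInputStream(samplerate=SAMPLE_RATE, channels=CHANNELS,
                           dtype="int16", callback=callback):
        sd.sleep(int(seconds * 1000))
    blocks.put(None)  # tell the writer thread to finish
    writer.join()
```

Keeping the callback limited to enqueueing means slow disk access or processing does not block the audio thread; any denoising step would sit on the consumer side of the queue.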

saif-mahmud commented 2 years ago

Thanks for your detailed response. I've written a minimal Python script without any torch dependencies and tested it in the same scenarios as before. The code snippet is given below:

from pprint import pprint

import numpy as np
import sounddevice as sd
from scipy.io.wavfile import write

INPUT_DEVICE = 'default'

device_info = sd.query_devices(INPUT_DEVICE, 'input')
pprint(device_info)

SAMPLE_RATE = int(device_info['default_samplerate'])
FRAME_SIZE = 16000
CHANNELS = 1

if __name__ == '__main__':

    stream_in = sd.InputStream(device=INPUT_DEVICE,
                               samplerate=SAMPLE_RATE,
                               channels=CHANNELS)

    input_frames = np.zeros((FRAME_SIZE, CHANNELS), dtype=np.float32)

    stream_in.start()
    print('\nRECORDING AUDIO ...')

    while True:
        try:
            frame, overflow = stream_in.read(frames=FRAME_SIZE)
            input_frames = np.append(input_frames, frame.copy(), axis=0)

        except KeyboardInterrupt:
            print('\nSAVING INPUT STREAM TO .WAV FILE')
            write('minimal_sd_mic_only.wav', SAMPLE_RATE, input_frames)
            break

    stream_in.stop()

The configuration of my experiment was as follows:

OS: Ubuntu 20.04 LTS
Python Version: 3.8.3
Sounddevice Version: 0.4.1
Pulseaudio Version: 13.99.1
Headphone Model: Logitech H111 Stereo

I've obtained the same result with this script as with torch. The sample output audio files are attached in this drive link: https://drive.google.com/drive/folders/1UWZFqpFkOnMSOfVBx43oIrQ3lgrZhJdm?usp=sharing
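
To put a number on how much signal arrives while the microphone is unplugged, one option is to compare the RMS level of a recording made with playback running against one made with playback stopped. This is only a sketch: `rms_dbfs` is a hypothetical helper and it assumes float samples scaled to the range -1.0 to 1.0, as delivered by the default float32 stream.

```python
import numpy as np


def rms_dbfs(samples):
    """Root-mean-square level of float samples in dB relative to full scale (1.0)."""
    samples = np.asarray(samples, dtype=np.float64)
    rms = np.sqrt(np.mean(np.square(samples)))
    # The floor avoids log10(0) when the input is digital silence
    return 20 * np.log10(max(rms, 1e-12))
```

With the script above this could be called as `rms_dbfs(input_frames[FRAME_SIZE:])`, skipping the block of zeros the buffer is initialized with. A clearly higher level while playback is running would suggest the playback signal really reaches the input path.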

saif-mahmud commented 2 years ago

@HaHeho Can you please suggest any workaround regarding this issue?

HaHeho commented 2 years ago

Sorry for the delay.

So this is still very strange and obviously not what is supposed to happen. And it is also not happening for me, although on Windows and of course different hardware.

Are you sure this is not something that is caused by some strange setting in the audio driver?

Does the signal contain the expected components when you record with a different software (built-in recorder or Audacity)?

Do you have the option to test with a different microphone?

malfatti commented 2 years ago

I had a similar problem ~3 years ago. I tried everything to fix it, without success... until I noticed that, under alsamixer controls, there was a Loopback option enabled. After disabling it, I never had this problem again. To double check:

  1. Open a terminal and enter alsamixer;
  2. Press F6 and select your sound card;
  3. Press F5 to show all controls;
  4. If there is a Loopback option, make sure it is disabled.

If, as @HaHeho suggested, the issue happens independently of the software used for recording, then this could be the cause of it.
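
The same check can be scripted, as a rough sketch: `amixer -c <card> scontrols` (from alsa-utils) lists the simple mixer controls of a card, and the output can be filtered for names mentioning loopback. The function names are hypothetical, the card number 0 is an assumption, and control names differ between sound cards.

```python
import re
import subprocess


def loopback_controls(scontrols_output):
    """Return names of simple mixer controls whose name mentions 'loopback'."""
    names = re.findall(r"Simple mixer control '([^']*)'", scontrols_output)
    return [name for name in names if "loopback" in name.lower()]


def find_loopback_controls(card=0):
    """Run `amixer -c <card> scontrols` and filter its output (requires alsa-utils)."""
    result = subprocess.run(
        ["amixer", "-c", str(card), "scontrols"],
        capture_output=True, text=True, check=True,
    )
    return loopback_controls(result.stdout)
```

Any control found this way can then be inspected with `amixer -c 0 sget '<name>'` to see whether it is switched on.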

Cheers :)

saif-mahmud commented 2 years ago

Thanks for the replies, @HaHeho @malfatti.

I have tried it on Windows as well as with the Audacity app, and found the same issue. I might have to try a different headset or microphone to check whether it is a device issue.

In alsamixer, the loopback option is disabled and I'm still facing the same issue. I have tried resetting alsamixer to its default settings and then toggling the loopback option; it didn't help.

KiriteeKarunya commented 1 year ago

@saif-mahmud Were you able to solve this issue? I am facing something similar. Please let us know what worked.