Closed: liuuuu-pro closed this issue 8 months ago.

Hello, this is a powerful library, but I am currently facing a problem. I want to record both microphone audio and speaker audio at the same time, with the speaker audio captured through loopback mode. How can I properly mix the microphone audio and the speaker audio together? Can I use XtServiceAggregateStream?
I am currently recording audio from the two devices separately and then mixing with ffmpeg. However, I want to obtain a real-time mixed audio stream. How can I achieve this?
I have looked at the Aggregate example, but it routes the microphone's input to the speaker's output, which is not what I need: I need to record the audio from both the microphone and the speaker.

Hey!
You should be able to use an aggregate stream for this. Unfortunately I cannot test this at the moment because my current setup doesn't have the required audio inputs. The best I can do is try to aggregate headset input with a WASAPI loopback stream on the speakers, but there is no single sample rate supported by both devices, and neither xt-audio nor WASAPI does sample rate conversion, so I'm out of luck. The C# example below should get you started, but since I cannot test it, it probably contains bugs. On a side note: you should never do I/O in the audio callback. No file I/O, no websockets, and so on.
using System;
using System.IO;
using System.Threading;
using Xt;

namespace ConsoleApp18
{
    internal class Program
    {
        [STAThread]
        static void Main(string[] args)
        {
            using var audio = XtAudio.Init(nameof(Program), IntPtr.Zero);
            var service = audio.GetService(XtSystem.WASAPI);
            using var list = service.OpenDeviceList(XtEnumFlags.Input);
            for (int i = 0; i < list.GetCount(); i++)
                Console.WriteLine(list.GetName(list.GetId(i)));

            // These are the indices for my system:
            int microphoneInIndex = 0; // Headset Microphone (Jabra EVOLVE 20 MS) (Shared)
            int speakerLoopbackIndex = 5; // Speakers / Headphones (Realtek Audio) (Loopback)
            using var microphoneInDevice = service.OpenDevice(list.GetId(microphoneInIndex));
            using var speakerLoopbackDevice = service.OpenDevice(list.GetId(speakerLoopbackIndex));

            var mix = new XtMix();
            // My mic only supports 16 kHz, but the loopback device
            // does NOT support 16 kHz, so I cannot test.
            mix.rate = 48000;
            mix.sample = XtSample.Float32;

            var deviceParams = new XtAggregateDeviceParams[2];
            deviceParams[0] = new XtAggregateDeviceParams();
            deviceParams[0].bufferSize = 20;
            deviceParams[0].channels.inputs = 2;
            deviceParams[0].device = microphoneInDevice;
            deviceParams[1] = new XtAggregateDeviceParams();
            deviceParams[1].bufferSize = 20;
            deviceParams[1].channels.inputs = 2;
            deviceParams[1].device = speakerLoopbackDevice;

            var streamParams = new XtStreamParams(false, OnBuffer, null, null);
            var aggregateParams = new XtAggregateStreamParams(in streamParams, deviceParams, 2, in mix, microphoneInDevice);
            using var stream = service.AggregateStream(in aggregateParams, null);
            using var safe = XtSafeBuffer.Register(stream);

            stream.Start();
            for (int i = 0; i < 10; i++)
            {
                Console.WriteLine(i + 1);
                Thread.Sleep(1000);
            }
            stream.Stop();

            // 2 = stereo. byteBuffer is zero-initialized, so any unrecorded tail is silence.
            var byteBuffer = new byte[2 * 48000 * 10 * sizeof(float)];
            Buffer.BlockCopy(_mixdownBuffer, 0, byteBuffer, 0, _mixdownPosition * 2 * sizeof(float));
            File.WriteAllBytes("C:\\temp\\mixdown.raw", byteBuffer);
        }

        // 10 seconds of interleaved stereo at 48 kHz, filled by the audio callback.
        static int _mixdownPosition = 0;
        static float[] _mixdownBuffer = new float[2 * 48000 * 10];

        static int OnBuffer(XtStream stream, in XtBuffer buffer, object user)
        {
            var safe = XtSafeBuffer.Get(stream);
            safe.Lock(in buffer);
            var input = (float[][])safe.GetInput();
            // _mixdownBuffer.Length / 2 because stereo.
            for (int f = 0; f < buffer.frames && _mixdownPosition < _mixdownBuffer.Length / 2; f++)
            {
                // Non-interleaved: channels 0/1 are the microphone, 2/3 the loopback.
                float microphoneLeft = input[0][f];
                float microphoneRight = input[1][f];
                float speakerLoopbackLeft = input[2][f];
                float speakerLoopbackRight = input[3][f];
                // Interleaved mixdown: average the two devices per channel.
                _mixdownBuffer[_mixdownPosition * 2 + 0] = (microphoneLeft + speakerLoopbackLeft) / 2;
                _mixdownBuffer[_mixdownPosition * 2 + 1] = (microphoneRight + speakerLoopbackRight) / 2;
                _mixdownPosition++;
            }
            safe.Unlock(in buffer);
            return 0;
        }
    }
}
@sjoerdvankreel Thank you very much for your reply.
However, I have no experience in audio development, so the OnBuffer part is particularly confusing for me; I cannot picture the layout of the data it receives.
Secondly, what I ultimately want to achieve is to merge any number of audio devices into one stream for recording, so scalability is also a challenge for me.
If you have time, could you please provide another example of aggregating and recording from multiple devices? I would be immensely grateful.
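For illustration, the per-frame mixing from the C# example above generalizes to any number of stereo devices: with a non-interleaved stream, device d owns input channels 2*d and 2*d+1, so mixing is just a sum over devices. A minimal, untested Java sketch assuming Float32 samples; DEVICE_COUNT is a placeholder, not part of the xt-audio API:

import xt.audio.Structs;
import xt.audio.XtSafeBuffer;
import xt.audio.XtStream;

class MixdownSketch {
    // Placeholder: how many stereo devices were passed to the aggregate stream.
    static final int DEVICE_COUNT = 3;

    static int onBuffer(XtStream stream, Structs.XtBuffer buffer, Object user) throws Exception {
        XtSafeBuffer safe = XtSafeBuffer.get(stream);
        safe.lock(buffer);
        // Non-interleaved Float32: one float[] per channel, 2 channels per device.
        float[][] input = (float[][]) safe.getInput();
        for (int f = 0; f < buffer.frames; f++) {
            float left = 0.0f;
            float right = 0.0f;
            for (int d = 0; d < DEVICE_COUNT; d++) {
                left += input[2 * d][f];
                right += input[2 * d + 1][f];
            }
            // Divide by the device count so the summed signal cannot clip.
            left /= DEVICE_COUNT;
            right /= DEVICE_COUNT;
            // Store (left, right) into a preallocated buffer here; no I/O in the callback.
        }
        safe.unlock(buffer);
        return 0;
    }
}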
@sjoerdvankreel There has been new progress.
Following your ideas, I have finally managed to mix and record the audio from both the microphone and the speakers in my own way.
I know you warned against performing file I/O operations within onBuffer, but this is a good start. Thank you once again.
private static int onBuffer(XtStream stream, Structs.XtBuffer buffer, Object user) throws Exception {
    XtSafeBuffer safe = XtSafeBuffer.get(stream);
    safe.lock(buffer);
    // 2 channels * 2 bytes per 16-bit sample.
    byte[] mixedData = new byte[buffer.frames * 2 * 2];
    short[][] input = (short[][]) safe.getInput();
    for (int f = 0; f < buffer.frames; f++) {
        // Non-interleaved: channels 0/1 are the microphone, 2/3 the loopback.
        short microphoneLeft = input[0][f];
        short microphoneRight = input[1][f];
        short speakerLoopbackLeft = input[2][f];
        short speakerLoopbackRight = input[3][f];
        // Average per channel and clamp to the 16-bit range.
        short mixedLeft = (short) Math.min(Math.max((microphoneLeft + speakerLoopbackLeft) / 2, Short.MIN_VALUE), Short.MAX_VALUE);
        short mixedRight = (short) Math.min(Math.max((microphoneRight + speakerLoopbackRight) / 2, Short.MIN_VALUE), Short.MAX_VALUE);
        // Little-endian interleaved 16-bit output.
        mixedData[4 * f] = (byte) (mixedLeft & 0xFF);
        mixedData[4 * f + 1] = (byte) ((mixedLeft >> 8) & 0xFF);
        mixedData[4 * f + 2] = (byte) (mixedRight & 0xFF);
        mixedData[4 * f + 3] = (byte) ((mixedRight >> 8) & 0xFF);
    }
    // fileout is a FileOutputStream opened before the stream starts (not shown).
    fileout.write(mixedData);
    safe.unlock(buffer);
    return 0;
}
Looks good! Although besides no I/O in the callback, you also shouldn't allocate memory. So no "byte[] mixedData = new byte[buffer.frames * 2 * 2];". Realtime audio is kind of a pain when it comes to that stuff.

The sort-of standard solution to these problems is a lock-free circular buffer with (in your case) the audio thread as the writer, plus a low-priority background thread that acts as the reader and dumps data from the circular buffer to file. If you don't do it this way (i.e. you keep doing I/O in the callback) you are pretty much guaranteed to glitch the audio (introduce pops/clicks) at some point. There's a very good article on the subject over here: http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing.
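To make the no-allocation point concrete, here is an untested sketch of the same callback with the chunk preallocated once before the stream starts, sized from the stream's frame count (assuming the Java binding exposes getFrames() like the C# GetFrames()):

import xt.audio.Structs;
import xt.audio.XtSafeBuffer;
import xt.audio.XtStream;

class NoAllocSketch {
    // Allocated once, before the stream starts; reused by every callback.
    static byte[] mixedData;

    static void allocateOnce(XtStream stream) {
        // frames * 2 channels * 2 bytes per 16-bit sample.
        mixedData = new byte[stream.getFrames() * 2 * 2];
    }

    static int onBuffer(XtStream stream, Structs.XtBuffer buffer, Object user) throws Exception {
        XtSafeBuffer safe = XtSafeBuffer.get(stream);
        safe.lock(buffer);
        short[][] input = (short[][]) safe.getInput();
        for (int f = 0; f < buffer.frames; f++) {
            // Averaging two 16-bit samples cannot overflow a short, so no clamp is needed.
            int left = (input[0][f] + input[2][f]) / 2;
            int right = (input[1][f] + input[3][f]) / 2;
            mixedData[4 * f] = (byte) (left & 0xFF);
            mixedData[4 * f + 1] = (byte) ((left >> 8) & 0xFF);
            mixedData[4 * f + 2] = (byte) (right & 0xFF);
            mixedData[4 * f + 3] = (byte) ((right >> 8) & 0xFF);
        }
        // Hand the chunk to a lock-free ring here instead of fileout.write(...).
        safe.unlock(buffer);
        return 0;
    }
}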
@sjoerdvankreel I now understand why you kept emphasizing not to perform I/O operations within callbacks!
My new approach is to put mixedData directly into a LinkedBlockingQueue and handle the queued data on a separate thread, keeping complex business logic out of the callback.
Thank you for providing this article; it has been of great help to me.
That's better than doing I/O, but unfortunately still not good enough, because the LinkedBlockingQueue, you know.. blocks! What you really want is something lock-free. My Java isn't all that great these days, but I think you'd be better off with something like ConcurrentLinkedQueue (https://www.baeldung.com/java-queue-linkedblocking-concurrentlinked) or this: https://github.com/asgeirn/circular-buffer. Whatever you choose, it is also very important that the data structure you go with is bounded (i.e. under no circumstances does it allocate memory after construction). See here: https://stackoverflow.com/questions/10130847/java-bounded-non-blocking-buffer-for-high-concurrent-situation.
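As a sketch of what such a bounded structure might look like (illustrative only, not xt-audio API, and simplified: the capacity must be a power of two): a single-producer/single-consumer byte ring where the audio callback writes and a low-priority thread drains to file. All memory is allocated up front, and write() never blocks or allocates.

import java.io.FileOutputStream;
import java.util.concurrent.atomic.AtomicLong;

// Fixed-capacity single-producer/single-consumer byte ring.
final class SpscByteRing {
    private final byte[] data;
    private final int mask;
    private final AtomicLong writePos = new AtomicLong();
    private final AtomicLong readPos = new AtomicLong();

    SpscByteRing(int capacityPowerOfTwo) {
        data = new byte[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Audio thread only. Returns false (chunk dropped) when the ring is full.
    boolean write(byte[] src, int length) {
        long w = writePos.get();
        if (w + length - readPos.get() > data.length) return false;
        for (int i = 0; i < length; i++)
            data[(int) ((w + i) & mask)] = src[i];
        writePos.set(w + length); // volatile write publishes the bytes to the reader
        return true;
    }

    // Background thread only. Copies whatever is available into dst.
    int read(byte[] dst) {
        long r = readPos.get();
        int n = (int) Math.min(writePos.get() - r, dst.length);
        for (int i = 0; i < n; i++)
            dst[i] = data[(int) ((r + i) & mask)];
        readPos.set(r + n);
        return n;
    }
}

class Recorder {
    static final SpscByteRing RING = new SpscByteRing(1 << 20); // 1 MiB, fixed

    // The audio callback calls RING.write(mixedData, mixedData.length) instead of file I/O.
    static Thread startDrainThread(FileOutputStream out) {
        Thread t = new Thread(() -> {
            byte[] chunk = new byte[1 << 16];
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    int n = RING.read(chunk);
                    if (n > 0) out.write(chunk, 0, n);
                    else Thread.sleep(1); // nothing buffered yet
                }
            } catch (Exception e) {
                // shutting down
            }
        });
        t.setPriority(Thread.MIN_PRIORITY);
        t.setDaemon(true);
        t.start();
        return t;
    }
}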
Also, I forgot to mention: stream aggregation may eventually introduce glitches of its own. I am not sure if this will be an issue in practice, but since you mentioned "what I ultimately want to achieve is to merge any number of audio devices", I think I ought to mention it. Imagine you aggregate 2 audio devices which are physically different: not "analog in" and "digital in" on the same soundcard, but "analog in" on card A and "analog in" on card B.
These devices run off different hardware clocks. In a perfect world they are in sync, but what will probably happen is this: you set the sampling rate to 48000, but since no hardware is perfect, on card A that is actually 47999.999 and on card B it is actually 48000.001. So they drift out of sync. When that happens, xt-audio will start dropping extra samples or zero-padding missing samples. There's not much you can do about this. Just know that xt-audio has the notion of a "master device" for an aggregate stream, and its samples will never be padded or dropped; only secondary/tertiary/etc. streams show this behaviour. You can monitor whether it happens using the XtOnXRun callback, but there's really no way to prevent it.
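Monitoring might look like this in Java; an untested sketch, assuming the Java binding mirrors the C# XtOnXRun signature (stream, index of the device that under/overran, user data):

import java.util.concurrent.atomic.AtomicInteger;
import xt.audio.XtStream;

class XRunMonitor {
    // Bumped from the audio context; read and logged from any other thread.
    static final AtomicInteger XRUNS = new AtomicInteger();

    static void onXRun(XtStream stream, int index, Object user) {
        // No logging or allocation here; just count.
        XRUNS.incrementAndGet();
    }

    // Wired up where the stream params are built, alongside onBuffer:
    // new Structs.XtStreamParams(false, Program::onBuffer, XRunMonitor::onXRun, null);
}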
Anyway, I'm really curious to see where you are going with this. I have never really stressed the aggregate stream implementation like this, so I wonder how well it fares :) Let me know how it works out!
@sjoerdvankreel Thank you for your patient response; you always manage to pinpoint the crux of the problem.
Regarding LinkedBlockingQueue versus ConcurrentLinkedQueue, I will do some thorough research to determine which one suits my scenario better.
As for aggregating audio streams, the sampling rate issue you mention could indeed be a risk for me, but given my lack of experience with real-time audio applications I cannot yet judge it.
So far, xt-audio has solved many issues for me. Give me some time to experiment and I will report back with feedback. Thank you once again.
Closing for now. Feel free to reopen.