resemble-ai / Resemblyzer

A Python package to analyze and compare voices with deep learning
Apache License 2.0

How to get embeddings of audio data streaming from microphone. #56

Open · gaushh opened this issue 3 years ago

gaushh commented 3 years ago

I am using Resemblyzer to create embeddings for speaker diarization. It works fine when a whole wav file is loaded into Resemblyzer. Now I want to try real-time speaker diarization on audio streamed from the microphone with pyaudio, in the form of chunks. A chunk is essentially a frame of fixed size (100 ms in my case). How do I get a separate embedding for each chunk using Resemblyzer?

CorentinJ commented 3 years ago

The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding. If you have that already, that's great.

Take a look at embed_utterance(). Partial embeddings are created by forwarding chunks of the mel spectrogram of the audio. These chunks are extracted from the audio at specific locations predetermined by compute_partial_slices. You can copy the code in embed_utterance() and call compute_partial_slices with a very large number to know where to split chunks in your streaming audio. Forward a chunk to get a single partial embedding.
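
For illustration, a minimal sketch of that streaming variant (my own adaptation of embed_utterance(), not code from the repo; it assumes chunks arrive as 16 kHz float32 audio that has already been through preprocess_wav, and uses the library defaults rate=1.3 and min_coverage=0.75):

```python
import numpy as np
import torch
from resemblyzer import VoiceEncoder
from resemblyzer.audio import wav_to_mel_spectrogram

encoder = VoiceEncoder("cpu")

# Precompute slice positions once for an effectively unbounded stream;
# rate is the number of partial embeddings per second of audio.
wav_slices, mel_slices = VoiceEncoder.compute_partial_slices(
    n_samples=10 ** 9, rate=1.3, min_coverage=0.75
)

stream_buffer = np.zeros(0, dtype=np.float32)  # 16 kHz preprocessed audio
next_partial = 0  # index of the next partial slice to embed

def on_chunk(chunk):
    """Feed one new float32 chunk; return any newly completed partial embeddings."""
    global stream_buffer, next_partial
    stream_buffer = np.concatenate((stream_buffer, chunk))
    # A real implementation would extend the mel spectrogram incrementally
    # rather than recomputing it from the whole buffer on every chunk.
    mel = wav_to_mel_spectrogram(stream_buffer)
    partials = []
    while (next_partial < len(mel_slices)
           and mel_slices[next_partial].stop <= len(mel)):
        frames = mel[mel_slices[next_partial]]
        with torch.no_grad():
            batch = torch.from_numpy(frames[None, ...]).to(encoder.device)
            partials.append(encoder(batch)[0].cpu().numpy())
        next_partial += 1
    return partials  # one 256-dimensional d-vector per partial utterance
```

Averaging the collected partials and L2-normalizing the result, as embed_utterance() does, gives an embedding of everything heard so far.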

gaushh commented 3 years ago

The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding.

To do that, I'm using the code provided by Google for streaming speech recognition on an audio stream.

I am getting embeddings, but I believe I'm doing something wrong, since the clustering algorithm produces a single class (cluster) when I try to perform speaker diarization on the extracted embeddings.

Here's what my code looks like:

```python
import wave

import numpy as np
import pyaudio
from six.moves import queue

from resemblyzer import preprocess_wav, VoiceEncoder

from links_clustering.links_cluster import LinksCluster

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100 ms of audio per pyaudio buffer
FORMAT = pyaudio.paInt16
CHANNELS = 1

encoder = VoiceEncoder("cpu")
links_cluster = LinksCluster(0.5, 0.5, 0.5)


class MicrophoneStream(object):
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk

        # Create a thread-safe buffer of audio data
        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=FORMAT,
            # Mono audio: Resemblyzer expects a single channel
            channels=CHANNELS,
            rate=self._rate,
            input=True,
            frames_per_buffer=self._chunk,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread does other work.
            stream_callback=self._fill_buffer,
        )

        self.closed = False

        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # consuming loop will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Continuously collect data from the audio stream, into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]
            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break
            yield b"".join(data)


def main():
    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        for content in audio_generator:
            # The stream delivers 16-bit PCM bytes: decode as int16 and
            # scale to float32 in [-1, 1] before handing it to Resemblyzer.
            numpy_array = np.frombuffer(content, dtype=np.int16)
            numpy_array = numpy_array.astype(np.float32) / 32768.0
            wav = preprocess_wav(numpy_array)
            _, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=16)
            predicted_cluster = links_cluster.predict(cont_embeds[0])
            print("predicted_cluster :", predicted_cluster)
            print("------------")


def write_frame(file_name, data):
    """Debugging helper: dump raw int16 frames to a wav file."""
    p = pyaudio.PyAudio()
    wf = wave.open(file_name, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(data)
    wf.close()
    p.terminate()


if __name__ == "__main__":
    main()
```

milind-soni commented 2 years ago

How do you avoid losing information when you split a file into chunks?
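
For what it's worth, the slices returned by compute_partial_slices (mentioned above) already overlap, so audio at a split point is still covered by neighbouring partials. A quick way to see this, assuming the library defaults used earlier in the thread:

```python
from resemblyzer import VoiceEncoder

# At rate=1.3 partials per second, each 1.6 s window starts ~0.77 s after
# the previous one, so consecutive windows overlap by roughly 0.8 s.
wav_slices, _ = VoiceEncoder.compute_partial_slices(
    n_samples=16000 * 10, rate=1.3, min_coverage=0.75
)
for s in wav_slices[:4]:
    print("%.2f s -> %.2f s" % (s.start / 16000, s.stop / 16000))
```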

MichaelScofield123 commented 2 years ago

I am using Resemblyzer to create embeddings for speaker diarization. It works fine when a whole wav file is loaded into Resemblyzer. Now I want to try real-time speaker diarization on audio streamed from the microphone with pyaudio, in the form of chunks. A chunk is essentially a frame of fixed size (100 ms in my case). How do I get a separate embedding for each chunk using Resemblyzer?

I am also trying to implement this. Have you implemented it, or do you have any good suggestions? My email is 1242689497@qq.com; I hope to hear from you. Thanks.