resemble-ai / Resemblyzer

A python package to analyze and compare voices with deep learning
Apache License 2.0

How to get embeddings of audio data streaming from microphone. #56

Open gaushh opened 3 years ago

gaushh commented 3 years ago

I am using Resemblyzer to create embeddings for speaker diarization. It works fine when a whole wave file is loaded into Resemblyzer. Now I want to try real-time speaker diarization on audio streamed from a microphone with pyaudio (in the form of chunks). A chunk is essentially a frame of fixed size (100 ms in my case). How do I get a separate embedding for each chunk using Resemblyzer?

CorentinJ commented 3 years ago

The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding. If you have that already, that's great.
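
A minimal sketch of that gathering step, assuming the audio has already been decoded into a list of float samples at 16 kHz. `WindowTrigger`, `window_samples` and `on_window` are hypothetical names for illustration, not part of Resemblyzer or pyaudio. The 25,600-sample window assumes a partial spans 160 mel frames of 10 ms each, which should be checked against the library's hyperparameters.

```python
from typing import Callable, List


class WindowTrigger:
    """Collects small chunks and fires a callback once a full window is buffered.

    Hypothetical helper, not part of Resemblyzer or pyaudio.
    """

    def __init__(self, window_samples: int, on_window: Callable[[List[float]], None]):
        self.window_samples = window_samples
        self.on_window = on_window
        self._buffer: List[float] = []

    def push(self, chunk: List[float]) -> None:
        """Append one chunk and call on_window for every complete window available."""
        self._buffer.extend(chunk)
        while len(self._buffer) >= self.window_samples:
            window = self._buffer[:self.window_samples]
            # Drop the consumed samples; leftovers wait for the next chunk.
            del self._buffer[:self.window_samples]
            self.on_window(window)


# Usage: 100 ms chunks (1600 samples at 16 kHz) gathered into 1.6 s windows.
windows = []
trigger = WindowTrigger(window_samples=25600, on_window=windows.append)
for _ in range(40):  # 40 chunks = 4 s of audio
    trigger.push([0.0] * 1600)
print(len(windows))  # number of complete windows gathered so far
```

In a real stream, `on_window` would be the place to compute an embedding; here it only stores the window so the buffering logic can be checked on its own.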

Take a look at embed_utterance(). Partial embeddings are created by forwarding chunks of the mel spectrogram of the audio. These chunks are extracted from the audio at specific locations predetermined by compute_partial_slices. You can copy the code in embed_utterance() and call compute_partial_slices with a very large number to know where to split chunks in your streaming audio. Forward a chunk to get a single partial frame.
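
The arithmetic behind those predetermined locations can be sketched roughly as follows. This is a simplified re-derivation for illustration, not the library's code: the three constants are assumed defaults (16 kHz audio, 10 ms mel hop, 160 frames per partial) and `partial_sample_ranges` is a hypothetical name. The real `compute_partial_slices` also handles a trailing partial through padding and a coverage check, which this sketch leaves out.

```python
SAMPLING_RATE = 16000     # Hz (assumed default)
MEL_WINDOW_STEP_MS = 10   # hop between mel frames in ms (assumed default)
PARTIALS_N_FRAMES = 160   # mel frames per partial, i.e. 1.6 s (assumed default)


def partial_sample_ranges(n_samples: int, rate: float):
    """Return (start, stop) sample ranges of the partials that fit fully in n_samples.

    `rate` is the number of partials per second, as in embed_utterance(rate=...).
    """
    samples_per_frame = SAMPLING_RATE * MEL_WINDOW_STEP_MS // 1000
    # Number of mel frames separating the starts of two consecutive partials.
    frame_step = round((SAMPLING_RATE / rate) / samples_per_frame)
    if not 0 < frame_step <= PARTIALS_N_FRAMES:
        raise ValueError("rate is outside the usable range")
    window = PARTIALS_N_FRAMES * samples_per_frame
    hop = frame_step * samples_per_frame
    return [(start, start + window)
            for start in range(0, n_samples - window + 1, hop)]


# Three seconds of audio at two partials per second.
ranges = partial_sample_ranges(n_samples=3 * SAMPLING_RATE, rate=2)
print(ranges)
```

For a stream, the same hop and window sizes tell you how many new samples must arrive before the next partial can be forwarded.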

gaushh commented 3 years ago

> The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding.

To do that, I'm using the code Google provides for streaming speech recognition on an audio stream.

I am getting embeddings, but I believe I'm doing something wrong: when I run speaker diarization on the extracted embeddings, the clustering algorithm puts everything into a single cluster.

Here's what my code looks like:

        import numpy as np
        import pyaudio
        from six.moves import queue

        from resemblyzer import preprocess_wav, VoiceEncoder
        from pathlib import Path

        from links_clustering.links_cluster import LinksCluster

        # Audio recording parameters
        RATE = 16000
        CHUNK = int(RATE)  # 1 s of samples per buffer (int(RATE / 10) would be 100 ms)

        encoder = VoiceEncoder("cpu")
        links_cluster = LinksCluster(0.5, 0.5, 0.5)

        class MicrophoneStream(object):
            """Opens a recording stream as a generator yielding the audio chunks."""

            def __init__(self, rate, chunk):
                self._rate = rate
                self._chunk = chunk

                # Create a thread-safe buffer of audio data
                self._buff = queue.Queue()
                self.closed = True

            def __enter__(self):
                self._audio_interface = pyaudio.PyAudio()
                self._audio_stream = self._audio_interface.open(
                    # Assumption: the open() arguments are missing from the post.
                    # paFloat32 matches the float32 decoding in main(); Google's
                    # sample uses paInt16 instead.
                    format=pyaudio.paFloat32,
                    # The API currently only supports 1-channel (mono) audio
                    channels=1,
                    rate=self._rate,
                    input=True,
                    frames_per_buffer=self._chunk,
                    # Run the audio stream asynchronously to fill the buffer object.
                    # This is necessary so that the input device's buffer doesn't
                    # overflow while the calling thread makes network requests, etc.
                    stream_callback=self._fill_buffer,
                )

                self.closed = False

                return self

            def __exit__(self, type, value, traceback):
                self._audio_stream.stop_stream()
                self._audio_stream.close()
                self.closed = True
                # Signal the generator to terminate so that the client's
                # streaming_recognize method will not block the process termination.
                self._buff.put(None)
                self._audio_interface.terminate()

            def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
                """Continuously collect data from the audio stream, into the buffer."""
                self._buff.put(in_data)
                return None, pyaudio.paContinue

            def generator(self):
                while not self.closed:
                    # Use a blocking get() to ensure there's at least one chunk of
                    # data, and stop iteration if the chunk is None, indicating the
                    # end of the audio stream.
                    chunk = self._buff.get()
                    if chunk is None:
                        return
                    data = [chunk]
                    # Now consume whatever other data's still buffered.
                    while True:
                        try:
                            chunk = self._buff.get(block=False)
                            if chunk is None:
                                return
                            data.append(chunk)
                        except queue.Empty:
                            break
                    yield b"".join(data)

        def main():

            with MicrophoneStream(RATE, CHUNK) as stream:
                audio_generator = stream.generator()
                for content in audio_generator:
                    # Reinterpret the raw bytes of this buffer as float32 samples.
                    numpy_array = np.frombuffer(content, dtype=np.float32)
                    wav = preprocess_wav(numpy_array)
                    _, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=16)
                    # Only the first partial embedding of each buffer is clustered.
                    predicted_cluster = links_cluster.predict(cont_embeds[0])
                    print("predicted_cluster :", predicted_cluster)

        def write_frame(file_name, data):
            # Unused helper. Only the 'wb' mode survives from the original post,
            # so writing the raw bytes as-is is an assumption.
            with open(file_name, 'wb') as wf:
                wf.write(data)

        if __name__ == "__main__":
            main()

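
One thing that may be worth checking in a setup like this is the sample format. If the stream delivers 16-bit integers (`pyaudio.paInt16`, which is what Google's streaming sample uses) rather than 32-bit floats, the bytes need converting before they go to `preprocess_wav`, because reinterpreting them as float32 yields meaningless values. Whether that applies here is an assumption. A sketch of the conversion using only the standard library, with `pcm16_to_float` as a hypothetical helper name:

```python
import array


def pcm16_to_float(raw: bytes) -> list:
    """Decode native-endian signed 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    samples = array.array("h")  # "h" = signed 16-bit integers
    samples.frombytes(raw)      # raises ValueError if len(raw) is odd
    return [s / 32768.0 for s in samples]


# Three example samples: silence, half scale, and the most negative value.
raw = array.array("h", [0, 16384, -32768]).tobytes()
print(pcm16_to_float(raw))  # [0.0, 0.5, -1.0]
```

With NumPy the equivalent would be `np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768`.
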
milind-soni commented 2 years ago

How do you avoid losing information when you split a file into chunks?
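
One common approach, offered here as a sketch rather than as what Resemblyzer does internally, is to let consecutive chunks overlap and to pad the tail instead of dropping it, so that every sample lands in at least one chunk. `split_with_overlap` and its parameters are hypothetical names for illustration.

```python
def split_with_overlap(samples, window, hop, pad_value=0.0):
    """Split samples into equal-length windows that overlap by window - hop.

    The final window is padded with pad_value so the last samples are kept.
    """
    if not 0 < hop <= window:
        raise ValueError("hop must be in (0, window]")
    chunks = []
    start = 0
    while True:
        chunk = list(samples[start:start + window])
        chunk += [pad_value] * (window - len(chunk))  # pads only a short final chunk
        chunks.append(chunk)
        if start + window >= len(samples):
            break
        start += hop
    return chunks


# Nine samples, windows of four, each starting two samples after the previous one.
chunks = split_with_overlap(list(range(9)), window=4, hop=2)
print(chunks)
```

A smaller `hop` gives more overlap and more chunks to process, so the choice is a trade-off between coverage at chunk boundaries and compute.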

MichaelScofield123 commented 2 years ago

> I am using resemblyzer to create embeddings for speaker diarization. It works fine when a whole wave file is loaded into the resemblyzer. Now I want to try out real-time speaker diarization using data streaming from microphone using pyaudio (in form of chunks). A chunk is essentially a frame of fixed size (100 ms in my case). How do I get separate embedding for each chunk using resemblyzer?

I am also trying to implement this. Have you managed to get it working, or do you have any good suggestions? I hope to hear from you. Thanks.