How to avoid losing information when you split a file into chunks
Open gaushh opened 3 years ago
I am using resemblyzer to create embeddings for speaker diarization. It works fine when a whole wave file is loaded into resemblyzer. Now I want to try real-time speaker diarization on audio streamed from a microphone with pyaudio (in the form of chunks). A chunk is essentially a frame of fixed size (100 ms in my case). How do I get a separate embedding for each chunk using resemblyzer?
Take a look at embed_utterance(). Partial embeddings are created by forwarding chunks of the mel spectrogram of the audio. These chunks are extracted from the audio at specific locations predetermined by compute_partial_slices(). You can copy the code in embed_utterance() and call compute_partial_slices() with a very large number of samples to know where to split the chunks in your streaming audio. Forward one such chunk through the network to get a single partial embedding.

The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding. If you have that already, that's great.
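In code, that amounts to something like the sketch below. It assumes compute_partial_slices is the static method exposed by the current resemblyzer release (signature n_samples, rate, min_coverage) and that wav_to_mel_spectrogram from resemblyzer.audio is used to mel-ify each chunk; treat it as an outline rather than a drop-in implementation:

```python
import numpy as np
import torch
from resemblyzer import VoiceEncoder
from resemblyzer.audio import wav_to_mel_spectrogram

encoder = VoiceEncoder("cpu")

# Ask for slice boundaries over a (hypothetically) very long signal so we know,
# up front, which samples / mel frames each partial embedding should cover.
# rate is the number of partial embeddings per second of audio.
wav_slices, mel_slices = VoiceEncoder.compute_partial_slices(
    n_samples=10 ** 9, rate=16, min_coverage=0.75)


def embed_chunk(wav_chunk):
    """Embed one chunk of 16 kHz float32 audio as a single partial embedding."""
    mel = wav_to_mel_spectrogram(wav_chunk)       # shape (n_frames, 40)
    with torch.no_grad():
        mels = torch.from_numpy(mel[np.newaxis]).to(encoder.device)
        return encoder(mels)[0].cpu().numpy()     # shape (256,)
```

Each entry of wav_slices then tells you which samples of the stream the corresponding partial embedding should be computed from.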
> The difficult part of the implementation is to get a reliable system for receiving these chunks and for triggering a function call when enough chunks are gathered to compute an embedding.

To do that, I'm using the code Google provides for streaming speech recognition on an audio stream.
I am getting embeddings, but I believe I'm doing something wrong, since the clustering algorithm produces a single class (cluster) when I try to perform speaker diarization on the extracted embeddings.
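A quick sanity check that is independent of the clustering: resemblyzer's embeddings are L2-normalized, so the dot product of two embeddings is their cosine similarity, and printing it for consecutive chunks shows whether the embeddings themselves discriminate between speakers. A minimal sketch:

```python
import numpy as np

def similarity_to_previous(prev_embed, embed):
    """Cosine similarity between consecutive speaker embeddings.
    Resemblyzer embeddings are L2-normalized, so a plain dot product suffices."""
    return float(np.dot(prev_embed, embed)) if prev_embed is not None else float("nan")
```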
Here's what my code looks like:
```python
import wave

import numpy as np
import pyaudio
from six.moves import queue
from resemblyzer import preprocess_wav, VoiceEncoder
from pathlib import Path
from links_clustering.links_cluster import LinksCluster

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE)  # 1 second of samples per buffer
FORMAT = pyaudio.paInt16
CHANNELS = 1

encoder = VoiceEncoder("cpu")
links_cluster = LinksCluster(0.5, 0.5, 0.5)


class MicrophoneStream(object):
    """Opens a recording stream as a generator yielding the audio chunks."""

    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk

        # Create a thread-safe buffer of audio data
        self._buff = queue.Queue()
        self.closed = True

    def __enter__(self):
        self._audio_interface = pyaudio.PyAudio()
        self._audio_stream = self._audio_interface.open(
            format=FORMAT,
            # The API currently only supports 1-channel (mono) audio
            # https://goo.gl/z757pE
            channels=CHANNELS,
            rate=self._rate,
            input=True,
            frames_per_buffer=self._chunk,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
        )
        self.closed = False
        return self

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop_stream()
        self._audio_stream.close()
        self.closed = True
        # Signal the generator to terminate so that the consuming loop
        # will not block the process termination.
        self._buff.put(None)
        self._audio_interface.terminate()

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Continuously collect data from the audio stream, into the buffer."""
        # print("len(in_data)", len(in_data))
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            # Now consume whatever other data is still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b"".join(data)


def main():
    with MicrophoneStream(RATE, CHUNK) as stream:
        audio_generator = stream.generator()
        for content in audio_generator:
            # paInt16 delivers 16-bit integer samples; convert to float32 in
            # [-1, 1] before handing the waveform to preprocess_wav.
            numpy_array = np.frombuffer(content, dtype=np.int16).astype(np.float32) / 32768.0
            wav = preprocess_wav(numpy_array)
            _, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=16)
            predicted_cluster = links_cluster.predict(cont_embeds[0])
            print("predicted_cluster :", predicted_cluster)
            print("------------")


def write_frame(file_name, data):
    wf = wave.open(file_name, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(pyaudio.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(data)
    wf.close()
    return


if __name__ == "__main__":
    main()
```
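For what it's worth, each loop iteration above embeds whatever happened to be queued since the last read, so the amount of audio behind each embedding varies. A variation that keeps a rolling buffer and only embeds once a full 1.6 s window is available (1.6 s being resemblyzer's partial utterance length) might look like the sketch below; it reuses the objects defined above and the same links_cluster.predict() call, and is only an outline:

```python
def diarize_stream():
    """Sketch: accumulate samples and emit one embedding per 1.6 s window."""
    window = int(1.6 * RATE)                     # resemblyzer's partial window
    buffer = np.zeros(0, dtype=np.float32)
    with MicrophoneStream(RATE, CHUNK) as stream:
        for content in stream.generator():
            samples = np.frombuffer(content, dtype=np.int16).astype(np.float32) / 32768.0
            buffer = np.concatenate([buffer, samples])
            while len(buffer) >= window:
                wav = preprocess_wav(buffer[:window])
                if len(wav) > 0:                 # preprocess_wav may trim silence away
                    embed = encoder.embed_utterance(wav)
                    print("predicted_cluster :", links_cluster.predict(embed))
                buffer = buffer[window:]
```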
I am also trying to implement this. Have you implemented it, or do you have any good suggestions? My email is 1242689497@qq.com, hope to hear from you. Thanks.