resemble-ai / Resemblyzer

A python package to analyze and compare voices with deep learning
Apache License 2.0
2.67k stars, 419 forks

Compute embeddings from stream & unsupervised diarization #10

Closed shashankpr closed 4 years ago

shashankpr commented 4 years ago

Hi, great work and a great repo. Your code and examples helped me understand the flow very easily. I am currently working on a speaker identification task in which I want to detect "who spoke when" with low latency. There are two problems I need to solve, and I was wondering whether you have already worked on them or plan to in the future. If not, I would be glad to contribute to your repo with a PR. The problems are as follows:

  1. How can I use the partial embeddings to identify speaker changes if I do not have pre-defined speaker embeddings (unlike the speaker diarization example that you gave)?
  2. Can the embeddings be computed from a streaming input? Like directly reading wav bytes from microphone and computing them?

I know that both can be done with a few tweaks, but I would like to hear your insight if you have already worked on them or have ideas about them. Thanks!

CorentinJ commented 4 years ago

I've investigated these areas but haven't yet implemented anything for them, even though I am considering it.

  1. You would have to cluster the partial embeddings of the audio (generated at a moderately high rate; I'd use 4). There must be papers out there on how to do this, but you could try some intuitive approaches too. For example, you could use a clustering algorithm that creates n + 1 clusters (where n is the known number of speakers) and hope that it assigns embeddings to the right clusters, keeping the extra cluster as a bin. That way you might be able to separate embeddings of clear speech from a single person from those computed from noise/silence or multiple speakers.

You might also be able to work with similarity. E.g. if you add these lines in demo 2 after having computed the continuous embedding:

import matplotlib.pyplot as plt
plt.imshow(cont_embeds @ cont_embeds.T)
plt.show()

You will get this: image

Clearly you can detect some speakers there by looking for patterns of high similarity: image image
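
The similarity idea above can be sketched roughly as follows. This is only an illustration, not code from the repository: `speaker_change_points`, the `threshold` of 0.75 and the `gap` parameter are hypothetical choices, and the demo uses synthetic vectors in place of `cont_embeds`. It flags positions where the cosine similarity between an embedding and the one `gap` steps earlier drops below the threshold.

```python
import numpy as np

def speaker_change_points(cont_embeds: np.ndarray, threshold: float = 0.75, gap: int = 1) -> list:
    """Return indices whose embedding differs strongly from the one `gap` steps earlier.

    `threshold` and `gap` are illustrative values, not tuned on real data.
    """
    # L2-normalise so that the dot product is the cosine similarity
    normed = cont_embeds / np.linalg.norm(cont_embeds, axis=1, keepdims=True)
    sim = normed @ normed.T
    # The gap-th off-diagonal holds sim[i, i + gap] for every valid i
    lagged = np.diagonal(sim, offset=gap)
    return [int(i) + gap for i in np.flatnonzero(lagged < threshold)]

# Synthetic stand-in for cont_embeds: 5 embeddings of one "speaker", then 5 of another
rng = np.random.default_rng(0)
speaker_a = np.eye(1, 256, 0)  # unit vector along dimension 0
speaker_b = np.eye(1, 256, 1)  # unit vector along dimension 1
embeds = np.vstack([np.repeat(speaker_a, 5, axis=0), np.repeat(speaker_b, 5, axis=0)])
embeds = embeds + 0.01 * rng.normal(size=embeds.shape)

changes = speaker_change_points(embeds)
print(changes)  # [5]
```

With real partial embeddings the windows overlap, so the similarity tends to fall gradually around a speaker change; a larger `gap` or some smoothing would probably be needed.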

  2. This is definitely achievable. The sounddevice module can record audio and stream it in real time to numpy arrays, so you can work with that. You can then decompose the embed_utterance function to achieve your goal. Define a maximum duration for your audio (it can be an order of magnitude higher than necessary, that's not a problem) and compute the wav slices based on that length: https://github.com/resemble-ai/Resemblyzer/blob/master/resemblyzer/voice_encoder.py#L141. From the wav slices, you will know when you can grab a partial wav from the numpy array being streamed to. For this partial wav, create a single spectrogram and forward it (with a batch size of 1), and you will have a partial embedding. Keep doing this while the audio is being recorded.
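
The buffering part of this can be sketched as below. It is a simplified variant that slides a window over a growing buffer instead of precomputing slices for a maximum duration. `PartialWindower` is a hypothetical helper, and the constants are assumptions: a 16 kHz sampling rate and partial windows of 160 mel frames with a 10 ms hop (25600 samples). The chunks are simulated here; the spectrogram and model forward steps are left out.

```python
import numpy as np

SAMPLE_RATE = 16000      # assumption: sampling rate expected by the encoder
PARTIAL_SAMPLES = 25600  # assumption: 160 frames * 10 ms hop * 16 kHz = 1.6 s

class PartialWindower:
    """Buffers streamed audio and hands out fixed-length, overlapping partial wavs."""

    def __init__(self, rate: float = 4.0):
        # `rate` is the number of partial windows per second of audio
        self.step = int(round(SAMPLE_RATE / rate))
        self.buffer = np.zeros(0, dtype=np.float32)

    def push(self, chunk: np.ndarray) -> list:
        """Append a chunk and return every partial wav that is now complete."""
        self.buffer = np.concatenate([self.buffer, chunk.astype(np.float32)])
        ready = []
        while len(self.buffer) >= PARTIAL_SAMPLES:
            ready.append(self.buffer[:PARTIAL_SAMPLES].copy())
            self.buffer = self.buffer[self.step:]  # slide the window forward
        return ready

# Simulate three 1-second chunks arriving from a microphone
signal = np.arange(3 * SAMPLE_RATE, dtype=np.float32)
windower = PartialWindower(rate=4.0)
partials = []
for start in range(0, len(signal), SAMPLE_RATE):
    partials.extend(windower.push(signal[start:start + SAMPLE_RATE]))

print(len(partials))  # 6
```

In a real pipeline, `push` would be called from the recording callback (for example one registered on a `sounddevice.InputStream`), and each returned partial would be converted to a mel spectrogram and forwarded with a batch size of 1, as described above.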

CorentinJ commented 4 years ago

This is a demo I meant to make too, but it's certainly more work than the other 5. Hope we'll get there.

shashankpr commented 4 years ago

Thanks for your detailed explanations.

  1. I agree with you. I have been reading about the spectral clustering method, which has been used in a couple of papers for similar diarization tasks. I will follow your suggestion and try it out.
  2. When you mention a batch size of 1, does it mean that the partial embedding output will have a shape of (number_of_partials, embedding_size)?
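
For reference, the spectral clustering idea mentioned in point 1 can be sketched in plain NumPy. This is a generic textbook-style version, not the method of any particular paper and not part of Resemblyzer: `spectral_cluster` is a hypothetical name, the number of clusters is assumed to be known, and the demo runs on synthetic vectors.

```python
import numpy as np

def spectral_cluster(embeds: np.ndarray, n_clusters: int, n_iter: int = 20) -> np.ndarray:
    """Cluster embeddings via a normalised graph Laplacian followed by k-means."""
    # Affinity: cosine similarity, clipped so that all edge weights are non-negative
    normed = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
    affinity = np.clip(normed @ normed.T, 0.0, None)

    # Symmetric normalised Laplacian: L = I - D^(-1/2) A D^(-1/2)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(affinity.sum(axis=1), 1e-12))
    laplacian = np.eye(len(embeds)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]

    # Eigenvectors of the smallest eigenvalues give the spectral features
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    feats = vecs[:, :n_clusters]
    feats = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), 1e-12)

    # Simple k-means with deterministic farthest-point initialisation
    centers = [feats[0]]
    for _ in range(1, n_clusters):
        dists = np.min([np.linalg.norm(feats - c, axis=1) for c in centers], axis=0)
        centers.append(feats[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(feats[:, None] - centers[None], axis=2), axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    return labels

# Synthetic stand-in for partial embeddings: 3 "speakers", 6 embeddings each
rng = np.random.default_rng(0)
embeds = np.repeat(np.eye(3, 64), 6, axis=0) + 0.01 * rng.normal(size=(18, 64))
labels = spectral_cluster(embeds, n_clusters=3)
print(labels)
```

The cluster ids themselves are arbitrary; only the grouping matters. Estimating the number of speakers, which real diarization needs, is not covered here.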

CorentinJ commented 4 years ago

I mean that at this point in the function: https://github.com/resemble-ai/Resemblyzer/blob/master/resemblyzer/voice_encoder.py#L151, the variable mels has shape (N, 160, 40), where N is the batch size. You will probably end up with a mel of shape (160, 40), so you will have to add an extra dimension (e.g. by doing mels[None, ...]) before forwarding the mel.
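
A quick shape check of that indexing trick, using a zero array as a placeholder for a real mel spectrogram:

```python
import numpy as np

mel = np.zeros((160, 40), dtype=np.float32)  # placeholder for one partial mel spectrogram
mels = mel[None, ...]                        # add a leading batch dimension of size 1
print(mels.shape)  # (1, 160, 40)
```

Forwarding such a batch would be expected to give an output of shape (1, embedding_size), i.e. one partial embedding per call.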

shashankpr commented 4 years ago

Got it! Thank you very much for clearing up these doubts. I will close this and update here when I make significant progress with unsupervised and streaming diarization. Great work once again!

CorentinJ commented 4 years ago

Sure, it's fine if you leave it open until we figure it out.

nikitalpopov commented 4 years ago

Hi, @shashankpr Any progress on this task?

lonniehartley commented 4 years ago

Hello, could you please reference the task descriptions in your email, as there are many?

Thank you, Lonnie Hartley

shashankpr commented 4 years ago

Hi @nikitalpopov, I have been doing some experiments around this but haven't really had the time to implement something good. I am going to start working on it this week and will update you if I make any progress.

nikitalpopov commented 4 years ago

@shashankpr Could I help you with something?

nikitalpopov commented 4 years ago

@CorentinJ @shashankpr I tried to implement it myself, but the results are horrible (DER does not get any better than 60%). Could you please check my test notebook? https://github.com/nikitalpopov/master/blob/dev/demo.ipynb

RubenPants commented 2 years ago

Writing my solution here, since I've been trying to implement a way of embedding during streaming. In my use-case, streaming happens by pushing bytes of audio segments:

import io
import numpy as np
import soundfile as sf
from resemblyzer import VoiceEncoder

encoder = VoiceEncoder()

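def embed(chunk_bytes: bytes) -> np.ndarray:
    """Embed the given chunk of raw (headerless) 16-bit PCM bytes."""
    # format='RAW' means there is no WAV header, so the sample rate,
    # channel count and sample format have to be given explicitly.
    data, _ = sf.read(
        io.BytesIO(chunk_bytes),
        samplerate=16000,
        channels=1,
        format='RAW',
        subtype='PCM_16',
        endian='FILE',
    )
    return encoder.embed_utterance(data)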

An example of this code's result (after PCA) is shown below: image
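
If soundfile is not available, the byte decoding step can also be done with NumPy alone. This is an alternative sketch rather than the code above: `pcm16_bytes_to_float` is a hypothetical helper, and it assumes the stream carries little-endian, mono, 16-bit PCM.

```python
import numpy as np

def pcm16_bytes_to_float(chunk_bytes: bytes) -> np.ndarray:
    """Convert little-endian 16-bit mono PCM bytes to float32 samples in [-1, 1)."""
    samples = np.frombuffer(chunk_bytes, dtype="<i2")  # "<i2" = little-endian int16
    return samples.astype(np.float32) / 32768.0

# Round-trip a few known sample values through bytes
raw = np.array([0, 16384, -32768], dtype="<i2").tobytes()
decoded = pcm16_bytes_to_float(raw)
print(decoded)
```

The resulting array could then be passed to `encoder.embed_utterance` in place of the `data` returned by `sf.read`.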