pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License
6.38k stars · 784 forks

How can I load my own embeddings from a custom model #1287

Closed · ItakeLs closed 1 year ago

ItakeLs commented 1 year ago

Hello, I am asking if it is possible to load embeddings from a custom transformer-encoder model instead of using the one provided by SpeechBrain. Each embedding covers 45 seconds of the audio.
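
For the snippets below to run end to end, chunks is assumed to be the custom encoder's output: a list of dicts, one per 45-second chunk, each holding a single embedding row. A hypothetical stand-in (the 192-dimensional embedding size and the chunk count are assumptions, not part of the original):

import numpy as np

# hypothetical stand-in for the custom encoder's output:
# one dict per 45-second chunk, each with a (1, 192) embedding
rng = np.random.default_rng(0)
chunks = [{"embedding": rng.standard_normal((1, 192))} for _ in range(20)]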

# collect one embedding per 45-second chunk from the custom encoder
chunk_embeddings = [chunk["embedding"] for chunk in chunks]

Concatenate the embeddings and wrap them in a SlidingWindowFeature:

import numpy as np
from pyannote.core import SlidingWindow, SlidingWindowFeature

concatenated_embeddings = np.concatenate(chunk_embeddings, axis=0)

# each embedding covers 45 seconds of audio (non-overlapping chunks)
chunk_duration = 45.0
chunk_step = 45.0

window = SlidingWindow(duration=chunk_duration, step=chunk_step)
sliding_window_features = SlidingWindowFeature(concatenated_embeddings, window)
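
A quick way to check that the wrapping is consistent is to inspect the feature's data shape and window settings (a minimal usage sketch; data and sliding_window are attributes of pyannote.core's SlidingWindowFeature):

print(sliding_window_features.data.shape)               # (num_chunks, 192)
print(sliding_window_features.sliding_window.duration)  # 45.0 seconds per chunk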

Perform clustering and convert clusters to speaker diarization labels:

from pyannote.audio.pipelines.clustering import HiddenMarkovModelClustering

# Based on your configuration file
covariance_type = "diag"
threshold = 0.35

clustering = HiddenMarkovModelClustering(covariance_type=covariance_type, threshold=threshold)
diarization = clustering(sliding_window_features)

for segment, label in diarization.itertracks(yield_label=True):
    pass  # do post-processing here
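
To sanity-check the embeddings independently of pyannote's clustering classes, a minimal sketch using scikit-learn's agglomerative clustering (assumptions: scikit-learn 1.2+ is available, and the 0.7 cosine-distance threshold is a placeholder, not a tuned value):

from sklearn.cluster import AgglomerativeClustering

clusterer = AgglomerativeClustering(
    n_clusters=None,         # let the distance threshold decide the speaker count
    distance_threshold=0.7,  # placeholder cosine-distance threshold
    metric="cosine",
    linkage="average",
)
speaker_labels = clusterer.fit_predict(concatenated_embeddings)  # one label per chunk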

However, I could not get the clustering to work, and when I looked through your speaker diarization pipeline I found a lot more code under the hood. Could you point me in the right direction on how to use my own model's embeddings in the speaker diarization pipeline?

github-actions[bot] commented 1 year ago

We found the following entry in the FAQ which you may find helpful:

Feel free to close this issue if you found an answer in the FAQ. Otherwise, please give us a little time to review.

This is an automated reply, generated by FAQtory

hbredin commented 1 year ago

This is not supported out of the box. You'll have to edit the PretrainedSpeakerEmbedding function and add support for your own embedding:

https://github.com/pyannote/pyannote-audio/blob/74939acbfa830521a434cb4068176196dd9612dc/pyannote/audio/pipelines/speaker_verification.py#L451
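
A minimal sketch of what that edit could look like, assuming the interface shared by the wrappers in speaker_verification.py (sample_rate, dimension, metric, and min_num_samples properties, plus a __call__ that takes batched waveforms and optional masks). CustomSpeakerEmbedding and my_encoder are hypothetical names, and the exact interface may differ between versions, so check the linked source:

import numpy as np
import torch

class CustomSpeakerEmbedding:
    # hypothetical wrapper exposing a custom transformer encoder to the pipeline

    def __init__(self, my_encoder, device=None):
        self.my_encoder = my_encoder  # your trained encoder (assumption)
        self.device = device or torch.device("cpu")

    @property
    def sample_rate(self) -> int:
        return 16000  # sample rate the encoder expects (assumption)

    @property
    def dimension(self) -> int:
        return 192  # embedding size of the encoder (assumption)

    @property
    def metric(self) -> str:
        return "cosine"  # distance used by the downstream clustering

    @property
    def min_num_samples(self) -> int:
        return 640  # shortest chunk the encoder can embed (assumption)

    def __call__(self, waveforms: torch.Tensor, masks: torch.Tensor = None) -> np.ndarray:
        # waveforms: (batch, channel, sample); masks: (batch, sample) or None
        with torch.inference_mode():
            embeddings = self.my_encoder(waveforms.to(self.device))
        return embeddings.cpu().numpy()  # (batch, dimension)

PretrainedSpeakerEmbedding would then need an extra branch that returns this class when asked for your model, mirroring how the existing wrappers are dispatched in the linked function.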

ItakeLs commented 1 year ago

Thanks, I appreciate the help.