Closed · davide-scalzo closed this issue 3 years ago
The configuration file for the pretrained diarization pipeline `dia` does indeed rely on 2 s embeddings:
```yaml
pipeline:
  name: pyannote.audio.pipeline.speaker_diarization.SpeakerDiarization
  params:
    sad_scores:
      sad_dihard:
        duration: 2.0
        step: 0.1
    scd_scores:
      scd_dihard:
        duration: 2.0
        step: 0.1
    embedding:
      emb_voxceleb:
        duration: 2.0
        step: 0.05
    metric: cosine
    method: affinity_propagation
```
You could try re-training a pipeline on your data with shorter durations.
You might also want to have a look at this related issue.
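For illustration, a retrained pipeline config with a shorter embedding window could look like the following sketch. This keeps the pretrained `emb_voxceleb` wrapper from the config above; the `0.5` duration is an assumed value for illustration, not a recommendation — the right value depends on your data.

```yaml
# Sketch: same pipeline config, embedding window shortened.
# duration: 0.5 is an assumed example value, not a tuned setting.
pipeline:
  name: pyannote.audio.pipeline.speaker_diarization.SpeakerDiarization
  params:
    sad_scores:
      sad_dihard:
        duration: 2.0
        step: 0.1
    scd_scores:
      scd_dihard:
        duration: 2.0
        step: 0.1
    embedding:
      emb_voxceleb:
        duration: 0.5   # was 2.0 in the pretrained pipeline
        step: 0.05
    metric: cosine
    method: affinity_propagation
```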
I see, thank you @hbredin. I assume that also requires training each individual model from scratch?
Not necessarily.
Don't change SAD and SCD durations, they output scores every few milliseconds anyway.
For embeddings, that is different. Start by only retraining the pipeline with shorter duration. If that does not work, try fine-tuning the embedding model on your data and with shorter duration.
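To see why the 2 s embedding window matters for very short answers, it helps to count how many windows lie entirely inside a given speech segment. The sketch below uses plain Python and hypothetical timings (a 3 s clip whose answer occupies 2.2 s to 2.7 s); the window/step values come from the configs in this thread.

```python
def pure_windows(seg_start, seg_end, duration, step, clip_len):
    """Count sliding windows [t, t + duration], t = 0, step, 2*step, ...,
    that fit in the clip and lie entirely inside [seg_start, seg_end]."""
    count = 0
    n = 0
    eps = 1e-9  # tolerance for floating-point grid positions
    while n * step + duration <= clip_len + eps:
        t = n * step
        if t >= seg_start - eps and t + duration <= seg_end + eps:
            count += 1
        n += 1
    return count

# Hypothetical 3 s Q&A clip; the answer occupies 2.2 s .. 2.7 s.
# With the pretrained 2.0 s window (50 ms step), no window fits inside
# the 0.5 s answer, so every embedding mixes question and answer:
print(pure_windows(2.2, 2.7, duration=2.0, step=0.05, clip_len=3.0))  # 0
# A 0.5 s window yields a "pure answer" embedding the clustering can use:
print(pure_windows(2.2, 2.7, duration=0.5, step=0.05, clip_len=3.0))  # 1
```

This is why shortening only the embedding duration (not SAD/SCD) can be enough: the segmentation scores are already fine-grained, but a window longer than the answer can never produce an embedding of the answering speaker alone.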
Fantastic, thank you.
One last thing: in order to retrain the pipeline with a shorter duration, where would that be set up? `--help` shows this output:
```
................... <experiment_dir>/config.yml ...................
pipeline:
  name: Yin2018
  params:
    sad: tutorials/pipeline/sad
    scd: tutorials/pipeline/scd
    emb: tutorials/pipeline/emb
    metric: angular

# preprocessors can be used to automatically add keys into
# each (dict) file obtained from pyannote.database protocols.
preprocessors:
  audio: ~/.pyannote/db.yml    # load template from YAML file
  video: ~/videos/{uri}.mp4    # define template directly

# one can freeze some hyper-parameters if needed (e.g. when
# only part of the pipeline needs to be updated)
freeze:
  speech_turn_segmentation:
    speech_activity_detection:
      onset: 0.5
      offset: 0.5
...................................................................

"train" mode:
  Tune the pipeline hyper-parameters
    <experiment_dir>/<database.task.protocol>.<subset>.yml

"best" mode:
  Display current best loss and corresponding hyper-parameters.

"apply" mode:
  Apply the pipeline (with best set of hyper-parameters)
```
would that be something like

```yaml
pipeline:
  name: pyannote.audio.pipeline.speaker_diarization.SpeakerDiarization
  params:
    sad_scores: ${EXP_DIR}/sad_ami
    scd_scores: ${EXP_DIR}/scd_ami
    embedding: custom_emb_ami_path
    method: affinity_propagation
```

where `custom_emb_ami_path` is produced by

```shell
pyannote-audio emb apply --step=0.1 --duration=0.5 --pretrained=emb_ami --subset=${SUBSET} ${EXP_DIR} AMI.SpeakerDiarization.MixHeadset
```

?
Yes, exactly. See also this issue for an alternative.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @hbredin , great repo and talk!
I'm trying to perform diarization to detect the number of speakers in a short question-and-answer clip.
Clips are between 3 and 10 seconds on average and I want to verify that there is an answer present by detecting a second speaker utterance (and possibly extract that as a second piece of audio).
However, the pretrained pipeline fails to detect the second speaker in my audio, although it works very well on longer audio files (e.g. VoxConverse).
I assume that is because the embedding model looks at 2-second time windows and the answer might be a short "yes" or "no".
Is there any way to pass a duration parameter to the embedding model?
If not, how best to achieve this?
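For context on what the pipeline does with those embeddings: per the config at the top of this thread, it clusters them by cosine affinity (`metric: cosine`, `method: affinity_propagation`). A minimal stdlib-only sketch of the cosine step, with made-up toy vectors standing in for real speaker embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings": two windows from speaker A, one from speaker B.
a1 = [0.9, 0.1, 0.0]
a2 = [0.8, 0.2, 0.1]
b1 = [0.1, 0.9, 0.2]

same = cosine_similarity(a1, a2)   # same-speaker pair
cross = cosine_similarity(a1, b1)  # cross-speaker pair
# Windows from the same speaker should be closer than cross-speaker pairs;
# the pipeline clusters on exactly this kind of affinity matrix.
print(same > cross)  # True
```

The practical implication for short clips is the same as above: if no window contains only the answering speaker, no embedding ends up far enough from the question's cluster to be split off as a second speaker.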