pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

Embeddings with short time windows #458

Closed: davide-scalzo closed this issue 3 years ago

davide-scalzo commented 3 years ago

Hi @hbredin , great repo and talk!

I'm trying to perform diarization to detect the number of speakers in short question-and-answer clips.

Clips are between 3 and 10 seconds on average and I want to verify that there is an answer present by detecting a second speaker utterance (and possibly extract that as a second piece of audio).

However, the pretrained pipeline fails to detect the second speaker in my audio, while it works very well on longer audio files (e.g. VoxConverse).

I assume that is because the embedding model looks at 2-second time windows, while the answer might be a short "yes" or "no".

Is there any way to pass a duration parameter to the embedding model?

If not, how best to achieve this?
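The window-length intuition can be checked with a quick back-of-the-envelope count (plain Python, not pyannote code; the utterance and window lengths below are made-up illustrative values):

```python
def num_full_windows(utterance, duration, step):
    """Count sliding windows of length `duration` (hop `step`) that fit
    entirely inside an utterance of length `utterance` (all in seconds)."""
    if utterance < duration:
        return 0
    return int((utterance - duration) / step) + 1

# A short ~0.4 s "yes" answer vs a 2.0 s embedding window:
print(num_full_windows(0.4, duration=2.0, step=0.05))  # 0: no window lies inside the answer
print(num_full_windows(0.4, duration=0.3, step=0.05))  # 3: shorter windows do
```

With a 2-second window, a sub-second answer never fills a window on its own, so any embedding that covers it is dominated by the other speaker's speech.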

hbredin commented 3 years ago

The configuration file for the pretrained diarization pipeline dia does indeed rely on 2s embeddings:

pipeline:
   name: pyannote.audio.pipeline.speaker_diarization.SpeakerDiarization
   params:
      sad_scores:
         sad_dihard:
            duration: 2.0
            step: 0.1
      scd_scores:
         scd_dihard:
            duration: 2.0
            step: 0.1
      embedding:
         emb_voxceleb:
            duration: 2.0
            step: 0.05
      metric: cosine
      method: affinity_propagation

You could try re-training a pipeline on your data with shorter durations.
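For instance (the 0.5 s value is purely illustrative and would need tuning on your data), the embedding entry of the config above could become:

```yaml
      embedding:
         emb_voxceleb:
            duration: 0.5   # was 2.0; shorter windows for short utterances
            step: 0.05
```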

You might also want to have a look at this related issue.

davide-scalzo commented 3 years ago

I see, thank you @hbredin. I assume that also requires retraining each individual model from scratch?

hbredin commented 3 years ago

Not necessarily.

Don't change the SAD and SCD durations; those models output scores every few milliseconds anyway.

For embeddings, that is different. Start by only retraining the pipeline with shorter duration. If that does not work, try fine-tuning the embedding model on your data and with shorter duration.

davide-scalzo commented 3 years ago

Fantastic, thank you.

One last thing: in order to retrain the pipeline with a shorter duration, where would that be set up? --help shows this output:

  ................... <experiment_dir>/config.yml ...................
    pipeline:
       name: Yin2018
       params:
          sad: tutorials/pipeline/sad
          scd: tutorials/pipeline/scd
          emb: tutorials/pipeline/emb
          metric: angular

    # preprocessors can be used to automatically add keys into
    # each (dict) file obtained from pyannote.database protocols.
    preprocessors:
       audio: ~/.pyannote/db.yml   # load template from YAML file
       video: ~/videos/{uri}.mp4   # define template directly

    # one can freeze some hyper-parameters if needed (e.g. when
    # only part of the pipeline needs to be updated)
    freeze:
       speech_turn_segmentation:
          speech_activity_detection:
              onset: 0.5
              offset: 0.5
    ...................................................................

"train" mode:
    Tune the pipeline hyper-parameters
        <experiment_dir>/<database.task.protocol>.<subset>.yml

"best" mode:
    Display current best loss and corresponding hyper-parameters.

"apply" mode
    Apply the pipeline (with best set of hyper-parameters)

would that be something like

pipeline:
   name: pyannote.audio.pipeline.speaker_diarization.SpeakerDiarization
   params:
      sad_scores: ${EXP_DIR}/sad_ami
      scd_scores: ${EXP_DIR}/scd_ami
      embedding: custom_emb_ami_path
      method: affinity_propagation

where custom_emb_ami_path is produced by pyannote-audio emb apply --step=0.1 --duration=0.5 --pretrained=emb_ami --subset=${SUBSET} ${EXP_DIR} AMI.SpeakerDiarization.MixHeadset ?

hbredin commented 3 years ago

Yes, exactly. See also this issue for an alternative.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.