Running `speaker-diarization-3.1` with local `wespeaker-voxceleb-resnet34-LM` needs special naming to circumvent ONNX/protobuf loading errors

simonottenhauskenbun commented 9 months ago

Tested versions

Versions to reproduce:

pyannote.audio==3.1.1 # relevant code has not changed in 3.1.1
pyannote.audio==3.1.0 
pyannote.core==5.0.0
pyannote.database==5.0.1
pyannote.metrics==3.2.1
pyannote.pipeline==3.0.1

System information

Ubuntu 22.04.4 LTS - NVIDIA A100 PCIe 40GB - python 3.11

Issue description

First of all, thank you Hervé BREDIN for the great work you are doing here!

I am in the process of combining your speaker diarization with ASR models (non-Whisper), and I need to run speaker-diarization-3.1 using local files only.

When I try to load the pipeline from these local files, I get ONNX/protobuf loading errors.

How to reproduce

Download the following files into models/ (source --> local target):

https://huggingface.co/pyannote/speaker-diarization-3.1/blob/main/config.yaml --> models/config.yaml
https://huggingface.co/pyannote/segmentation-3.0/blob/main/pytorch_model.bin --> models/pytorch_model_segmentation-3.0.bin
https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM/blob/main/pytorch_model.bin --> models/wespeaker-voxceleb-resnet34-LM.bin
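The download step can also be scripted. This is a minimal sketch, assuming the `huggingface_hub` package is installed; `DOWNLOADS` and `fetch_all` are names made up for this example. Some of these repositories may require accepting their user conditions and passing an access token.

```python
import shutil
from pathlib import Path

# (repo_id, file inside the repo, local target), taken from the list above.
DOWNLOADS = [
    ("pyannote/speaker-diarization-3.1", "config.yaml", "models/config.yaml"),
    ("pyannote/segmentation-3.0", "pytorch_model.bin",
     "models/pytorch_model_segmentation-3.0.bin"),
    ("pyannote/wespeaker-voxceleb-resnet34-LM", "pytorch_model.bin",
     "models/wespeaker-voxceleb-resnet34-LM.bin"),
]


def fetch_all(token=None, downloads=DOWNLOADS):
    """Download each file into the Hugging Face cache, then copy it to its target."""
    # Imported lazily so the table above can be inspected without the package.
    from huggingface_hub import hf_hub_download

    for repo_id, filename, target in downloads:
        cached = hf_hub_download(repo_id=repo_id, filename=filename, token=token)
        Path(target).parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(cached, target)
```

Calling `fetch_all(token=...)` once should leave the three files under models/ with the names used in the config below.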

Adjust the config to point to the local files:

version: 3.1.0

pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: AgglomerativeClustering
    # embedding: pyannote/wespeaker-voxceleb-resnet34-LM
    embedding: models/wespeaker-voxceleb-resnet34-LM.bin
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    segmentation: models/pytorch_model_segmentation-3.0.bin
    # segmentation: pyannote/segmentation-3.0
    segmentation_batch_size: 32

params:
  clustering:
    method: centroid
    min_cluster_size: 12
    threshold: 0.7045654963945799
  segmentation:
    min_duration_off: 0.0

Install onnxruntime (pip install onnxruntime). If it is not installed, https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/pipelines/speaker_verification.py#L420 raises:
ImportError: 'onnxruntime' must be installed to use 'models/model_wespeaker-voxceleb-resnet34-LM.bin' embeddings.

Load the pipeline:

from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("/path/to/adapted/config.yaml")  # <-- this raises the protobuf error

Location of the bug

I've tracked down the source of the bug: https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/pipelines/speaker_verification.py#L712

# .venv/lib/python3.11/site-packages/pyannote/audio/pipelines/speaker_verification.py
def PretrainedSpeakerEmbedding(
    embedding: PipelineModel,
    device: torch.device = None,
    use_auth_token: Union[Text, None] = None,
):
    #...
    if isinstance(embedding, str) and "pyannote" in embedding:
        return PyannoteAudioPretrainedSpeakerEmbedding(
            embedding, device=device, use_auth_token=use_auth_token
        )

    elif isinstance(embedding, str) and "speechbrain" in embedding:
        return SpeechBrainPretrainedSpeakerEmbedding(
            embedding, device=device, use_auth_token=use_auth_token
        )

    elif isinstance(embedding, str) and "nvidia" in embedding:
        return NeMoPretrainedSpeakerEmbedding(embedding, device=device)

    elif isinstance(embedding, str) and "wespeaker" in embedding:
        return ONNXWeSpeakerPretrainedSpeakerEmbedding(embedding, device=device)  # <-- this is called, but the wespeaker-voxceleb-resnet34-LM is not an ONNX model

    else:
        # fallback to pyannote in case we are loading a local model
        return PyannoteAudioPretrainedSpeakerEmbedding(
            embedding, device=device, use_auth_token=use_auth_token
        )
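
To make the effect of these substring checks concrete, here is a small stand-alone sketch that mimics only the branch order of the excerpt above. It does not import pyannote; `guess_backend` and the backend labels are hypothetical names used for illustration.

```python
def guess_backend(embedding: str) -> str:
    """Mimic the substring checks, in the same order as the excerpt above."""
    for keyword, backend in (
        ("pyannote", "pyannote"),
        ("speechbrain", "speechbrain"),
        ("nvidia", "nemo"),
        ("wespeaker", "onnx-wespeaker"),
    ):
        if keyword in embedding:
            return backend
    return "pyannote"  # fallback branch for anything else


for path in (
    "pyannote/wespeaker-voxceleb-resnet34-LM",
    "models/wespeaker-voxceleb-resnet34-LM.bin",
    "models/pyannote_model_wespeaker-voxceleb-resnet34-LM.bin",
):
    print(f"{path} -> {guess_backend(path)}")
```

The hub name and the renamed local file both resolve to the pyannote branch, while the plain local path lands in the ONNX WeSpeaker branch because "wespeaker" is the first keyword it contains.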

Workaround

The workaround is to include pyannote in the path to the model, so that the first if statement triggers and wespeaker-voxceleb-resnet34-LM is loaded as a pyannote model.

Changed config:

version: 3.1.0

pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: AgglomerativeClustering
    # embedding: pyannote/wespeaker-voxceleb-resnet34-LM
    # embedding: models/model_wespeaker-voxceleb-resnet34-LM.bin  # <-- does not work, model type guessing code guesses the wrong model type
    embedding: models/pyannote_model_wespeaker-voxceleb-resnet34-LM.bin # <-- this works, due to 'pyannote' in the path
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    segmentation: models/pytorch_model_segmentation-3.0.bin
    # segmentation: pyannote/segmentation-3.0
    segmentation_batch_size: 32

params:
  clustering:
    method: centroid
    min_cluster_size: 12
    threshold: 0.7045654963945799
  segmentation:
    min_duration_off: 0.0
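
The rename itself can be done by hand or with a few lines of Python. A minimal sketch using only the standard library; `copy_with_pyannote_prefix` is a hypothetical helper, and the `embedding:` entry in the config still has to be pointed at the returned path.

```python
import shutil
from pathlib import Path


def copy_with_pyannote_prefix(src: str) -> Path:
    """Copy `src` next to itself under a file name that starts with 'pyannote_'."""
    source = Path(src)
    target = source.with_name(f"pyannote_{source.name}")
    shutil.copy2(source, target)  # keep the original file untouched
    return target
```

Copying rather than moving keeps the original download in place, at the cost of some extra disk space.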

How to fix (suggestions)

1. Write some documentation on how to load local models, pointing out this behaviour - maybe that already exists?
2. Add a print statement to the model type guessing code to inform the user about what is happening
3. Make the model type explicit, rather than inferring it from the file name
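
As a rough illustration of the second suggestion, the check below flags a local file that would be routed to the ONNX loader even though it looks like a PyTorch checkpoint. This is only a sketch and not pyannote code: `HUB_KEYWORDS` and `local_path_warning` are hypothetical, and the zip test is a heuristic that assumes the checkpoint was written by `torch.save` in its default zip-based format.

```python
import os
import zipfile
from typing import Optional

# Same keywords, in the same order, as the substring checks in the excerpt above.
HUB_KEYWORDS = ("pyannote", "speechbrain", "nvidia", "wespeaker")


def local_path_warning(embedding: str) -> Optional[str]:
    """Return a warning if a local torch-style file would go to the ONNX loader."""
    if not os.path.isfile(embedding):
        return None  # not a local file, so there is nothing to inspect
    matched = next((k for k in HUB_KEYWORDS if k in embedding), None)
    if matched == "wespeaker" and zipfile.is_zipfile(embedding):
        return (
            f"'{embedding}' matches 'wespeaker' and would be loaded as ONNX, "
            "but it looks like a PyTorch checkpoint (zip archive)."
        )
    return None
```

A message like this, printed or logged before loading, would point users at the file name instead of leaving them with a bare protobuf parsing error.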

Minimal reproduction example (MRE)

https://gist.github.com/simonottenhauskenbun/cf05d31b53055750e62a06ef5e317462

hbredin commented 9 months ago

Thanks. Would you contribute proposed suggestion (1.)?

simonottenhauskenbun commented 9 months ago

Sure! Working on it...

seanzhang-zhichen commented 8 months ago

onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model /data/models/wespeaker-voxceleb-resnet34-LM/pytorch_model.bin failed:Protobuf parsing failed.

simonottenhauskenbun commented 8 months ago

https://github.com/pyannote/pyannote-audio/pull/1662 aims to fix this by "Write some documentation how to load local models, pointing out this behaviour"

hbredin commented 8 months ago

I did get the notification for PR #1662. Thanks. Will have a look when things get quieter on my side...
