Closed by hbredin 1 year ago
Among the people who raised their thumbs on this issue, does anyone want to take care of it?
Hi, I am the initiator of WeSpeaker, thanks for your interest in our toolkit! We will very soon update wespeaker to support installation and loading the pytorch model via something like "model = wespeaker.load_pytorch_model" (currently we support wespeaker.load_model, but it's ONNX); then I will open a PR to "pipelines/speaker_verification".
Thanks @wsstriving! I worked on this a few days ago and already have a working prototype.
Instead of adding one more dependency to pyannote.audio, I was planning to copy the relevant part of WeSpeaker into a new pyannote.audio.models.embedding.wespeaker module.
I am just stuck with the fact that WeSpeaker uses the Apache-2.0 license, while pyannote uses the MIT license. Both are permissive, but I am not quite sure where and how to mention the WeSpeaker license in the pyannote codebase. Would putting it at the top of the pyannote.audio.models.embedding.wespeaker directory be enough?
Another option that I am considering is adding an embedding entrypoint to pyannote.audio so that any external library can provide embeddings usable in pyannote as long as it follows the API. What do you think?
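For illustration only, here is a hypothetical sketch of what such an external embedding provider could look like. The property names (sample_rate, dimension, metric) and the __call__ signature are my guess at the API being discussed and are not confirmed in this thread:

import numpy as np
import torch

class ExternalSpeakerEmbedding:
    """Hypothetical wrapper exposing a third-party embedding model to pyannote."""

    def __init__(self, model: torch.nn.Module, device: torch.device = torch.device("cpu")):
        self.model = model.eval().to(device)
        self.device = device

    @property
    def sample_rate(self) -> int:
        return 16000

    @property
    def dimension(self) -> int:
        return 256  # size of the embedding vectors produced by `model` (assumed)

    @property
    def metric(self) -> str:
        return "cosine"

    def __call__(self, waveforms: torch.Tensor, masks: torch.Tensor = None) -> np.ndarray:
        # waveforms: (batch, channel, num_samples) audio chunks
        with torch.inference_mode():
            embeddings = self.model(waveforms.to(self.device))
        return embeddings.cpu().numpy()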
Hi Bredin, I think the first option is just fine. We implemented the CLI support; you can check it here: https://github.com/wenet-e2e/wespeaker/blob/master/docs/python_package.md
Now, it's easy to use the wespeaker model in pytorch as:
import wespeaker
model = wespeaker.load_model('english')
model.set_gpu(0)
print(model.model)
# model.model(feats)
Check https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/cli/speaker.py#L63 for more details on how to use it.
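For anyone who wants to try it end to end, here is a short sketch based on the linked docs; the method names extract_embedding and compute_similarity are taken from wespeaker/cli/speaker.py and should be double-checked against the current version:

import wespeaker

model = wespeaker.load_model('english')
model.set_gpu(0)  # optional; skip to stay on CPU

# embedding for a single utterance
embedding = model.extract_embedding('utt1.wav')
print(embedding.shape)

# cosine similarity between two utterances
score = model.compute_similarity('utt1.wav', 'utt2.wav')
print(score)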
Quick update: I switched to the pytorch-based WeSpeaker model and dropped the onnxruntime dependency.
Could any of you (who raised their thumbs) try using pyannote/wespeaker-voxceleb-resnet34-LM in place of hbredin/wespeaker-voxceleb-resnet34-LM?
@hbredin
I made a quick test; I haven't checked the results and I'm unsure of the pipeline definition.
What I ran:
from pyannote.audio.pipelines import SpeakerDiarization
from pyannote.audio.pipelines.utils.hook import ProgressHook
import torch

pipeline = SpeakerDiarization(
    segmentation="pyannote/segmentation-3.0",
    embedding="pyannote/wespeaker-voxceleb-resnet34-LM")

pipeline.instantiate({
    "segmentation": {
        "min_duration_off": 0.0,
    },
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.7045654963945799,
    },
})

pipeline.to(torch.device("mps"))

with ProgressHook() as hook:
    diarization = pipeline("./download/test.wav", hook=hook)
I got this warning: Model was trained with pyannote.audio 2.1.1, yours is 3.0.1. Bad things might happen unless you revert pyannote.audio to 2.x.
Seems to work on CPU.
For GPU (Mac M1 Max) I got this error: NotImplementedError: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
After setting the env var it works, but with a mix of GPU and CPU.
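For reference (my own note, not from the error message): to be safe, set the fallback variable before torch is first imported, for example:

import os
# set the fallback flag early (before importing torch, to be safe)
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from pyannote.audio.pipelines import SpeakerDiarization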
Thanks @stygmate for the feedback.
To use the same setup as pyannote/speaker-diarization-3.0, one should use the following:
from pyannote.audio.pipelines import SpeakerDiarization
from pyannote.audio.pipelines.utils.hook import ProgressHook
from pyannote.audio import Audio
import torch

pipeline = SpeakerDiarization(
    segmentation="pyannote/segmentation-3.0",
    segmentation_batch_size=32,
    embedding="pyannote/wespeaker-voxceleb-resnet34-LM",
    embedding_exclude_overlap=True,
    embedding_batch_size=32)
# other values of `*_batch_size` may lead to faster processing;
# larger is not necessarily faster.

pipeline.instantiate({
    "segmentation": {
        "min_duration_off": 0.0,
    },
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.7045654963945799,
    },
})

# send the pipeline to your preferred device (pick one)
device = torch.device("cpu")
device = torch.device("cuda")
device = torch.device("mps")
pipeline.to(device)

# load audio in memory (usually leads to faster processing)
# `audio` is the path to your audio file
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io(audio)
file = {"waveform": waveform, "sample_rate": sample_rate}

# process the audio
with ProgressHook() as hook:
    diarization = pipeline(file, hook=hook)
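Side note (not part of the snippet above): assuming the usual pyannote.core Annotation API, the result can then be inspected or saved to RTTM like this:

# print speaker turns
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")

# save the result to an RTTM file
with open("output.rttm", "w") as rttm:
    diarization.write_rttm(rttm)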
I'd love to get feedback from you all regarding possible algorithmic or speed regressions.
@hbredin Give me a wav file to process and I will send you the results.
Closing, as the latest version no longer relies on ONNX runtime.
Please update to pyannote.audio 3.1 and pyannote/speaker-diarization-3.1 (and open new issues if needed).
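For reference, a minimal way to load the updated pipeline (assuming you have accepted the model's user conditions on Hugging Face and have an access token):

import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN",  # replace with your own token
)
pipeline.to(torch.device("cuda"))  # or "cpu" / "mps"

diarization = pipeline("audio.wav")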
It works OK, but I use torch 1.xx.
segmentation          100% 0:00:09
speaker_counting      100% 0:00:00
embeddings            100% 0:06:39
discrete_diarization  100% 0:00:00
And I made some changes for compatibility with torch 1.xx and torch 2.xx in file \pyannote\audio\models\embedding\wespeaker\__init__.py:
import torch

if torch.__version__ >= "2.0.0":
    # use torch.vmap for torch 2.0 or newer
    from torch import vmap
else:
    # use functorch.vmap for torch 1.x
    from functorch import vmap
And change
features = torch.vmap(self._fbank)(waveforms.to(fft_device)).to(device)
to
features = vmap(self._fbank)(waveforms.to(fft_device)).to(device)
Same changes in file \pyannote\audio\models\blocks\pooling.py
Thanks for the feedback (and the PR!).
However, I don't plan to support torch 1.x in the future.
Since its introduction in pyannote.audio 3.x, the ONNX dependency seems to cause lots of problems for pyannote users: #1526 #1523 #1517 #1510 #1508 #1481 #1478 #1477 #1475.
WeSpeaker does provide a pytorch implementation of its pretrained ResNet models. Let's use this!