Closed by hbredin 1 year ago
Among the people who raised their thumbs on this issue, does anyone want to take care of it?
Hi, I am the initiator of WeSpeaker, thanks for your interest in our toolkit! We will very soon update wespeaker to support installation and loading the pytorch model via something like "model = wespeaker.load_pytorch_model" (currently we support wespeaker.load_model, but it's ONNX); then I will open a PR to "pipelines/speaker_verification".
Thanks @wsstriving! I worked on this a few days ago and already have a working prototype.
Instead of adding one more dependency to pyannote.audio, I was planning to copy the relevant part of WeSpeaker into a new pyannote.audio.models.embedding.wespeaker module.
I am just stuck with the fact that WeSpeaker uses the Apache-2.0 license, while pyannote uses the MIT license. Both are permissive, but I am not quite sure where and how to mention the WeSpeaker license in the pyannote codebase. Would putting it at the top of the pyannote.audio.models.embedding.wespeaker directory be enough?
Another option that I am considering is adding an embedding entrypoint to pyannote.audio so that any external library can provide embeddings usable in pyannote as long as it follows the API. What do you think?
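For illustration only, here is a hypothetical sketch of what such an external embedding provider could look like. The property names (sample_rate, dimension, metric) and the __call__ signature are my guess at the API being discussed and are not confirmed in this thread:

import numpy as np
import torch

class ExternalSpeakerEmbedding:
    """Hypothetical wrapper exposing a third-party embedding model to pyannote."""

    def __init__(self, model: torch.nn.Module, device: torch.device = torch.device("cpu")):
        self.model = model.eval().to(device)
        self.device = device

    @property
    def sample_rate(self) -> int:
        return 16000

    @property
    def dimension(self) -> int:
        return 256  # size of the embedding vectors produced by `model` (assumed)

    @property
    def metric(self) -> str:
        return "cosine"

    def __call__(self, waveforms: torch.Tensor, masks: torch.Tensor = None) -> np.ndarray:
        # waveforms: (batch, channel, num_samples) audio chunks
        with torch.inference_mode():
            embeddings = self.model(waveforms.to(self.device))
        return embeddings.cpu().numpy()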
Hi Bredin, I think the first option is just fine. We implemented the CLI support; you can check it here: https://github.com/wenet-e2e/wespeaker/blob/master/docs/python_package.md
Now, it's easy to use the wespeaker model in pytorch as:
import wespeaker
model = wespeaker.load_model('english')
model.set_gpu(0)
print(model.model)
# model.model(feats)
Check https://github.com/wenet-e2e/wespeaker/blob/master/wespeaker/cli/speaker.py#L63 for more details on how to use it.
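For anyone who wants to try it end to end, here is a short sketch based on the linked docs; the method names extract_embedding and compute_similarity are taken from wespeaker/cli/speaker.py and should be double-checked against the current version:

import wespeaker

model = wespeaker.load_model('english')
model.set_gpu(0)  # optional; skip to stay on CPU

# embedding for a single utterance
embedding = model.extract_embedding('utt1.wav')
print(embedding.shape)

# cosine similarity between two utterances
score = model.compute_similarity('utt1.wav', 'utt2.wav')
print(score)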
Quick update: I switched to the pytorch-based WeSpeaker model and dropped the onnxruntime dependency.
Could any of you (who raised their thumbs) try using pyannote/wespeaker-voxceleb-resnet34-LM in place of hbredin/wespeaker-voxceleb-resnet34-LM?
@hbredin
I made a quick test; I haven't checked the results and I'm unsure of the pipeline definition.
What I ran:
from pyannote.audio.pipelines import SpeakerDiarization
from pyannote.audio.pipelines.utils.hook import ProgressHook
import torch

pipeline = SpeakerDiarization(
    segmentation="pyannote/segmentation-3.0",
    embedding="pyannote/wespeaker-voxceleb-resnet34-LM")

pipeline.instantiate({
    "segmentation": {
        "min_duration_off": 0.0,
    },
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.7045654963945799,
    },
})

pipeline.to(torch.device("mps"))

with ProgressHook() as hook:
    diarization = pipeline("./download/test.wav", hook=hook)
I got this warning: Model was trained with pyannote.audio 2.1.1, yours is 3.0.1. Bad things might happen unless you revert pyannote.audio to 2.x.
Seems to work on CPU.
For GPU (Mac M1 Max) I got this error: NotImplementedError: The operator 'aten::upsample_linear1d.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
After setting the env var it works, but with a mix of GPU and CPU.
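For reference (my own note, not from the error message): to be safe, set the fallback variable before torch is first imported, for example:

import os
# set the fallback flag early (before importing torch, to be safe)
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from pyannote.audio.pipelines import SpeakerDiarization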
Thanks @stygmate for the feedback.
To use the same setup as pyannote/speaker-diarization-3.0, one should use the following:
from pyannote.audio.pipelines import SpeakerDiarization
from pyannote.audio.pipelines.utils.hook import ProgressHook
from pyannote.audio import Audio
import torch

pipeline = SpeakerDiarization(
    segmentation="pyannote/segmentation-3.0",
    segmentation_batch_size=32,
    embedding="pyannote/wespeaker-voxceleb-resnet34-LM",
    embedding_exclude_overlap=True,
    embedding_batch_size=32)
# other values of `*_batch_size` may lead to faster processing;
# larger is not necessarily faster.

pipeline.instantiate({
    "segmentation": {
        "min_duration_off": 0.0,
    },
    "clustering": {
        "method": "centroid",
        "min_cluster_size": 12,
        "threshold": 0.7045654963945799,
    },
})

# send the pipeline to your preferred device (pick one)
device = torch.device("cpu")
device = torch.device("cuda")
device = torch.device("mps")
pipeline.to(device)

# load audio in memory (usually leads to faster processing)
# `audio` is the path to your audio file
io = Audio(mono='downmix', sample_rate=16000)
waveform, sample_rate = io(audio)
file = {"waveform": waveform, "sample_rate": sample_rate}

# process the audio
with ProgressHook() as hook:
    diarization = pipeline(file, hook=hook)
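Side note (not part of the snippet above): assuming the usual pyannote.core Annotation API, the result can then be inspected or saved to RTTM like this:

# print speaker turns
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")

# save the result to an RTTM file
with open("output.rttm", "w") as rttm:
    diarization.write_rttm(rttm)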
I'd love to get feedback from you all regarding possible algorithmic or speed regressions.
@hbredin Give me a wav file to process and I will send you the results.
Closing, as the latest version no longer relies on ONNX runtime.
Please update to pyannote.audio 3.1 and pyannote/speaker-diarization-3.1 (and open new issues if needed).
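For reference, a minimal way to load the updated pipeline (assuming you have accepted the model's user conditions on Hugging Face and have an access token):

import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN",  # replace with your own token
)
pipeline.to(torch.device("cuda"))  # or "cpu" / "mps"

diarization = pipeline("audio.wav")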
It works OK, but I use torch 1.xx.
segmentation          100% 0:00:09
speaker_counting      100% 0:00:00
embeddings            100% 0:06:39
discrete_diarization  100% 0:00:00
And I made some changes for compatibility with torch 1.xx and torch 2.xx in file \pyannote\audio\models\embedding\wespeaker\__init__.py:
import torch

if torch.__version__ >= "2.0.0":
    # use torch.vmap for torch 2.0 or newer
    from torch import vmap
else:
    # use functorch.vmap for torch 1.x
    from functorch import vmap
And change
features = torch.vmap(self._fbank)(waveforms.to(fft_device)).to(device)
to
features = vmap(self._fbank)(waveforms.to(fft_device)).to(device)
Same changes in file \pyannote\audio\models\blocks\pooling.py
Thanks for the feedback (and the PR!).
However, I don't plan to support torch 1.x in the future.
Since its introduction in pyannote.audio 3.x, the ONNX dependency seems to cause lots of problems for pyannote users: #1526 #1523 #1517 #1510 #1508 #1481 #1478 #1477 #1475.
WeSpeaker does provide a pytorch implementation of its pretrained ResNet models. Let's use this!