juanmc2005 / diart

A python package to build AI-powered real-time audio applications
https://diart.readthedocs.io
MIT License

The latency of the wespeaker model is too large #225

Closed · SheenChi closed this issue 3 months ago

SheenChi commented 9 months ago

Hello @juanmc2005, I use the hbredin/wespeaker-voxceleb-resnet34-LM (ONNX) model to extract speaker embeddings in the diarization pipeline, but I found that the per-chunk latency is too large (about 1300 ms) with the default parameters (chunk=5s, step=0.5s, latency=0.5s), which cannot meet my real-time requirement. I saw you reported a delay of 48 ms on CPU and 15 ms on GPU. Is there anything I should pay attention to in order to reproduce your performance? Thank you very much for any suggestions.
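
For context, a pipeline set up as described above might look roughly like the sketch below in diart's Python API. This is an illustration, not code from the issue: the API names follow recent diart releases and the segmentation model name is an assumption, so check them against your installed version.

```python
# Sketch: build the pipeline with the wespeaker ONNX embedding model and the
# default streaming parameters from the question. API and model names are
# assumptions based on recent diart versions; verify locally.
from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.models import SegmentationModel, EmbeddingModel

config = SpeakerDiarizationConfig(
    segmentation=SegmentationModel.from_pretrained("pyannote/segmentation"),
    embedding=EmbeddingModel.from_pretrained("hbredin/wespeaker-voxceleb-resnet34-LM"),
    duration=5,   # chunk duration in seconds
    step=0.5,     # sliding window step in seconds
    latency=0.5,  # algorithmic latency in seconds
)
pipeline = SpeakerDiarization(config)
```

A per-chunk figure like the 1300 ms above would typically come from timing each pipeline call on a 5 s chunk, for example with `time.perf_counter()` around the call.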

juanmc2005 commented 9 months ago

Hi @SheenChi, the values I reported were obtained from the output of diart.stream on my hardware: an AMD Ryzen 9 CPU and an Nvidia RTX 4060 Max-Q GPU.

If you find the model too slow on your hardware, you can try pyannote/embedding, which is the fastest one. If that's still not enough, you could try quantizing a model you like or distilling it into a smaller model. Depending on your hardware, I think distillation would be my preferred first step, but it requires training.
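
As a minimal sketch of the first suggestion, swapping in pyannote/embedding while keeping everything else at the defaults might look like this (same assumed config-based API as in the earlier sketch; verify against your diart version):

```python
# Sketch: replace only the embedding model with the lighter pyannote/embedding,
# keeping default duration/step/latency. API names are assumptions.
from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.models import EmbeddingModel

config = SpeakerDiarizationConfig(
    embedding=EmbeddingModel.from_pretrained("pyannote/embedding"),
)
pipeline = SpeakerDiarization(config)
```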

For training I recommend pyannote.audio, as it's very reliable for this use case and would give you instant compatibility with diart.