Closed SheenChi closed 3 months ago
Hi @SheenChi, the values I reported were obtained from the output of diart.stream
with my hardware: CPU AMD Ryzen 9 and GPU Nvidia RTX 4060 Max-Q.
If you find the model too slow on your hardware, you can try using pyannote/embedding, which is the fastest one. If that's still not enough, you could try quantizing a model you like or distilling it into a smaller model. Depending on your hardware, I think distillation would be my preferred choice as a first step, but it requires training.
For training, I recommend you use pyannote.audio, as it's very reliable for this use case and would give you instant compatibility with diart.
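To compare numbers across machines, it may help to time the embedding step in isolation. The snippet below is only a minimal sketch using the Python standard library. It does not use diart's API, and `dummy_embedding`, `measure_latency_ms`, `keeps_up` and the 16 kHz sample rate are hypothetical placeholders. Swap `dummy_embedding` for your own model call. The check assumes that, to keep up with the stream, each chunk has to be processed in less time than one step (0.5s here).

```python
import statistics
import time
from typing import Callable, List, Sequence

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio
CHUNK_SECONDS = 5.0    # chunk duration quoted in this thread
STEP_SECONDS = 0.5     # step duration quoted in this thread


def measure_latency_ms(
    fn: Callable[[Sequence[float]], object],
    chunk: Sequence[float],
    warmup: int = 3,
    runs: int = 20,
) -> List[float]:
    """Return wall-clock latencies (ms) of `runs` calls to fn(chunk)."""
    # Warm-up calls are not timed: first calls often include one-off setup cost.
    for _ in range(warmup):
        fn(chunk)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(chunk)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def keeps_up(latency_ms: float, step_seconds: float = STEP_SECONDS) -> bool:
    """True if one chunk is processed in less time than one step."""
    return latency_ms < step_seconds * 1000.0


def dummy_embedding(chunk: Sequence[float]) -> float:
    # Hypothetical stand-in for the real model call (e.g. an ONNX session run).
    return sum(chunk) / len(chunk)


if __name__ == "__main__":
    chunk = [0.0] * int(SAMPLE_RATE * CHUNK_SECONDS)
    latencies = measure_latency_ms(dummy_embedding, chunk)
    median_ms = statistics.median(latencies)
    print(f"median latency: {median_ms:.2f} ms")
    print(f"keeps up with step={STEP_SECONDS}s: {keeps_up(median_ms)}")
```

Under that assumption, 1300ms per chunk is well over the 500ms budget of a 0.5s step, while 48ms and 15ms are comfortably inside it. Keep in mind that the budget applies to the whole pipeline, not only to the embedding model.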
Hello @juanmc2005, I use the hbredin/wespeaker-voxceleb-resnet34-LM (ONNX) model to extract speaker embeddings in the diarization pipeline, but the latency is too high (1300ms per chunk) with the default params (chunk=5s, step=0.5s, latency=0.5), which cannot meet the real-time requirement. I saw that you reported a latency of 48ms on CPU and 15ms on GPU. Is there anything I need to pay attention to when reproducing your performance? Thank you very much for any suggestions.