modelscope / 3D-Speaker

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
Apache License 2.0

Inference acceleration #73

Closed yangyyt closed 6 months ago

yangyyt commented 6 months ago

When applying the speaker classification module to hundreds of millions of audio samples, how can batch inference be performed during VAD and embedding extraction?

Thanks in advance for your reply and suggestions.

wanghuii1 commented 6 months ago

Currently, batch processing is not supported; only multi-process and multi-GPU processing are supported.
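Since only multi-process parallelism is supported, throughput on large datasets comes from sharding the file list across workers. A minimal sketch of that pattern (the `process_chunk` body is a hypothetical placeholder; in practice each worker would load the VAD and speaker models once and then process its shard):

```python
# Hypothetical sketch: shard a wav list across worker processes.
from multiprocessing import Pool

def split_chunks(items, n_workers):
    """Split a list of wav paths into n_workers interleaved chunks."""
    return [items[i::n_workers] for i in range(n_workers)]

def process_chunk(wav_paths):
    # Placeholder for per-process work: in a real pipeline this would
    # load the speaker model once, then run VAD + embedding extraction
    # for every file in the chunk.
    return [(p, "embedding-placeholder") for p in wav_paths]

if __name__ == "__main__":
    wavs = [f"data/utt{i}.wav" for i in range(10)]
    n_workers = 4
    with Pool(n_workers) as pool:
        results = pool.map(process_chunk, split_chunks(wavs, n_workers))
    flat = [r for chunk in results for r in chunk]
    print(f"processed {len(flat)} files")
```

For multi-GPU use, each worker would additionally pin itself to one device (e.g. by setting the device from its worker index) so the processes do not contend for the same GPU.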

yangyyt commented 6 months ago

> Currently, batch processing is not supported; only multi-process and multi-GPU processing are supported.

Thanks. Does ModelScope support a Chinese-English mixed speaker diarization model? I don't seem to see such a model on the website. https://github.com/alibaba-damo-academy/3D-Speaker/tree/main/egs/3dspeaker/speaker-diarization

Also, are there plans to open-source batch inference in the future?

yangyyt commented 6 months ago

I tried the example on ModelScope (https://www.modelscope.cn/models/iic/speech_campplus_speaker-diarization_common/summary) and found that, for the same audio and the same speaker embedding model, 3D-Speaker could correctly distinguish multiple speakers but ModelScope could not. I suspect the ModelScope pipeline and 3D-Speaker have some inconsistent parameters or logic. It also seems that ModelScope doesn't support batch inference either.

wanghuii1 commented 6 months ago

> > Currently, batch processing is not supported; only multi-process and multi-GPU processing are supported.
>
> Thanks. Does ModelScope support a Chinese-English mixed speaker diarization model? I don't seem to see such a model on the website. https://github.com/alibaba-damo-academy/3D-Speaker/tree/main/egs/3dspeaker/speaker-diarization
>
> Also, are there plans to open-source batch inference in the future?

You can change the value of the variable `speaker_model_id` to `iic/speech_campplus_sv_zh_en_16k-common_advanced` to use the Chinese-English mixed model. We will support batch inference in the future. @yangyyt
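In the speaker-diarization recipe, that change amounts to a one-line edit in the inference script (a sketch; the exact file and surrounding code depend on the repo version you are using):

```python
# Point the embedding extractor at the Chinese-English mixed model
# instead of the default Chinese-only model.
speaker_model_id = 'iic/speech_campplus_sv_zh_en_16k-common_advanced'
```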

wanghuii1 commented 6 months ago

> I tried the example on ModelScope (https://www.modelscope.cn/models/iic/speech_campplus_speaker-diarization_common/summary) and found that, for the same audio and the same speaker embedding model, 3D-Speaker could correctly distinguish multiple speakers but ModelScope could not. I suspect the ModelScope pipeline and 3D-Speaker have some inconsistent parameters or logic. It also seems that ModelScope doesn't support batch inference either.

The inference processes of the two are almost identical, with only minor differences. If the outputs differ significantly, the input may be short audio. Since the current pipeline is not robust on short audio, we recommend using longer audio (>1 min). @yangyyt
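Given that recommendation, it can help to screen inputs for length before running the diarization pipeline. A small stdlib-only sketch (the 60-second threshold follows the suggestion above; the function names are my own):

```python
# Sketch: warn when an input PCM wav is too short for robust diarization.
import wave

MIN_DURATION_S = 60.0  # recommended minimum length (>1 min)

def wav_duration(path):
    """Return the duration of a PCM wav file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

def check_input(path):
    """Print a warning for short files; return the duration either way."""
    dur = wav_duration(path)
    if dur < MIN_DURATION_S:
        print(f"warning: {path} is only {dur:.1f}s; diarization may be "
              f"unreliable below {MIN_DURATION_S:.0f}s")
    return dur
```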