huggingface / speechbox

Apache License 2.0

Add support for specifying the number of speakers in ASRDiarizationPipeline #25

Open Demon-tk opened 1 year ago

Demon-tk commented 1 year ago

Hi @speechbox developers,

I've been using the ASRDiarizationPipeline and noticed that there isn't a built-in option to specify the number of speakers when performing diarization. This feature would be very helpful for scenarios where the number of speakers is already known or can be estimated beforehand, as it can potentially improve the performance of the speaker diarization process.

patrickvonplaten commented 1 year ago

cc @sanchit-gandhi

utility-aagrawal commented 1 year ago

@Demon-tk If you need a workaround for the time being, I was able to make num_speakers, min_speakers, and max_speakers work with the following minor change in the diarize.py file -

Now, include any of these 3 arguments along with the audio file like this:

    pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-medium", device=device)
    out = pipeline(input_vid_path, min_speakers=2)

Let me know if you have any questions.

@speechbox developers, let me know if you see anything wrong with this workaround. Thanks!

sanchit-gandhi commented 1 year ago

That's a valid workaround. What we can probably do is have separate kwargs for the diarization pipeline and for the ASR pipeline.

Would you like to open a PR @utility-aagrawal or @Demon-tk to add this support? It would look very similar to the prefixed encoder-decoder kwargs that we have in transformers: https://github.com/huggingface/transformers/blob/dd8b7d28aec80013ad2b25ead4200eea1a6a767e/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L458-L464
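For illustration, here is a minimal sketch of that prefix-based routing, modeled on the `decoder_` prefix convention in transformers. The helper name `split_pipeline_kwargs`, the `asr_` prefix, and the example argument `asr_batch_size` are assumptions made for this sketch; this is not the code that was merged into speechbox.

```python
from typing import Any, Dict, Tuple

# Assumed prefixes for this sketch; only "diarization_" is mentioned in the thread.
ASR_PREFIX = "asr_"
DIARIZATION_PREFIX = "diarization_"


def split_pipeline_kwargs(
    **kwargs: Any,
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Route keyword arguments to the ASR or diarization sub-pipeline by prefix.

    Returns (asr_kwargs, diarization_kwargs, remaining_kwargs). The prefix is
    stripped from routed keys so each sub-pipeline sees its native names.
    """
    asr_kwargs: Dict[str, Any] = {}
    diarization_kwargs: Dict[str, Any] = {}
    remaining_kwargs: Dict[str, Any] = {}

    for name, value in kwargs.items():
        if name.startswith(DIARIZATION_PREFIX):
            diarization_kwargs[name[len(DIARIZATION_PREFIX):]] = value
        elif name.startswith(ASR_PREFIX):
            asr_kwargs[name[len(ASR_PREFIX):]] = value
        else:
            # Unprefixed arguments are left for the combined pipeline itself.
            remaining_kwargs[name] = value

    return asr_kwargs, diarization_kwargs, remaining_kwargs


# Hypothetical call: "diarization_num_speakers" becomes "num_speakers"
# for the diarization model, "asr_batch_size" becomes "batch_size" for ASR.
asr_kwargs, diarization_kwargs, remaining_kwargs = split_pipeline_kwargs(
    diarization_num_speakers=2,
    asr_batch_size=8,
)
```

Stripping the prefix keeps the caller-facing names unambiguous while letting each sub-pipeline receive the argument names it already understands.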

utility-aagrawal commented 1 year ago

Thanks @sanchit-gandhi! I can do that for both issues #25 and #27.

utility-aagrawal commented 1 year ago

@Demon-tk, I have added separate kwargs for the ASR and diarization pipelines. You should be able to specify the number of speakers in the ASRDiarizationPipeline now. Please note that you need to prefix the argument with 'diarization_' for it to reach the diarization pipeline:

    pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-medium", device=device)
    out = pipeline(input_vid_path, diarization_num_speakers=2)

Please close this thread if there are no further questions/issues. Thanks!