juanmc2005 / diart

A python package to build AI-powered real-time audio applications
https://diart.readthedocs.io
MIT License

diart vs whisperx diarization accuracy #226

Closed · nurgel closed this 6 months ago

nurgel commented 6 months ago

trying the whisper_diart example here (https://gist.github.com/juanmc2005/ed6413e697e176cb36a149d8c40a3a5b) on a remote WebsocketAudioSource on an A100 with whisper large, and encountering the following issues in the process with diart:

these did not happen in whisperx out of the box. however, the realtime capabilities of diart are very tempting for a realtime app. are there any parameters that could be tweaked to improve/match the performance?
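
For context, the server side of such a setup looks roughly like this (a sketch adapted from the websocket example in diart's README; the host and port are placeholders, and the gist layers Whisper transcription on top of the diarization stream):

    from diart import SpeakerDiarization
    from diart.inference import StreamingInference
    from diart.sources import WebSocketAudioSource

    pipeline = SpeakerDiarization()
    # Receive audio from a remote client over a websocket
    source = WebSocketAudioSource(pipeline.config.sample_rate, "localhost", 7007)
    inference = StreamingInference(pipeline, source)
    # Send each prediction back to the client as RTTM
    inference.attach_hooks(lambda ann_wav: source.send(ann_wav[0].to_rttm()))
    prediction = inference()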

thaokimctu commented 6 months ago

I think the problem is within the `identify_speakers` function:

        # Assign a speaker to the segment based on diarization
        speakers = dia.labels()
        num_speakers = len(speakers)
        if num_speakers == 0:
            # No speakers were detected
            caption = (-1, segment["text"])
        elif num_speakers == 1:
            # Only one speaker is active in this segment
            spk_id = int(speakers[0].split("speaker")[1])
            caption = (spk_id, segment["text"])
        else:
            # Multiple speakers, select the one that speaks the most
            max_speaker = int(np.argmax([
                dia.label_duration(spk) for spk in speakers
            ]))
            caption = (max_speaker, segment["text"])
        speaker_captions.append(caption)

    return speaker_captions

`max_speaker = int(np.argmax([dia.label_duration(spk) for spk in speakers]))` returns the *index* of the speaker with the longest speaking duration, not the speaker label itself, so I think `caption = (max_speaker, segment["text"])` should be `caption = (speakers[max_speaker], segment["text"])`.
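
For illustration, the corrected branch could also parse the numeric id from the winning label, to stay consistent with the single-speaker branch above (a sketch, assuming diart's usual "speakerN" label format):

        else:
            # Multiple speakers: pick the label of the one that
            # speaks the longest in this segment
            durations = [dia.label_duration(spk) for spk in speakers]
            max_label = speakers[int(np.argmax(durations))]
            # Parse "speakerN" -> N to match the single-speaker branch
            spk_id = int(max_label.split("speaker")[1])
            caption = (spk_id, segment["text"])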

About tweaking parameters, you could check out this issue.

nurgel commented 6 months ago

thank you for the response. tried your suggestion; however, the issue seems to be lower level. overall, this library does not appear to be production-ready for now.

juanmc2005 commented 6 months ago

> however, the issue seems to be lower level.

@nurgel could you explain what you mean by "lower level"?

Remember that offline diarization works with the entire context of a pre-recorded conversation, which is why most state-of-the-art systems nowadays will be way better at determining the number of speakers in a recording.

In streaming diarization, you need to discover speakers as you go, and with little context available (to fulfill real-time requirements). This makes the task considerably more complicated. Streaming diarization is unfortunately not at the level of offline diarization yet.

Moreover, as @thaokimctu correctly suggested, you should look at diart's hyper-parameters, in particular delta_new if you find it tends to create too many speakers (I suggest you try increasing it). These hyper-parameters should be tuned on conversations similar to what you expect to see in production, and you may need to collect some data to do this; as with anything in machine learning, there is no free lunch. Additionally, you may try the many new models that became compatible as part of v0.9.
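
For instance, raising delta_new could look like this (a sketch assuming v0.9's `SpeakerDiarizationConfig`; the value is illustrative, not a recommendation):

    from diart import SpeakerDiarization, SpeakerDiarizationConfig

    # A higher delta_new makes the pipeline more conservative about
    # adding new speakers (illustrative value; tune on your own data)
    config = SpeakerDiarizationConfig(delta_new=1.5)
    pipeline = SpeakerDiarization(config)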

On the other hand, the gist combining diart and whisper is meant as a demo of diart's composability, not a production-ready solution. In fact, the transcription feature is still a work in progress and hasn't been released officially. Many improvements can be made to the solution I shared, certainly more than my free time allows me to develop.

If you find something could be improved, I would gladly welcome ideas and contributions.

nurgel commented 6 months ago

thank you for the insightful response @juanmc2005

by ‘lower level’ i meant not related to the code given in the gist, but to the modules or the model weights used.

the difficulty of realtime diarization is clear, considering there is no viable alternative to diart. i am rushing deadlines, so i was mostly looking for a free lunch general enough to work magically with minimal effort on my side (which somewhat sounds like AGI) :) looking forward to SpeakerAwareTranscription if/when you decide to share it with the world. all the best!