m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 4-Clause "Original" or "Old" License

speaker tag identification number error and error rate are unacceptable #640

Open JiangN6 opened 6 months ago

JiangN6 commented 6 months ago

If I set both the minimum and maximum number of speakers to 2, the error rate of speaker tag identification skyrockets: a minute-long stretch of conversation between two people is assigned to a single speaker tag. If the number of speakers is not set, several extra speaker tags are identified. Beyond these two cases, short replies such as "hello" and "yes" are frequently not given a proper speaker tag.

Is there anything you can do to solve or mitigate this problem?

When the number of speakers is not set, four speaker tags (00, 01, 02 and 03) are identified: [screenshot: Dingtalk_20231222161858]

When the minimum and maximum number of speakers are both set to 2, two speaker tags (00 and 01) are recognized, but the accuracy is too poor for me to accept: [screenshot: Dingtalk_20231222161942]

[screenshot: Dingtalk_20231222161021]
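
To put numbers on the symptoms described above (extra speaker tags, merged speakers, short replies with no tag), it can help to tally the diarized output before and after changing the speaker-count settings. The following is a minimal sketch, not part of whisperX: `summarize_speakers` is a hypothetical helper, and it assumes each segment is a dict with `start` and `end` times in seconds plus an optional `speaker` key, which is the shape the transcript segments are assumed to have after speaker assignment. The example segments are made up for illustration.

```python
from collections import defaultdict


def summarize_speakers(segments):
    """Return (seconds of speech per speaker tag, segments with no tag).

    Hypothetical helper; assumes each segment is a dict with "start" and
    "end" (seconds) and, when diarization matched it, a "speaker" key.
    """
    talk_time = defaultdict(float)
    unlabeled = []
    for seg in segments:
        speaker = seg.get("speaker")
        if speaker is None:
            # Short replies ("hello", "yes") may end up here.
            unlabeled.append(seg)
            continue
        talk_time[speaker] += seg["end"] - seg["start"]
    return dict(talk_time), unlabeled


# Made-up segments, for illustration only.
example_segments = [
    {"start": 0.0, "end": 4.0, "text": "...", "speaker": "SPEAKER_00"},
    {"start": 4.0, "end": 4.5, "text": "yes"},
    {"start": 4.5, "end": 10.5, "text": "...", "speaker": "SPEAKER_01"},
    {"start": 10.5, "end": 12.5, "text": "...", "speaker": "SPEAKER_00"},
]

talk_time, unlabeled = summarize_speakers(example_segments)
print(talk_time)
print([seg["text"] for seg in unlabeled])
```

Comparing the number of distinct tags and the list of untagged segments across runs (speaker count unset versus fixed at 2) gives a rough, reproducible picture of which failure mode dominates for a given recording.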

krishnareddyML commented 4 months ago

Have you found any solution for improving the diarization accuracy? I am also having trouble getting correct speaker tags: sometimes a conversation between two distinct speakers gets mixed up and shown under a single speaker.