Speaker diarization returns the wrong number of speaker tags, and the error rate is unacceptable

If I set both the minimum and maximum number of speakers to 2, the speaker tag error rate skyrockets: a minute-long stretch of conversation between two people is assigned to a single speaker tag. If I do not set the number of speakers, several speaker tags are identified, more than the two people who are actually talking. In both cases, short replies such as "hello" and "yes" are often not given a normal speaker label.

Is there anything that can be done to fix this or to improve the accuracy?

When the number of speakers is not set, four speaker tags (00, 01, 02 and 03) are identified. Screenshot: Dingtalk_20231222161858

When the minimum and maximum number of speakers are both set to 2, two speaker tags (00 and 01) are recognized, but the accuracy is too poor to be usable. Screenshot: Dingtalk_20231222161942

Screenshot: Dingtalk_20231222161021
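To compare the two settings with numbers rather than screenshots, a small helper along these lines could summarize a diarized result. This is only a sketch under assumptions: it assumes the output is a list of segment dicts with `start`, `end`, and an optional `speaker` key, and the sample segments below are made up for illustration, not taken from the audio in this report. The function name `summarize_speakers` is hypothetical.

```python
from collections import defaultdict


def summarize_speakers(segments):
    """Return (seconds of speech per speaker label, count of unlabeled segments).

    Assumes each segment is a dict with "start" and "end" in seconds and an
    optional "speaker" key; segments without that key are counted as unlabeled.
    """
    seconds = defaultdict(float)
    unlabeled = 0
    for seg in segments:
        speaker = seg.get("speaker")
        if speaker is None:
            # e.g. a short "yes" or "hello" that received no speaker label
            unlabeled += 1
            continue
        seconds[speaker] += seg["end"] - seg["start"]
    return dict(seconds), unlabeled


# Hypothetical segments for illustration only.
segments = [
    {"start": 0.0, "end": 4.0, "text": "...", "speaker": "SPEAKER_00"},
    {"start": 4.0, "end": 4.5, "text": "yes"},
    {"start": 4.5, "end": 10.5, "text": "...", "speaker": "SPEAKER_01"},
    {"start": 10.5, "end": 12.5, "text": "...", "speaker": "SPEAKER_00"},
]

seconds, unlabeled = summarize_speakers(segments)
print("distinct speaker labels:", len(seconds))
print("seconds per label:", seconds)
print("segments without a label:", unlabeled)
```

Running this once with the speaker count unset and once with it fixed to 2 would show how many labels appear, how the speaking time is split between them, and how many short replies end up without a label.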

m-bain / whisperX, issue #640