taylorlu / Speaker-Diarization

speaker diarization by uis-rnn and speaker embedding by vgg-speaker-recognition
Apache License 2.0

Very poor performance on my own wav file, is there anything wrong? #40

Open · Yunlong-He opened 3 years ago

Yunlong-He commented 3 years ago

I just did a quick test with my own phone-call wave file, which is about 2.5 minutes long and has only 2 speakers. However, with the pretrained model in this project, it returns 3 speakers, and many slices contain voices from 2 speakers. I know that uis-rnn doesn't support setting the number of speakers, but the performance seems unreasonably poor. Has anyone else run into this?

Thanks for any suggestions.
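
One thing I still plan to double-check is the input format. Below is a minimal conversion sketch I'm considering before running the diarizer, assuming the pretrained embedding model expects 16 kHz mono 16-bit audio; that expectation and the file names are my assumptions, not something confirmed from this repo. Phone-call recordings are often 8 kHz and/or stereo, which could also hurt the results.

from pydub import AudioSegment

# Hedged sketch: convert a phone recording to 16 kHz mono 16-bit WAV before
# diarization. The expected format is an assumption; check the repo's
# feature-extraction code to confirm.
call = AudioSegment.from_file("my_phone_call.wav")      # placeholder input path
call = call.set_channels(1)                             # downmix to mono
call = call.set_frame_rate(16000)                       # resample to 16 kHz
call = call.set_sample_width(2)                         # 16-bit samples
call.export("my_phone_call_16k.wav", format="wav")      # placeholder output path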

ShuningZhao commented 3 years ago

Is your own wav file in English or Chinese? I had a look at the wav file in this repo, and it seems the models were trained on Mandarin; hence the results were bad on my English wav files.

Yunlong-He commented 3 years ago

Thanks, Shuning. I verified on Mandarin wave files too, and I also checked the sample result provided in this project; it doesn't seem very good either. I use the following code to split the wave file:

from pydub import AudioSegment   # pip install pydub (ffmpeg needed for non-wav input)
import os

# speakerSlice comes from the diarization step: {speaker_id: [{'start': ms, 'stop': ms}, ...]}
sound = AudioSegment.from_file(wav_path)
for spk, timeDicts in speakerSlice.items():
    print('========= ' + str(spk) + ' =========')
    out_dir = "wavs/rmd/" + str(spk)
    os.makedirs(out_dir, exist_ok=True)        # export fails if the folder doesn't exist
    for index, timeDict in enumerate(timeDicts):
        s = timeDict['start']                  # pydub slices by milliseconds
        e = timeDict['stop']
        seg = sound[s:e]
        seg.export(out_dir + "/seg_%d.wav" % index, format="wav")
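
To sanity-check the result, I also print each speaker's total talk time and segment boundaries in mm:ss from the same speakerSlice dict. This is just a quick inspection sketch, assuming start/stop are milliseconds as in the snippet above:

def fmt(ms):
    # format a millisecond offset as mm:ss.mmm
    return '%02d:%06.3f' % (ms // 60000, (ms % 60000) / 1000.0)

for spk, timeDicts in speakerSlice.items():
    total = sum(t['stop'] - t['start'] for t in timeDicts)
    print('speaker %s: %s of speech' % (spk, fmt(total)))
    for t in timeDicts:
        print('  %s -> %s' % (fmt(t['start']), fmt(t['stop'])))

Printing the timeline this way makes it easier to spot a spurious third speaker or heavily overlapping slices.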