My understanding is that you trained a model with different datasets according to the callhome recipe.
In my script, the speaker id in the RTTM is generated as file_id + '_' + output_index. So 2290120-audio_0 means the first (0-th index) speaker of 2290120-audio. You might find 2290120-audio_1 as the second speaker, or no second speaker.
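In other words, something along these lines (a minimal sketch; the function name is illustrative, not the script's actual code):

```python
def rttm_speaker_id(file_id: str, output_index: int) -> str:
    """Build the anonymous speaker label written to the RTTM:
    the recording's file id plus the model's output channel index."""
    return f"{file_id}_{output_index}"

# rttm_speaker_id("2290120-audio", 0) -> "2290120-audio_0"
# rttm_speaker_id("2290120-audio", 1) -> "2290120-audio_1"
```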
I see, but it never identifies the second speaker; the label is always file_id + '_0'. However, I have multiple speakers in the test file. Do you have any idea why it can't recognize more than one speaker? (I have trained on around 97 hours of multi-speaker data.)
Just for debugging, you can feed the training samples into the evaluation step.
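One quick sanity check in that spirit, assuming the model exposes frame-wise speaker-activity posteriors as a T x S array (the names and shapes here are assumptions for illustration, not the repo's actual API):

```python
import numpy as np

def active_speakers(posteriors: np.ndarray, threshold: float = 0.5) -> list[int]:
    """Given per-frame, per-speaker activity posteriors (shape T x S),
    return the output indices that are ever active above the threshold.
    If only index 0 appears even on training samples, the remaining
    output heads are effectively never firing."""
    active = (posteriors > threshold).any(axis=0)  # one bool per output head
    return [i for i, a in enumerate(active) if a]
```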
I have never tried less than 100 hours of training data. I usually use over 2,000 hours of speech generated by simulation, and a small real dataset for adaptation. I would have to say our method has not been validated in a small-data setup.
I have trained a model using the callhome recipe. The generated RTTM looks as follows:
SPEAKER 2290120-audio 1 0.00 0.55 <NA> <NA> 2290120-audio_0 <NA>
SPEAKER 2290120-audio 1 0.60 1.05 <NA> <NA> 2290120-audio_0 <NA>
SPEAKER 2290120-audio 1 1.75 0.20 <NA> <NA> 2290120-audio_0 <NA>
Is there any way to identify who spoke when? For example, the RTTM generated by the Kaldi diarization recipe looks as follows:
SPEAKER 2290120-audio 1 0.00 0.55 <NA> <NA> A <NA>
SPEAKER 2290120-audio 1 0.60 1.05 <NA> <NA> B <NA>
SPEAKER 2290120-audio 1 1.75 0.20 <NA> <NA> A <NA>
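For what it's worth, both formats already answer "who spoke when": the 8th RTTM field is just an anonymous per-file speaker label, so 2290120-audio_0 plays the same role as A. If you prefer letter labels, a small sketch (assuming standard whitespace-separated RTTM lines as above) that remaps them:

```python
import string

def relabel_rttm(path: str) -> list[tuple[float, float, str]]:
    """Read an RTTM file and return (start, duration, letter) segments,
    mapping each distinct speaker id (e.g. '2290120-audio_0') to A, B, ..."""
    letters: dict[str, str] = {}
    segments = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            # RTTM fields: type file chnl tbeg tdur ortho stype name conf
            start, dur, spk = float(fields[3]), float(fields[4]), fields[7]
            letter = letters.setdefault(spk, string.ascii_uppercase[len(letters)])
            segments.append((start, dur, letter))
    return segments

# Every '2290120-audio_0' segment becomes 'A', and '2290120-audio_1'
# (if the model ever emits a second speaker) becomes 'B'.
```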