hitachi-speech / EEND

End-to-End Neural Diarization

Unable to identify speaker cluster from generated RTTM #4

Closed · Durgesh92 closed this issue 3 years ago

Durgesh92 commented 4 years ago

I have trained a model using the callhome recipe. The generated RTTM looks as follows:

SPEAKER 2290120-audio 1 0.00 0.55 <NA> <NA> 2290120-audio_0 <NA>
SPEAKER 2290120-audio 1 0.60 1.05 <NA> <NA> 2290120-audio_0 <NA>
SPEAKER 2290120-audio 1 1.75 0.20 <NA> <NA> 2290120-audio_0 <NA>

Is there any way to identify who spoke when? For example, the RTTM generated by the Kaldi diarization recipe looks as follows:

SPEAKER 2290120-audio 1 0.00 0.55 <NA> <NA> A <NA>
SPEAKER 2290120-audio 1 0.60 1.05 <NA> <NA> B <NA>
SPEAKER 2290120-audio 1 1.75 0.20 <NA> <NA> A <NA>
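(For reference, here is a minimal sketch of recovering "who spoke when" from an RTTM file like the ones above. It assumes the standard SPEAKER-line layout shown in these examples; the file name `2290120-audio.rttm` is just a placeholder.)

```python
# Minimal sketch: list "who spoke when" from an RTTM file.
# Assumed field layout (as in the examples above):
# SPEAKER <file-id> <channel> <start> <duration> <NA> <NA> <speaker> <NA>
from collections import defaultdict

def segments_by_speaker(rttm_path):
    segments = defaultdict(list)  # speaker label -> list of (start, end) in seconds
    with open(rttm_path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            start, dur = float(fields[3]), float(fields[4])
            speaker = fields[7]
            segments[speaker].append((start, start + dur))
    return segments

# Example usage with a hypothetical file name:
for spk, segs in segments_by_speaker("2290120-audio.rttm").items():
    for start, end in segs:
        print(f"{spk}: {start:.2f}-{end:.2f}")
```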

yubouf commented 4 years ago

My understanding is that you trained a model with different datasets according to the callhome recipe. In my script, the speaker id in the RTTM is generated as file_id + '_' + output_index. So 2290120-audio_0 means the first (0-th index) speaker of 2290120-audio. You might find 2290120-audio_1 as the second speaker, or no second speaker at all.
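(Given that naming convention, a quick way to check whether a second speaker was ever predicted is to count the distinct output indices per file in the generated RTTM. This is only a sketch; the path `hyp.rttm` is a placeholder for the generated file.)

```python
# Minimal sketch: count distinct predicted speakers per file in a generated RTTM,
# assuming speaker labels follow the file_id + '_' + output_index convention above.
from collections import defaultdict

speakers_per_file = defaultdict(set)
with open("hyp.rttm") as f:  # placeholder path for the generated RTTM
    for line in f:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue
        file_id, speaker = fields[1], fields[7]
        speakers_per_file[file_id].add(speaker)

for file_id, speakers in speakers_per_file.items():
    # If only file_id + '_0' ever appears, the model predicted a single speaker.
    print(file_id, sorted(speakers))
```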

Durgesh92 commented 4 years ago

I see, but it never identifies the second speaker; it is always file_id + '_0'. However, I have multiple speakers in the test file. Do you have any idea why it can't recognize more than one speaker? (I have trained on around 97 hours of multi-speaker data.)

yubouf commented 4 years ago

Just for debugging, you can feed the training samples into the evaluation step.

I have never tried fewer than 100 hours of training data. I usually use over 2,000 hours of speech generated by simulation, plus a small real dataset for adaptation. I would have to say our method has not been validated in a small-data setup.