hitachi-speech / EEND

End-to-End Neural Diarization
MIT License

How to evaluate long recordings with SA-EEND. #22

Closed by IvanAntipov 3 years ago

IvanAntipov commented 3 years ago

The EEND self-attention paper states: "we split the input audio recordings into non-overlapping 50-second segments. At the inference stage, we used the entire sequence for each recording".

As far as I understand, the receptive field of the TransformerEncoder is limited by n_units. I can't just split the recording into 50-second segments and evaluate each segment separately, because the speaker labels can come out in a different order in each segment.
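To make the concern concrete, here is a minimal NumPy sketch. It is not code from the EEND repository: the chunk length, the toy labels and the helper name `split_into_chunks` are all made up for illustration. It splits a recording into non-overlapping chunks and shows what happens to the concatenated result when the second chunk comes back with its two speaker columns swapped.

```python
import numpy as np

def split_into_chunks(n_frames, chunk_frames):
    """Return (start, end) frame indices of non-overlapping chunks (hypothetical helper)."""
    return [(s, min(s + chunk_frames, n_frames))
            for s in range(0, n_frames, chunk_frames)]

# Toy reference labels: 12 frames, 2 speakers (1 = speaker active).
reference = np.zeros((12, 2), dtype=int)
reference[0:8, 0] = 1    # speaker 0 talks in frames 0..7
reference[4:12, 1] = 1   # speaker 1 talks in frames 4..11

chunks = split_into_chunks(n_frames=12, chunk_frames=6)

# Pretend a model labels each chunk perfectly, but returns the
# second chunk with its speaker columns in the opposite order.
outputs = []
for i, (start, end) in enumerate(chunks):
    chunk_out = reference[start:end]
    if i == 1:
        chunk_out = chunk_out[:, ::-1]   # speaker order swapped in this chunk
    outputs.append(chunk_out)

stitched = np.concatenate(outputs, axis=0)

# Frames where the naive concatenation disagrees with the reference.
n_wrong = int(np.any(stitched != reference, axis=1).sum())
print(chunks, n_wrong)
```

Each chunk is correct on its own, yet the stitched sequence mislabels the frames where only one speaker is active in the swapped chunk.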

What is the right way to handle long recordings with the SA-EEND model?

IvanAntipov commented 3 years ago

I found more recent articles about EEND that address this problem. See "Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech" by Keisuke Kinoshita, Marc Delcroix and Naohiro Tawara.
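The general idea in that line of work, as far as I can tell, is to process the recording chunk by chunk, obtain a speaker embedding for every speaker in every chunk, and then cluster the embeddings across chunks so that the same speaker ends up with the same global label. The sketch below is a heavily simplified stand-in and not the authors' method: it uses made-up embeddings, replaces the constrained clustering with a brute-force alignment of every chunk to the first chunk, and assumes every speaker appears in the first chunk. The names `align_chunks`, `activities` and `embeddings` are hypothetical.

```python
from itertools import permutations

import numpy as np

def align_chunks(activities, embeddings):
    """Reorder each chunk's speaker columns to match the first chunk.

    activities: list of (frames, n_spk) arrays, one per chunk
    embeddings: list of (n_spk, dim) arrays, one embedding per local speaker
    """
    reference = embeddings[0]            # first chunk defines the global order
    n_spk = reference.shape[0]
    aligned = []
    for act, emb in zip(activities, embeddings):
        # Try every local-to-global assignment and keep the closest one.
        # A permutation uses each local speaker exactly once, so two
        # speakers of the same chunk never share a global label.
        best = min(permutations(range(n_spk)),
                   key=lambda p: np.sum((emb[list(p)] - reference) ** 2))
        aligned.append(act[:, list(best)])
    return np.concatenate(aligned, axis=0)

# Toy data: one speaker talks throughout; the second chunk lists the
# two local speakers in the opposite order from the first chunk.
act1 = np.array([[1, 0], [1, 0], [1, 0]])
emb1 = np.array([[1.0, 0.0], [0.0, 1.0]])
act2 = np.array([[0, 1], [0, 1], [0, 1]])
emb2 = np.array([[0.1, 0.9], [0.9, 0.1]])

stitched = align_chunks([act1, act2], [emb1, emb2])
print(stitched)
```

Brute-force search over permutations is only practical for a handful of speakers, and anchoring on the first chunk breaks down when a speaker first appears later in the recording, which is one reason a proper clustering step is used instead.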