Closed: IvanAntipov closed this issue 3 years ago
I found more recent articles about EEND that address this problem.
See "Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech" by Keisuke Kinoshita, Marc Delcroix, and Naohiro Tawara.
The EEND self-attention paper states:
"we split the input audio recordings into non-overlapping 50-second segments. At the inference stage, we used the entire sequence for each recording"
As far as I understand, the receptive field of the TransformerEncoder is limited by n_units. I can't simply split a recording into 50-second segments and evaluate each segment separately, because the speaker labeling can differ from segment to segment.
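To illustrate the labeling problem, here is a minimal sketch in NumPy. It does not call SA-EEND. The reference activity matrix, the 3-frame segment length, the `split_segments` helper, and the per-segment column orders are all made up for illustration. The second segment's output is assumed to come back with its speaker columns swapped.

```python
import numpy as np

def split_segments(n_frames, seg_len):
    """Return (start, end) frame indices of non-overlapping segments (hypothetical helper)."""
    return [(s, min(s + seg_len, n_frames)) for s in range(0, n_frames, seg_len)]

# Hypothetical reference speaker activity: rows are frames, columns are speakers.
reference = np.array([[1, 0],
                      [1, 0],
                      [1, 1],
                      [0, 1],
                      [0, 1],
                      [1, 0]])

# Assumption: each segment is diarized correctly, but only up to the order
# of its output columns. Here the second segment has its columns swapped.
segment_perms = [(0, 1), (1, 0)]
bounds = split_segments(len(reference), seg_len=3)
outputs = [reference[s:e][:, list(p)] for (s, e), p in zip(bounds, segment_perms)]

# Naive stitching: concatenate the per-segment outputs as they are.
stitched = np.concatenate(outputs, axis=0)

# Count frames whose speaker labels disagree with the reference.
frames_wrong = int((stitched != reference).any(axis=1).sum())
print("segment bounds:", bounds)
print("frames with inconsistent labels:", frames_wrong)
```

Each segment is correct on its own under some column order, but the concatenated result is not consistent with the reference unless the per-segment order is resolved somehow.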
What is the right way to handle a long recording with the SA-EEND model?