hitachi-speech / EEND

End-to-End Neural Diarization

Question about infer.py for EDA #30

Open Achronferry opened 2 years ago

Achronferry commented 2 years ago

Hello! I read the infer.py file, and as I understand it, it first divides the complete audio into chunks and feeds these chunks into the model. At the end, it stacks all the chunk outputs to produce the RTTM file:

```python
out_chunks.append(ys[0].data)
......
out_chunks = [np.insert(o, o.shape[1],
                        np.zeros((max_n_speakers - o.shape[1], o.shape[0])),
                        axis=1)
              for o in out_chunks]
outdata = np.vstack(out_chunks)
```

I'm a little confused about how you make sure the speaker order of each chunk is consistent for the EDA model. Because the attractors in EDA are generated dynamically from each chunk, one speaker may even disappear in another chunk of the same audio?
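For illustration, here is a minimal numpy sketch of the concern being raised, using made-up per-chunk posteriors (`chunk_a` and `chunk_b` are hypothetical, not variables from infer.py): the padding-and-stacking step aligns speaker columns by index only, so nothing in it re-identifies speakers across chunks.

```python
import numpy as np

# Hypothetical per-chunk posteriors with shape (frames, n_speakers).
# Chunk A sees 2 speakers; chunk B sees only 1, and EDA gives no guarantee
# that its first attractor corresponds to chunk A's first column.
chunk_a = np.array([[0.9, 0.1],
                    [0.8, 0.2]])   # shape (2, 2)
chunk_b = np.array([[0.7],
                    [0.6]])        # shape (2, 1)
out_chunks = [chunk_a, chunk_b]

# Same padding logic as the quoted snippet: pad every chunk with zero columns
# up to max_n_speakers, then stack along the time axis.
max_n_speakers = max(o.shape[1] for o in out_chunks)
out_chunks = [np.insert(o, o.shape[1],
                        np.zeros((max_n_speakers - o.shape[1], o.shape[0])),
                        axis=1)
              for o in out_chunks]
outdata = np.vstack(out_chunks)    # shape (4, 2)

# Column 0 now mixes chunk A's first speaker with chunk B's single speaker,
# even if they are different people: the stacking itself does not align
# speaker identities across chunks.
print(outdata)
```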

shota-horiguchi commented 2 years ago

Chunking during inference is not expected. Please make sure that the chunk size is large enough so that the recording is not split during inference.
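As a rough sanity check of what "large enough" means, here is a small sketch. The 10 ms frame shift and subsampling factor of 10 are assumptions about the feature configuration, and the exact name and units of the chunk-size option should be verified against infer.py's argument parser.

```python
# Back-of-the-envelope estimate of the chunk size needed to keep a whole
# recording in a single chunk. Frame shift and subsampling are assumptions;
# adjust them to match your actual feature configuration.
recording_sec = 600        # e.g. a 10-minute recording
frame_shift_sec = 0.010    # assumed 10 ms frame shift
subsampling = 10           # assumed output subsampling factor
frames = int(recording_sec / (frame_shift_sec * subsampling))
print(frames)  # 6000 -> pick a chunk size of at least this many output frames
```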