hitachi-speech / EEND

End-to-End Neural Diarization

Question about infer.py for EDA #30

Open · Achronferry opened this issue 3 years ago

Achronferry commented 3 years ago

Hello! I read the infer.py file, and as I understand it, it first divides the complete audio into chunks and feeds these chunks into the model. At the end, it stacks all the outputs to make the rttm file:

```python
out_chunks.append(ys[0].data)
......
out_chunks = [np.insert(o, o.shape[1],
                        np.zeros((max_n_speakers - o.shape[1], o.shape[0])),
                        axis=1)
              for o in out_chunks]
outdata = np.vstack(out_chunks)
```

I'm a little confused about how you can make sure the speaker order of each chunk is consistent for the EDA model. The attractors in EDA are dynamically generated from each chunk, so a speaker present in one chunk may be absent in another chunk of the same audio.
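For readers following along, here is a minimal standalone sketch of what the quoted stacking does (toy numpy arrays, not the repository's actual pipeline): each chunk's (frames × speakers) posterior matrix is padded with zero columns up to the largest per-chunk speaker count and then concatenated along time. Nothing in this step aligns the speaker columns across chunks, which is exactly the concern raised above.

```python
import numpy as np

# Toy per-chunk outputs with shape (frames, n_speakers), standing in for the
# ys[0].data arrays appended to out_chunks in infer.py. The speaker count can
# differ per chunk because EDA decides the number of attractors chunk by chunk.
out_chunks = [
    np.random.rand(500, 2),  # chunk 1: EDA produced 2 attractors
    np.random.rand(500, 3),  # chunk 2: EDA produced 3 attractors
]

max_n_speakers = max(o.shape[1] for o in out_chunks)

# Pad each chunk with zero columns up to max_n_speakers, then stack along time.
out_chunks = [
    np.insert(o, o.shape[1],
              np.zeros((max_n_speakers - o.shape[1], o.shape[0])),
              axis=1)
    for o in out_chunks
]
outdata = np.vstack(out_chunks)
print(outdata.shape)  # (1000, 3)

# Caveat: column k of chunk 1 and column k of chunk 2 are not guaranteed to be
# the same speaker; the attractor order is decided independently per chunk.
```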

shota-horiguchi commented 2 years ago

Chunking during inference is not expected. Please make sure that the chunk size is large enough so that the recording is not split during inference.
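As a rough sanity check (a sketch under assumptions, not part of this repository), you can estimate the minimum chunk size from the length of your longest recording, assuming a 10 ms frame shift and a subsampling factor of 10 (adjust these to your feature config), and pass a value at least that large to infer.py; check the script's arguments for the exact option name in your checkout.

```python
# Back-of-the-envelope check that no recording will be split at inference.
# Assumptions (adjust to your config): 10 ms frame shift, subsampling factor 10.
longest_recording_sec = 600        # e.g. the longest recording in your eval set
frame_shift_sec = 0.01             # 10 ms hop between feature frames
subsampling = 10                   # frame subsampling applied before the model

min_chunk_size = int(longest_recording_sec / frame_shift_sec / subsampling)
print("pass a chunk size of at least", min_chunk_size, "frames")  # 6000 here
```

With a chunk size above this bound, each recording is processed as a single chunk, so the EDA attractors (and hence the speaker column order) are estimated once per recording and the stacking step never mixes inconsistent speaker orders.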