Closed — stepbystep88 closed this issue 1 month ago
There is no explicit signaling. The model consistently outputs an audio stream, which sometimes decodes to silence and sometimes does not. One can infer that the model is silent when the text tokens it outputs are PAD tokens (e.g. 3). The system is not perfect, in particular because of biases in the synthetic training data that can make Moshi think the user is not done talking when they are; we will be looking into improving that in the future.
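As a minimal sketch of the idea above: watch the text-token stream and treat a run of PAD tokens as silence. The PAD id of 3 comes from the comment above; the function name, the window size, and the token list are illustrative assumptions, not Moshi's actual API.

```python
# Hypothetical sketch: infer model silence from the text-token stream.
# PAD token id 3 is taken from the discussion above; everything else
# (function name, window size) is an illustrative assumption.
PAD_TOKEN_ID = 3

def is_model_silent(recent_text_tokens, window=5):
    """Treat the model as silent when its last `window` text tokens are all PAD."""
    if len(recent_text_tokens) < window:
        return False
    return all(t == PAD_TOKEN_ID for t in recent_text_tokens[-window:])

# Example: text tokens emitted alongside the audio codes during streaming.
tokens = [101, 57, 3, 3, 3, 3, 3]
print(is_model_silent(tokens))  # True: the last 5 tokens are PAD
```

In practice you would call such a check on each generation step and, for example, debounce over a short window rather than reacting to a single PAD token, since isolated PADs also appear mid-utterance.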
Due diligence
Topic
The paper
Question
Could you please provide more details on how interruption is implemented and how the model is signaled when to respond? For example, how does the model recognize that the user has finished speaking, especially in noisy environments? And how do the two streams work together in the code?