kyutai-labs / moshi

Apache License 2.0

How to implement interruption? How do the two streams work together in the code? #110

Closed stepbystep88 closed 1 month ago

stepbystep88 commented 1 month ago

Topic

The paper

Question

Could you please provide more details on how to implement interruption and how to signal to the model when it should respond? For example, how does the model recognize that the user has finished speaking, especially in noisy environments? Also, how do the two streams work together in the code?

adefossez commented 1 month ago

There is no explicit signaling. The model consistently outputs an audio stream, which sometimes decodes to silence and sometimes does not. One can infer that the model is silent when the text tokens it outputs are PAD tokens (e.g. 3). The system is not perfect, in particular because of biases in the synthetic data used, which can make Moshi think the user is still talking when they have finished, but we will be looking into improving that in the future.
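
For readers who want to try the PAD-token heuristic described above, here is a minimal sketch. It does not use the moshi API: it only assumes you already have the stream of text token ids the model emits at each step. The `SilenceTracker` class, the `silent_flags` helper, and the `window` parameter are hypothetical names introduced for illustration, and the PAD id of 3 is the example value from the comment, so check it against the tokenizer you actually use. Requiring several consecutive PAD tokens is an added assumption meant to avoid reacting to a single PAD between words.

```python
from collections import deque
from typing import Iterable, List

# Example PAD id taken from the comment above; verify it for your tokenizer.
PAD_TOKEN_ID = 3


class SilenceTracker:
    """Hypothetical helper: reports silence once the last `window`
    text tokens emitted by the model were all PAD tokens."""

    def __init__(self, pad_id: int = PAD_TOKEN_ID, window: int = 5) -> None:
        self.pad_id = pad_id
        # Sliding window of booleans: True where the token was PAD.
        self.recent = deque(maxlen=window)

    def update(self, text_token: int) -> bool:
        """Feed one text token; return True if the model looks silent."""
        self.recent.append(text_token == self.pad_id)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)


def silent_flags(text_tokens: Iterable[int],
                 pad_id: int = PAD_TOKEN_ID,
                 window: int = 5) -> List[bool]:
    """Return one silence flag per step for a sequence of text tokens."""
    tracker = SilenceTracker(pad_id=pad_id, window=window)
    return [tracker.update(token) for token in text_tokens]


if __name__ == "__main__":
    # Made-up token ids, only to show the behavior of the heuristic.
    tokens = [3, 3, 3, 17, 3, 3, 3]
    print(silent_flags(tokens, window=3))
```

A larger `window` makes the detector slower to declare silence but less likely to flip on short pauses. This only tells you when the model itself is silent; it does not address the end-of-turn biases mentioned in the comment.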