kyutai-labs / moshi

TTS model #97

Closed · Mbonea-Mjema closed this 1 month ago

Mbonea-Mjema commented 1 month ago

I want to use the tts, not the entire model. Is it possible?

alkeryn commented 1 month ago

@Mbonea-Mjema afaik it's not using a TTS, and there are FOSS TTS models that would work much better than this anyway.

Mbonea-Mjema commented 1 month ago

I also want emotion in the voice, do you have any ideas how I can achieve that?

guii commented 1 month ago

It is possible to "trick" Moshi into doing zero-shot TTS (or ASR, by the way). The hack is explained in Appendix C of the paper.
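Roughly, the hack amounts to teacher-forcing the text stream with your target text at every step and keeping only the sampled audio tokens. A minimal sketch of that control flow, assuming a hypothetical per-step interface (`step_fn`, `PAD`, and the token ids are made up, not the actual moshi API):

```python
from typing import Callable, List, Tuple

PAD = 3  # hypothetical id of the text padding token


def forced_text_tts(
    text_tokens: List[int],
    step_fn: Callable[[int, List[int]], Tuple[int, List[int]]],
    n_steps: int,
) -> List[List[int]]:
    """Run the model step by step, forcing the text stream to `text_tokens`
    and collecting only the sampled audio codebook tokens.

    `step_fn(text_in, audio_in) -> (text_out, audio_out)` stands in for one
    autoregressive step of the full multi-stream model."""
    audio_frames: List[List[int]] = []
    audio_in: List[int] = []  # previous frame of audio tokens (empty at t=0)
    for t in range(n_steps):
        # Instead of letting the model predict the next text token,
        # overwrite it with the target text (padding once the text runs out).
        forced_text = text_tokens[t] if t < len(text_tokens) else PAD
        _, audio_out = step_fn(forced_text, audio_in)
        audio_frames.append(audio_out)
        audio_in = audio_out
    return audio_frames
```

The same idea in reverse (forcing the audio stream and reading off the generated text) gives the zero-shot ASR variant.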

alkeryn commented 1 month ago

@guii nice, though i don't think it'd be worth it, as there are much more human-like sounding TTS models nowadays, either closed like ElevenLabs or open like Tortoise and many others.

LaurentMazare commented 1 month ago

As mentioned, it's not possible with the currently released version of moshi. We have some TTS versions internally that we're pretty happy with and are considering releasing, but we have to be careful about how we release them (because of voice-cloning possibilities, for example). No estimated timeline yet, but we'll announce it on twitter/... if we release them at some point.

alkeryn commented 1 month ago

@LaurentMazare voice cloning can already be done with open-source tools (look up OpenVoice, for example), without even talking about non-FOSS services like ElevenLabs. i don't think you are enabling something that isn't already possible by open-sourcing your TTS. i do understand the concern though.

girishp1983 commented 1 month ago

@LaurentMazare I read the Moshi paper. Very impressed with the level of detail. In some places, though, it gives the mistaken impression that TTS/ASR can be done with the same model (weights) by introducing a 2-second delay. However, later (in Section 5.7) it is mentioned that you trained separate TTS and ASR models (likely following the same architecture). It would be great if this were clarified earlier (in Section 3.4, under the subsection 'Deriving Streaming ASR and TTS').

This is an excellent paper, by the way, and will go down in history as one of the seminal papers in this area (or in AI in general). Thank you for the great work by your team (I also look forward to getting access to the TTS and ASR variants).
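To make the delay idea concrete: it is just a relative shift between the text and audio streams, so whichever stream leads can condition the other. A toy illustration of the shift itself (made-up token values, not kyutai code):

```python
PAD = 0  # made-up padding token


def apply_delay(stream, delay, pad=PAD):
    """Right-shift a token stream by `delay` steps, padding the front."""
    return [pad] * delay + stream


text = [101, 102, 103, 104]   # toy text tokens
audio = [7, 8, 9, 10]         # toy audio frame tokens

# TTS-like setup: text leads, audio is delayed, so each audio frame
# can condition on text that is already visible.
print("TTS text :", text)
print("TTS audio:", apply_delay(audio, delay=2))

# ASR-like setup: audio leads, text is delayed, so each text token
# can condition on audio that has already been heard.
print("ASR text :", apply_delay(text, delay=2))
print("ASR audio:", audio)
```

As noted above, though, this shift alone does not turn the released Moshi weights into a TTS or ASR system; per Section 5.7, those are separately trained models.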

alkeryn commented 1 month ago

@girishp1983 honestly, if they do not release the training code i'm considering rewriting it myself; i find the model pretty lackluster compared to what the architecture is in theory capable of.

having a smarter model and more human-like voices and expression would be great, especially if you can tune it to imitate what you want.

also, it would be nice to be able to extend beyond the 5 minutes, even if that means leaving context behind.

lifeiteng commented 1 day ago

@alkeryn @LaurentMazare Great work.

I had a question about "Multi-Stream TTS" in Appendix C: how do you put the text from the two speakers into the single text stream?

[Screenshot: extract from Appendix C, "Multi-Stream TTS", of the Moshi paper]

LaurentMazare commented 1 day ago

> @alkeryn @LaurentMazare Great work.
>
> I had a question about "Multi-Stream TTS" in Appendix C: how do you put the text from the two speakers into the single text stream?

As mentioned in the extract you quoted, the <bos> and <eos> tokens are used to separate the two speakers, e.g. you would feed the TTS something like:

<bos>Text for speaker1<eos>Text for speaker2<bos>Second text for speaker1...

The TTS ends up being multi-stream for the audio but uses a single text stream, so it can generate conversations, but you cannot use it to ask for the two speakers to speak at the same time.
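A minimal sketch of assembling such a text stream (the `build_text_stream` helper is hypothetical, just to illustrate the convention above):

```python
BOS, EOS = "<bos>", "<eos>"  # literal markers here; in practice, special token ids


def build_text_stream(turns):
    """`turns` is a list of (speaker, text) pairs in conversation order.
    Following the example above, speaker 1's turns are opened with <bos>
    and speaker 2's with <eos>, so the alternating markers indicate which
    audio stream should be speaking."""
    parts = []
    for speaker, text in turns:
        parts.append((BOS if speaker == 1 else EOS) + text)
    return "".join(parts)


print(build_text_stream([
    (1, "Text for speaker1"),
    (2, "Text for speaker2"),
    (1, "Second text for speaker1"),
]))
# -> <bos>Text for speaker1<eos>Text for speaker2<bos>Second text for speaker1
```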