@Mbonea-Mjema afaik it's not using a TTS, and there are FOSS TTS models that would work much better than this anyway.
I also want emotion in the voice. Do you have any ideas on how I can achieve that?
It is possible to "trick" Moshi into doing zero-shot TTS (or ASR, by the way). The hack is explained in Appendix C of the paper.
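Very roughly, the idea is that Moshi models the text and audio streams jointly, so you can force one stream and sample the other, with the sampled stream delayed by a couple of seconds. Here is a minimal sketch of the TTS direction, assuming a hypothetical `model.step(text_token=...)` interface rather than the actual Moshi API:

```python
# Hypothetical sketch of the Appendix C "trick" (not the real Moshi API):
# force the text stream and sample only the audio stream, with the audio
# delayed by a fixed number of frames relative to the text.

def zero_shot_tts(model, text_tokens, pad_token, delay_steps):
    """Force the text stream, sample the audio stream.

    `model.step` is a made-up interface that consumes one forced text token
    and returns the sampled audio tokens for the current frame.
    """
    audio_frames = []
    # Run for delay_steps extra frames so the delayed audio can catch up
    # with the end of the text; pad the text stream for those extra steps.
    forced_text = list(text_tokens) + [pad_token] * delay_steps
    for t, text_token in enumerate(forced_text):
        frame = model.step(text_token=text_token)  # hypothetical call
        if t >= delay_steps:  # the first frames only cover the delay
            audio_frames.append(frame)
    return audio_frames
```

The ASR direction is the mirror image: force the audio streams and sample the text, with the delay applied to the text instead of the audio.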
@guii nice, though i don't think it'd be worth it, as there are much more human-like sounding TTS systems nowadays, either closed like ElevenLabs or open like Tortoise and many others.
As mentioned it's not possible with the currently released version of moshi. We have some TTS versions internally that we're pretty happy with and are considering releasing but have to be careful about how to release them (because of voice cloning possibilities for example) - no estimated timeline yet but we'll announce it on twitter/... if we release them at some point.
@LaurentMazare voice cloning can already be done with open-source tools (look up OpenVoice for example), without even talking about non-FOSS services like ElevenLabs. i don't think you'd be enabling anything that isn't already possible by open-sourcing your TTS. i do understand the concern though.
@LaurentMazare I read the Moshi paper. Very impressed with the level of detail. In some places, though, it gives the mistaken impression that TTS/ASR can be done with the same model (weights) by introducing a 2-second delay. However, later (in Section 5.7) it is mentioned that you trained separate TTS and ASR models (likely following the same architecture). It would be great if this were clarified earlier (in Section 3.4, under the subsection 'Deriving Streaming ASR and TTS').
This is an excellent paper, by the way, and will go down in history as one of the seminal papers in this area (or in AI in general). Thank you for your team's great work (I also look forward to getting access to the TTS and ASR variants).
@girishp1983 honestly, if they do not release the training code i'm considering rewriting it myself; i find the model pretty lackluster compared to what the architecture is in theory capable of.
having a smarter model and more human-like voices and expression would be great, especially if you can tune it to imitate what you want.
also it would be nice to be able to go beyond the 5-minute limit, even if that means leaving context behind.
@alkeryn @LaurentMazare Great work.
I had a question about "Multi-Stream TTS" in Appendix C: how do you put the text from the two speakers into the single text stream?
As mentioned in the extract you quoted, the <bos> and <eos> tokens are used to separate the two speakers, e.g. you would feed the TTS with something like:
<bos>Text for speaker1<eos>Text for speaker2<bos>Second text for speaker1...
The TTS ends up being multi-stream for the audio but uses a single text stream, so it can be used to generate conversations, but you cannot use it to ask for the two speakers to speak at the same time.
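For illustration, here is a tiny sketch (plain Python, no Moshi API involved) of how you could assemble that single text stream from a list of turns; the helper name and the turn format are made up for the example:

```python
# Toy helper: build the single text stream where <bos>/<eos> wrap the main
# speaker's turns and the other speaker's text sits between <eos> and the
# next <bos>.

def build_text_stream(turns):
    """turns: list of (speaker, text) pairs in conversation order, speaker in {1, 2}."""
    parts = []
    for speaker, text in turns:
        if speaker == 1:
            parts.append(f"<bos>{text}<eos>")  # main speaker is delimited
        else:
            parts.append(text)  # second speaker fills the gap until the next <bos>
    return "".join(parts)

print(build_text_stream([
    (1, "Text for speaker1"),
    (2, "Text for speaker2"),
    (1, "Second text for speaker1"),
]))
# <bos>Text for speaker1<eos>Text for speaker2<bos>Second text for speaker1<eos>
```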
I want to use the TTS, not the entire model. Is that possible?