Open atlury opened 1 month ago
While I suppose this request is more about enabling OpenVINO inference in the Moshi framework itself (supporting the particular models and integrating the inference code), it also makes sense to create a request for online full-duplex speech/text capabilities in the https://github.com/openvinotoolkit/openvino.genai component, where we are implementing E2E pipelines. @ilya-lavrenov, @andrei-kochin, FYI.
@atlury, have you tried to integrate OpenVINO into Moshi, or to run the models from there, and failed due to limitations in OpenVINO?
@slyalin I haven't tried yet. There is already too much work testing the vision models Qwen2-VL and Llama 3.2 Vision with OpenVINO, so I have not had time. I will try. Shall I make a request there?
If you could make it in the form of an API proposal, that would give a more actionable path beyond just an idea. You can review the WIP Whisper pipeline and its sample here: https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/whisper_speech_recognition/whisper_speech_recognition.py. This is the closest match, but it is still far from being the same.
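For context, the linked Whisper sample boils down to a one-shot call, while a Moshi-style pipeline would need an incremental, stateful loop. In the sketch below, the first half follows the existing sample; everything involving `SpeechDialoguePipeline` and `step()` is a hypothetical proposal, not an existing openvino.genai API:

```python
import librosa
import openvino_genai

# Existing pattern (per the linked Whisper sample): one finished utterance in,
# one transcription out.
raw_speech, _ = librosa.load("sample.wav", sr=16000)
pipe = openvino_genai.WhisperPipeline("whisper-base-ov", "CPU")
print(pipe.generate(raw_speech.tolist()))

# Hypothetical full-duplex proposal (SpeechDialoguePipeline and step() do NOT
# exist in openvino.genai): audio goes in and comes out in 80 ms Mimi frames,
# so the pipeline would have to keep codec and LM state across calls.
# dialogue = openvino_genai.SpeechDialoguePipeline("moshi-ov-dir", "CPU")
# for frame in capture_frames():       # hypothetical: 1920 samples at 24 kHz
#     out = dialogue.step(frame)       # hypothetical incremental call
#     playback(out.audio_frame)        # model speaks while still listening
#     print(out.text_delta, end="", flush=True)
```

The key difference from `WhisperPipeline.generate()` is that the streaming state must survive between calls, which is what makes full-duplex dialogue a new pipeline shape rather than a variant of speech recognition.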
Sir, may I work with you to solve this bug?
Request Description
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec. Mimi compresses 24 kHz audio down to a 12.5 Hz representation with a bandwidth of 1.1 kbps, in a fully streaming manner (80 ms latency, one frame), yet performs better than existing non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) or SemantiCodec (50 Hz, 1.3 kbps).
More details at https://kyutai.org/Moshi.pdf and https://github.com/kyutai-labs/moshi
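To give a sense of the streaming contract described above, here is a minimal sketch adapted from the kyutai-labs/moshi README; `loaders.get_mimi`, `loaders.DEFAULT_REPO`, `loaders.MIMI_NAME`, and `mimi.streaming` are taken from that repo and may have changed since:

```python
import torch
from huggingface_hub import hf_hub_download
from moshi.models import loaders

# Download the Mimi codec weights and build the model (per the Moshi README).
mimi_weight = hf_hub_download(loaders.DEFAULT_REPO, loaders.MIMI_NAME)
mimi = loaders.get_mimi(mimi_weight, device="cpu")
mimi.set_num_codebooks(8)  # Moshi itself uses 8 of Mimi's codebooks

# 24 kHz input; one Mimi frame is 24000 / 12.5 = 1920 samples = 80 ms.
wav = torch.randn(1, 1, 24000)  # [batch, channels=1, samples]
frame_size = int(mimi.sample_rate / mimi.frame_rate)

# Frame-by-frame encoding: this loop, with its persistent codec state, is
# what an OpenVINO port would have to reproduce to preserve the 80 ms latency.
all_codes = []
with torch.no_grad(), mimi.streaming(batch_size=1):
    for offset in range(0, wav.shape[-1], frame_size):
        frame = wav[:, :, offset:offset + frame_size]
        all_codes.append(mimi.encode(frame))  # [batch, codebooks, 1] per frame
```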
Can we support this?
Feature Use Case
It's a novel spoken dialogue framework that would be very useful for demonstrating the capabilities of OpenVINO and IA platforms.