openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0
7.3k stars 2.27k forks

[Feature Request]: Support Moshi speech-text foundation and full-duplex spoken dialogue framework #26845

Open atlury opened 1 month ago

atlury commented 1 month ago

Request Description

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec. Mimi compresses 24 kHz audio down to a 12.5 Hz representation with a bandwidth of 1.1 kbps, in a fully streaming manner (latency of 80 ms, the frame size), yet performs better than existing non-streaming codecs like SpeechTokenizer (50 Hz, 4 kbps) or SemantiCodec (50 Hz, 1.3 kbps).
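The figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check using the numbers in the description; the constant and variable names are made up for illustration and do not come from the Moshi code base.

```python
# Sanity check of the Mimi numbers quoted in the issue description.
# All names here are illustrative; only the three constants come from the text.
SAMPLE_RATE_HZ = 24_000   # input audio sample rate
FRAME_RATE_HZ = 12.5      # rate of the compressed representation
BITRATE_BPS = 1_100       # 1.1 kbps

# How many audio samples are folded into one codec frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ

# Duration of one frame in milliseconds (the streaming latency).
frame_ms = 1000 / FRAME_RATE_HZ

# Bit budget available to describe one frame.
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ

print(f"samples per frame: {samples_per_frame:.0f}")
print(f"frame duration:    {frame_ms:.0f} ms")
print(f"bits per frame:    {bits_per_frame:.0f}")
```

The frame duration works out to the 80 ms latency stated in the description, which is what makes the codec usable in a streaming, frame-by-frame pipeline.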

More details at https://kyutai.org/Moshi.pdf and https://github.com/kyutai-labs/moshi

Can we support this?

Feature Use Case

It's a novel spoken dialogue framework that will be very useful for demonstrating the capabilities of OpenVINO and IA platforms.

slyalin commented 1 month ago

While I suppose this request is more about enabling OpenVINO inference in the Moshi framework itself, by supporting particular models and integrating the inference code, it also makes sense to create a request for online full-duplex speech/text capabilities in the https://github.com/openvinotoolkit/openvino.genai component, where we are implementing E2E pipelines. @ilya-lavrenov, @andrei-kochin, FYI.

@atlury, have you tried to integrate OpenVINO into Moshi, or to run the models from there, and failed due to limitations in OpenVINO?

atlury commented 1 month ago

@slyalin I haven't tried yet. I already have too much work testing vision models (Qwen2-VL, Llama 3.2 Vision) with OpenVINO, so I have not had time. I will try. Shall I make a request there?

slyalin commented 1 month ago

> Shall I make a request there?

It would be nice if you could make it in the form of an API proposal; that would give a more actionable path than just an idea. You can review the WIP Whisper pipeline and the sample here: https://github.com/openvinotoolkit/openvino.genai/blob/master/samples/python/whisper_speech_recognition/whisper_speech_recognition.py. This is the closest match, but it is still far from the same thing.
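To make the idea of an API proposal concrete, here is a rough sketch of the shape such a proposal might take. Everything in it is hypothetical: `FullDuplexPipeline`, `DuplexStep`, `EchoPipeline`, and `run` are invented names, nothing like them exists in openvino.genai, and the echo backend is only a stand-in so the sketch runs without any model. The frame size assumes the 24 kHz / 80 ms figures from the issue description.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Iterator, Protocol, Sequence

# Assumption: one frame is 80 ms of 24 kHz audio, per the issue description.
FRAME_SAMPLES = 1920


@dataclass
class DuplexStep:
    """Hypothetical result of one streaming step: audio and text produced together."""
    audio_out: list[float]
    text_out: str


class FullDuplexPipeline(Protocol):
    """Hypothetical interface: consume one input frame, emit one output frame."""

    def step(self, audio_in: Sequence[float]) -> DuplexStep: ...


class EchoPipeline:
    """Stand-in backend that echoes the input audio; no model is involved."""

    def step(self, audio_in: Sequence[float]) -> DuplexStep:
        if len(audio_in) != FRAME_SAMPLES:
            raise ValueError(f"expected {FRAME_SAMPLES} samples, got {len(audio_in)}")
        return DuplexStep(audio_out=list(audio_in), text_out="")


def run(pipeline: FullDuplexPipeline, frames: Iterable[Sequence[float]]) -> Iterator[DuplexStep]:
    """Drive the pipeline frame by frame, yielding output as soon as it is ready."""
    for frame in frames:
        yield pipeline.step(frame)


# Usage: feed three silent frames through the stand-in backend.
silent_frames = [[0.0] * FRAME_SAMPLES for _ in range(3)]
results = list(run(EchoPipeline(), silent_frames))
print(len(results), "steps,", len(results[0].audio_out), "samples each")
```

The main difference from a Whisper-style pipeline is that input and output are interleaved per frame rather than exchanged as one complete utterance, which is why the sketch uses a `step` method instead of a single `generate` call.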

Aryan8912 commented 1 week ago

Sir, may I work with you to solve this bug?