Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
summary: Full-duplex spoken dialogue systems represent a significant advance over
traditional turn-based dialogue systems, as they allow simultaneous bidirectional
communication, closely mirroring human-human interactions. However, achieving
low latency and natural interactions in full-duplex dialogue systems remains a
significant challenge, especially considering human conversation dynamics such
as interruptions, backchannels, and overlapping speech. In this paper, we
introduce OmniFlatten, a novel end-to-end GPT-based model for full-duplex
conversation, capable of effectively modeling the complex behaviors inherent to
natural conversations with low latency. To achieve full-duplex communication
capabilities, we propose a multi-stage post-training scheme that progressively
adapts a text-based large language model (LLM) backbone into a speech-text
dialogue LLM, capable of generating text and speech in real time, without
modifying the architecture of the backbone LLM. The training process comprises
three stages: modality alignment, half-duplex dialogue learning, and
full-duplex dialogue learning. Throughout all training stages, we standardize
the data using a flattening operation, which allows us to unify the training
methods and the model architecture across different modalities and tasks. Our
approach offers a straightforward modeling technique and a promising research
direction for developing efficient and natural end-to-end full-duplex spoken
dialogue systems. Audio samples of dialogues generated by OmniFlatten are
available at https://omniflatten.github.io/.
id: http://arxiv.org/abs/2410.17799v1
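Note: the abstract only names the flattening operation without detailing it. As a rough, hypothetical illustration (not taken from the paper), the sketch below shows one plausible way parallel speech/text token streams could be serialized chunk by chunk into a single flat sequence so that a standard decoder-only LLM can model them without architectural changes. The chunk size, stream names, padding id, and round-robin ordering are all illustrative assumptions.

```python
# Hypothetical sketch of a chunk-wise "flattening" operation. The chunking
# scheme, stream names, and padding are assumptions for illustration only and
# do not reproduce the exact serialization described in the OmniFlatten paper.
from itertools import zip_longest
from typing import Dict, List

PAD = -1  # hypothetical padding id for streams of unequal length


def flatten_streams(streams: Dict[str, List[int]], chunk_size: int = 5) -> List[int]:
    """Interleave fixed-size chunks from each token stream into one flat sequence."""
    # Split every stream into consecutive chunks of `chunk_size` tokens.
    chunked = {
        name: [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
        for name, tokens in streams.items()
    }
    flat: List[int] = []
    # Round-robin over the streams, one chunk at a time; exhausted streams
    # contribute padding chunks so the interleaving pattern stays regular.
    for chunk_group in zip_longest(*chunked.values(), fillvalue=[PAD] * chunk_size):
        for chunk in chunk_group:
            flat.extend(chunk + [PAD] * (chunk_size - len(chunk)))
    return flat


if __name__ == "__main__":
    # Two short example token streams flattened into one sequence that a
    # decoder-only model could consume autoregressively.
    demo = {
        "user_speech": [101, 102, 103, 104, 105, 106, 107],
        "assistant_speech": [201, 202, 203, 204],
    }
    print(flatten_streams(demo, chunk_size=4))
```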
judge
Write [vclab::confirmed] or [vclab::excluded] in the comment.