Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice
Conversion
summary: Text-to-speech (TTS) and voice conversion (VC) are two different tasks both
aiming at generating high quality speaking voice according to different input
modality. Due to their similarity, this paper proposes UnifySpeech, which
brings TTS and VC into a unified framework for the first time. The model is
based on the assumption that speech can be decoupled into three independent
components: content information, speaker information, prosody information. Both
TTS and VC can be regarded as mining these three parts of information from the
input and completing the reconstruction of speech. For TTS, the speech content
information is derived from the text, while in VC it's derived from the source
speech, so all the remaining units are shared except for the speech content
extraction module in the two tasks. We applied vector quantization and a domain
constraint to bridge the gap between the content domains of TTS and VC.
Objective and subjective evaluations show that by combining the two tasks, TTS
obtains better speaker modeling ability while VC gets hold of impressive speech
content decoupling capability.
Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activities.
Thank you so much.
id: http://arxiv.org/abs/2301.03801v1
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.