Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: Limited Data Emotional Voice Conversion Leveraging Text-to-Speech:
Two-stage Sequence-to-Sequence Training
summary: Emotional voice conversion (EVC) aims to change the emotional state of an
utterance while preserving the linguistic content and speaker identity. In this
paper, we propose a novel 2-stage training strategy for sequence-to-sequence
emotional voice conversion with a limited amount of emotional speech data. We
note that the proposed EVC framework leverages text-to-speech (TTS) as they
share a common goal that is to generate high-quality expressive voice. In stage
1, we perform style initialization with a multi-speaker TTS corpus, to
disentangle speaking style and linguistic content. In stage 2, we perform
emotion training with a limited amount of emotional speech data, to learn how
to disentangle emotional style and linguistic information from the speech. The
proposed framework can perform both spectrum and prosody conversion and
achieves significant improvement over the state-of-the-art baselines in both
objective and subjective evaluation.
Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activities.
Thank you so much.
id: http://arxiv.org/abs/2103.16809v1
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.