Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: ControlVC: Zero-Shot Voice Conversion with Time-Varying Controls on
Pitch and Rhythm
summary: Recent developments in neural speech synthesis and vocoding have sparked a
renewed interest in voice conversion (VC). Beyond timbre transfer, achieving
controllability on para-linguistic parameters such as pitch and rhythm is
critical in deploying VC systems in many application scenarios. Existing
studies, however, either only provide utterance-level global control or lack
interpretability on the controls. In this paper, we propose ControlVC, the
first neural voice conversion system that achieves time-varying controls on
pitch and rhythm. ControlVC uses pre-trained encoders to compute pitch
embeddings and linguistic embeddings from the source utterance and speaker
embeddings from the target utterance. These embeddings are then concatenated
and converted to speech using a vocoder. It achieves rhythm control through
TD-PSOLA pre-processing on the source utterance, and achieves pitch control by
manipulating the pitch contour before feeding it to the pitch encoder.
Systematic subjective and objective evaluations are conducted to assess the
speech quality and controllability. Results show that, on non-parallel and
zero-shot conversion tasks, ControlVC significantly outperforms two other
self-constructed baselines on speech quality, and it can successfully achieve
time-varying pitch control.
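
As a rough illustration of what "manipulating the pitch contour before feeding it to the pitch encoder" could look like, here is a minimal Python sketch. It is not ControlVC's actual code. The function names `ramp` and `shift_pitch_contour`, the semitone-based control curve, and the convention that unvoiced frames carry an F0 of 0 are all assumptions made for this example.

```python
from typing import List, Sequence


def ramp(start: float, end: float, n_frames: int) -> List[float]:
    """Hypothetical helper: linear per-frame control curve, in semitones."""
    if n_frames <= 0:
        return []
    if n_frames == 1:
        return [start]
    step = (end - start) / (n_frames - 1)
    return [start + i * step for i in range(n_frames)]


def shift_pitch_contour(
    f0_hz: Sequence[float], semitones: Sequence[float]
) -> List[float]:
    """Apply a per-frame (time-varying) pitch shift to an F0 contour.

    Assumption: frames with F0 <= 0 are unvoiced and are left at 0.
    A shift of s semitones scales the frequency by 2 ** (s / 12).
    """
    if len(f0_hz) != len(semitones):
        raise ValueError("f0_hz and semitones must have the same length")
    return [
        f * 2.0 ** (s / 12.0) if f > 0 else 0.0
        for f, s in zip(f0_hz, semitones)
    ]


if __name__ == "__main__":
    # Toy contour: five frames, the third one unvoiced.
    contour = [100.0, 110.0, 0.0, 120.0, 130.0]
    # Glide from no shift up to one octave over the utterance.
    control = ramp(0.0, 12.0, len(contour))
    print(shift_pitch_contour(contour, control))
```

In a system like the one described, the modified contour would then be passed to the pitch encoder in place of the original one; a global (utterance-level) control is the special case where every frame receives the same shift.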
Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activities.
Thank you so much.
id: http://arxiv.org/abs/2209.11866v1
judge
Write [vclab::confirmed] or [vclab::excluded] in comment.