Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation
summary: Voice Conversion (VC) converts the voice of a source speech to that of a
target while maintaining the source's content. Speech can be mainly decomposed
into four components: content, timbre, rhythm and pitch. Unfortunately, most
related works only take into account content and timbre, which results in less
natural speech. Some recent works are able to disentangle speech into several
components, but they require laborious bottleneck tuning or various
hand-crafted features, each assumed to contain disentangled speech information.
In this paper, we propose a VC model that can automatically disentangle speech
into four components using only two augmentation functions, without the
requirement of multiple hand-crafted features or laborious bottleneck tuning.
The proposed model is straightforward yet efficient, and the empirical results
demonstrate that our model achieves better performance than the baseline in
terms of disentanglement effectiveness and speech naturalness.
Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activity.
Thank you so much.
id: http://arxiv.org/abs/2306.12259v1
judge
Write [vclab::confirmed] or [vclab::excluded] in comment.