Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activities.
Thank you so much.
Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: Adversarially learning disentangled speech representations for robust multi-factor voice conversion
summary: Factorizing speech into disentangled speech representations is vital for achieving highly controllable style transfer in voice conversion (VC). Conventional speech representation learning methods in VC factorize speech only into speaker and content, lacking controllability over other prosody-related factors. State-of-the-art speech representation learning methods for more speech factors use preliminary disentanglement algorithms such as random resampling and ad-hoc bottleneck layer size adjustment, which, however, can hardly ensure robust speech representation disentanglement. To increase the robustness of highly controllable style transfer over multiple factors in VC, we propose a disentangled speech representation learning framework based on adversarial learning. Four speech representations characterizing content, timbre, rhythm and pitch are extracted and further disentangled by an adversarial network inspired by BERT. The adversarial network is used to minimize the correlations between the speech representations by randomly masking one of the representations and predicting it from the others. A word prediction network is also adopted to learn a more informative content representation. Experimental results show that the proposed speech representation learning framework significantly improves the robustness of VC on multiple factors, increasing the conversion rate from 48.2% to 57.1% and exceeding the state-of-the-art method's ABX preference by 31.2%.
id: http://arxiv.org/abs/2102.00184v1
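The BERT-inspired mask-and-predict step described in the summary can be sketched as follows. This is a minimal illustration under assumed shapes, not the authors' implementation: the representation names and dimensions are hypothetical, and a least-squares linear map stands in for the adversarial predictor network.

```python
import numpy as np

# Four frame-level speech representations (hypothetical shapes):
# content, timbre, rhythm, pitch -- each (T, D) for illustration.
rng = np.random.default_rng(0)
T, D = 100, 8
reps = {
    "content": rng.standard_normal((T, D)),
    "timbre":  rng.standard_normal((T, D)),
    "rhythm":  rng.standard_normal((T, D)),
    "pitch":   rng.standard_normal((T, D)),
}

def mask_and_split(reps, rng):
    """BERT-style masking step: hide one representation at random;
    the adversarial predictor must reconstruct it from the rest."""
    names = sorted(reps)
    target_name = names[rng.integers(len(names))]
    target = reps[target_name]                # what the adversary predicts
    inputs = np.concatenate(
        [reps[n] for n in names if n != target_name], axis=-1
    )                                         # (T, 3*D) predictor input
    return target_name, inputs, target

name, x, y = mask_and_split(reps, rng)
# A least-squares linear "predictor" stands in for the adversarial network:
w, *_ = np.linalg.lstsq(x, y, rcond=None)
mse = float(np.mean((x @ w - y) ** 2))
# In adversarial training the encoders would be updated to MAXIMIZE this
# prediction error (i.e. minimize cross-representation predictability),
# while the predictor minimizes it -- driving the correlations between
# the four representations toward zero.
```

With independently drawn random representations, as here, the prediction error stays high; correlated representations would let the predictor succeed, which is exactly the signal the adversarial loss penalizes.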
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.