Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: Assem-VC: Realistic Voice Conversion by Assembling Modern Speech
Synthesis Techniques
summary: In this paper, we pose the current state-of-the-art voice conversion (VC)
systems as two-encoder-one-decoder models. After comparing these models, we
combine the best features and propose Assem-VC, a new state-of-the-art
any-to-many non-parallel VC system. This paper also introduces the GTA
finetuning in VC, which significantly improves the quality and the speaker
similarity of the outputs. Assem-VC outperforms the previous state-of-the-art
approaches in both the naturalness and the speaker similarity on the VCTK
dataset. As an objective result, the degree of speaker disentanglement of
features such as phonetic posteriorgrams (PPG) is also explored. Our
investigation indicates that many-to-many VC results are no longer distinct
from human speech and similar quality can be achieved with any-to-many models.
Audio samples are available at https://mindslab-ai.github.io/assem-vc/
Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activity.
id: http://arxiv.org/abs/2104.00931v1
judge
Write [vclab::confirmed] or [vclab::excluded] in comment.