Thank you very much for your contribution!
Your judgement will be reflected in arXivSearches.json and used for VCLab's activity.
Thank you so much.
Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
summary: YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.
id: http://arxiv.org/abs/2112.02418v1
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.
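For example, if you judge that the paper does cover voice conversion, reply with a comment containing only [vclab::confirmed]; otherwise, reply with [vclab::excluded].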