Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: WavThruVec: Latent speech representation as intermediate features for
neural speech synthesis
summary: Recent advances in neural text-to-speech research have been dominated by
two-stage pipelines utilizing low-level intermediate speech representation such
as mel-spectrograms. However, such predetermined features are fundamentally
limited, because they do not allow to exploit the full potential of a
data-driven approach through learning hidden representations. For this reason,
several end-to-end methods have been proposed. However, such models are harder
to train and require a large number of high-quality recordings with
transcriptions. Here, we propose WavThruVec - a two-stage architecture that
resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as
intermediate speech representation. Since these hidden activations provide
high-level linguistic features, they are more robust to noise. That allows us
to utilize annotated speech datasets of a lower quality to train the
first-stage module. At the same time, the second-stage component can be trained
on large-scale untranscribed audio corpora, as Wav2Vec 2.0 embeddings are
time-aligned and speaker-independent. This results in an increased
generalization capability to out-of-vocabulary words, as well as to a better
generalization to unseen speakers. We show that the proposed model not only
matches the quality of state-of-the-art neural models, but also presents useful
properties enabling tasks like voice conversion or zero-shot synthesis.
Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activities.
Thank you so much.
id: http://arxiv.org/abs/2203.16930v1
judge
Write [vclab::confirmed] or [vclab::excluded] in comment.