Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding
summary: Conventional vocoders are commonly used as analysis tools to provide
interpretable features for downstream tasks such as speech synthesis and voice
conversion. They are built under certain assumptions about the signals
following signal processing principle, therefore, not easily generalizable to
different audio, for example, from speech to singing. In this paper, we propose
a deep neural analyzer, denoted as DeepA - a neural vocoder that extracts F0
and timbre/aperiodicity encoding from the input speech that emulate those
defined in conventional vocoders. Therefore, the resulting parameters are more
interpretable than other latent neural representations. At the same time, as
the deep neural analyzer is learnable, it is expected to be more accurate for
signal reconstruction and manipulation, and generalizable from speech to
singing. The proposed neural analyzer is built based on a variational
autoencoder (VAE) architecture. We show that DeepA improves F0 estimation over
the conventional vocoder (WORLD). To our best knowledge, this is the first
study dedicated to the development of a neural framework for extracting
learnable vocoder-like parameters.
Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activities.
Thank you so much.
id: http://arxiv.org/abs/2110.06434v1
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.