Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: Disentangled Feature Learning for Real-Time Neural Speech Coding
summary: Recently, end-to-end neural audio/speech coding has shown great potential to outperform traditional audio codecs based on signal analysis. This is mostly achieved by following the VQ-VAE paradigm, in which blind features are learned, vector-quantized, and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specifically, global speaker-identity features and local content features are learned in a disentangled manner to represent speech. Such a compact feature decomposition not only achieves better coding efficiency by exploiting bit allocation among the different features, but also provides the flexibility to perform audio editing in the embedding space, such as voice conversion in real-time communications. Both subjective and objective results demonstrate its coding efficiency, and we find that the learned disentangled features achieve performance on any-to-any voice conversion comparable to that of modern self-supervised speech representation learning models, with far fewer parameters and lower latency, showing the potential of our neural coding framework.
id: http://arxiv.org/abs/2211.11960v1
judge
Write [vclab::confirmed] or [vclab::excluded] in comment.
Thank you very much for your contribution! Your judgement is reflected in arXivSearches.json and will be used for VCLab's activity. Thank you so much.
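To make the mechanism described in the abstract concrete when judging voice-conversion relevance, here is a minimal sketch of the idea: speech is encoded into a global speaker embedding plus vector-quantized local content codes, and any-to-any voice conversion amounts to decoding the source content with a target speaker embedding. Everything below (the ToyDisentangledCodec class, module choices, shapes) is an illustrative assumption, not the paper's actual architecture.

```python
# Toy sketch of a disentangled neural speech codec, for intuition only.
# All names, shapes, and module choices are assumptions made for
# illustration; this is not the authors' released model.
import torch
import torch.nn as nn


class ToyDisentangledCodec(nn.Module):
    def __init__(self, n_mels=80, content_dim=64, speaker_dim=32, codebook_size=256):
        super().__init__()
        # Local content encoder: one feature vector per frame, later vector-quantized.
        self.content_enc = nn.Conv1d(n_mels, content_dim, kernel_size=3, padding=1)
        # Global speaker encoder: mean-pooled over time into one utterance-level embedding.
        self.speaker_enc = nn.Conv1d(n_mels, speaker_dim, kernel_size=3, padding=1)
        # VQ codebook for the content stream (straight-through training omitted for brevity).
        self.codebook = nn.Embedding(codebook_size, content_dim)
        # Decoder conditions the local content codes on the global speaker embedding.
        self.dec = nn.Conv1d(content_dim + speaker_dim, n_mels, kernel_size=3, padding=1)

    def encode(self, mel):  # mel: (batch, n_mels, frames)
        content = self.content_enc(mel)               # (batch, content_dim, frames)
        speaker = self.speaker_enc(mel).mean(dim=-1)  # (batch, speaker_dim), global
        # Quantize each content frame to its nearest codebook entry.
        dists = torch.cdist(content.transpose(1, 2), self.codebook.weight.unsqueeze(0))
        codes = dists.argmin(dim=-1)                  # (batch, frames) integer indices
        return codes, speaker

    def decode(self, codes, speaker):
        content = self.codebook(codes).transpose(1, 2)     # (batch, content_dim, frames)
        spk = speaker.unsqueeze(-1).expand(-1, -1, codes.shape[1])
        return self.dec(torch.cat([content, spk], dim=1))  # (batch, n_mels, frames)


# Any-to-any voice conversion by swapping the speaker embedding at decode time:
codec = ToyDisentangledCodec()
source, target = torch.randn(1, 80, 100), torch.randn(1, 80, 100)
codes, _ = codec.encode(source)              # content codes of the source utterance
_, target_spk = codec.encode(target)         # speaker embedding of the target utterance
converted = codec.decode(codes, target_spk)  # source content in the target voice
```

Decoding the same content codes with a different speaker embedding is exactly the embedding-space editing the abstract refers to, which is why the paper is relevant to voice conversion.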