Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic
Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear
Prediction
summary: This paper presents a low-latency real-time (LLRT) non-parallel voice
conversion (VC) framework based on cyclic variational autoencoder (CycleVAE)
and multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a
robust non-parallel multispeaker spectral model, which utilizes a
speaker-independent latent space and a speaker-dependent code to generate
reconstructed/converted spectral features given the spectral features of an
input speaker. On the other hand, MWDLP is an efficient and a high-quality
neural vocoder that can handle multispeaker data and generate speech waveform
for LLRT applications with CPU. To accommodate LLRT constraint with CPU, we
propose a novel CycleVAE framework that utilizes mel-spectrogram as spectral
features and is built with a sparse network architecture. Further, to improve
the modeling performance, we also propose a novel fine-tuning procedure that
refines the frame-rate CycleVAE network by utilizing the waveform loss from the
MWDLP network. The experimental results demonstrate that the proposed framework
achieves high-performance VC, while allowing for LLRT usage with a single-core
of $2.1$--$2.7$~GHz CPU on a real-time factor of $0.87$--$0.95$, including
input/output, feature extraction, on a frame shift of $10$ ms, a window length
of $27.5$ ms, and $2$ lookup frames.
Thunk you very much for contribution!
Your judgement is refrected in arXivSearches.json, and is going to be used for VCLab's activity.
Thunk you so much.
Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction
summary: This paper presents a low-latency real-time (LLRT) non-parallel voice conversion (VC) framework based on cyclic variational autoencoder (CycleVAE) and multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a robust non-parallel multispeaker spectral model, which utilizes a speaker-independent latent space and a speaker-dependent code to generate reconstructed/converted spectral features given the spectral features of an input speaker. On the other hand, MWDLP is an efficient and a high-quality neural vocoder that can handle multispeaker data and generate speech waveform for LLRT applications with CPU. To accommodate LLRT constraint with CPU, we propose a novel CycleVAE framework that utilizes mel-spectrogram as spectral features and is built with a sparse network architecture. Further, to improve the modeling performance, we also propose a novel fine-tuning procedure that refines the frame-rate CycleVAE network by utilizing the waveform loss from the MWDLP network. The experimental results demonstrate that the proposed framework achieves high-performance VC, while allowing for LLRT usage with a single-core of $2.1$--$2.7$~GHz CPU on a real-time factor of $0.87$--$0.95$, including input/output, feature extraction, on a frame shift of $10$ ms, a window length of $27.5$ ms, and $2$ lookup frames.
id: http://arxiv.org/abs/2105.09858v1
judge
Write [vclab::confirmed] or [vclab::excluded] in comment.