Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in
Frames
summary: Non-parallel voice conversion (VC) is a technique for training voice
converters without a parallel corpus. Cycle-consistent adversarial
network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as
benchmark methods. However, owing to their insufficient ability to grasp
time-frequency structures, their application is limited to mel-cepstrum
conversion and not mel-spectrogram conversion despite recent advances in
mel-spectrogram vocoders. To overcome this, CycleGAN-VC3, an improved variant
of CycleGAN-VC2 that incorporates an additional module called time-frequency
adaptive normalization (TFAN), has been proposed. However, an increase in the
number of learned parameters is imposed. As an alternative, we propose
MaskCycleGAN-VC, which is another extension of CycleGAN-VC2 and is trained
using a novel auxiliary task called filling in frames (FIF). With FIF, we apply
a temporal mask to the input mel-spectrogram and encourage the converter to
fill in missing frames based on surrounding frames. This task allows the
converter to learn time-frequency structures in a self-supervised manner and
eliminates the need for an additional module such as TFAN. A subjective
evaluation of the naturalness and speaker similarity showed that
MaskCycleGAN-VC outperformed both CycleGAN-VC2 and CycleGAN-VC3 with a model
size similar to that of CycleGAN-VC2. Audio samples are available at
http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html.
Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activities.
Thank you so much.
id: http://arxiv.org/abs/2102.12841v1
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.