Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training
summary: One-shot voice conversion (VC) aims to change the timbre of any source speech
to match that of an unseen target speaker given only a single speech sample.
Existing style-transfer-based VC methods rely on speech representation
disentanglement and struggle to encode each speech component accurately and
independently and to recompose the components effectively into converted speech.
To tackle this, we propose Pureformer-VC, which uses Conformer blocks to
build a disentangled encoder and Zipformer blocks to build a style-transfer
decoder as the generator. In the decoder, styleformer blocks integrate speaker
characteristics into the generated speech. The model uses a generative VAE loss
for encoding the components and a triplet loss for unsupervised discriminative
training. We apply the styleformer method to Zipformer's shared weights for
style transfer. Experimental results show that the proposed model achieves
comparable subjective scores and improved objective metrics compared with
existing methods in a one-shot voice conversion scenario.
id: http://arxiv.org/abs/2409.01668v1
judge
Write [vclab::confirmed] or [vclab::excluded] in comment.