Please check whether this paper is about 'Voice Conversion' or not.
article info.
title: CPSP: Learning Speech Concepts From Phoneme Supervision
summary: For fine-grained generation and recognition tasks such as
minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic
speech recognition (ASR), the intermediate representation extracted from speech
should contain information that is between text coding and acoustic coding. The
linguistic content is salient, while the paralinguistic information such as
speaker identity and acoustic details should be removed. However, existing
methods for extracting fine-grained intermediate representations from speech
suffer from issues of excessive redundancy and dimension explosion.
Additionally, existing contrastive learning methods in the audio field focus on
extracting global descriptive information for downstream audio classification
tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these
issues, we propose a method named Contrastive Phoneme-Speech Pretraining
(CPSP), which uses three encoders, one decoder, and contrastive learning to
bring phoneme and speech into a joint multimodal space, learning how to connect
phoneme and speech at the frame level. The CPSP model is trained on 210k speech
and phoneme text pairs, achieving minimally-supervised TTS, VC, and ASR. The
proposed CPSP method offers a promising solution for fine-grained generation
and recognition downstream tasks in speech processing. We provide a website
with audio samples.
Thank you very much for your contribution!
Your judgement is reflected in arXivSearches.json and will be used for VCLab's activity.
Thank you so much.
id: http://arxiv.org/abs/2309.00424v1
judge
Write [vclab::confirmed] or [vclab::excluded] in a comment.