hhguo / SoCodec

Ultra-low-bitrate Speech Codec for Speech Language Modeling Applications
MIT License

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model based Text-To-Speech Synthesis

Haohan Guo, Fenglong Xie, Kun Xie, Dongchao Yang, Dake Guo, Xixin Wu, Helen Meng

This repository provides inference scripts for SoCodec, an ultra-low-bitrate speech codec designed for speech language models and introduced in the paper of the same name.

Paper
📈 Demo Site
⚙ Model Weights

👉 SoCodec compresses audio into discrete codes at an ultra-low bitrate of 0.47 kbps with a frame shift of 120 ms.
👌 It can serve as a drop-in replacement for EnCodec or other multi-stream codecs in speech language modeling applications.
📚 The released checkpoint currently supports Chinese only. A multilingual version is being trained.
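
The 0.47 kbps figure can be reproduced from the configuration of the released checkpoint (4 streams, 16384-entry codebooks, 120 ms frame shift). The sketch below assumes each frame emits one index per stream and that each index costs log2(codebook size) bits, with no entropy coding. The function name is illustrative and not part of the repository.

```python
import math

def codec_bitrate_bps(num_streams: int, codebook_size: int, frame_shift_s: float) -> float:
    """Bits per second for a codec that emits one index per stream per frame."""
    bits_per_frame = num_streams * math.log2(codebook_size)
    return bits_per_frame / frame_shift_s

# Values taken from socodec_16384x4_120ms_16khz_chinese.
bitrate = codec_bitrate_bps(num_streams=4, codebook_size=16384, frame_shift_s=0.120)
print(f"{bitrate / 1000:.2f} kbps")  # 0.47 kbps
```

With these settings a language model sees 4 tokens every 120 ms, i.e. about 33 tokens per second of speech.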

Installation

Clone the repository and download the pretrained checkpoints into ckpts/:

git clone https://github.com/hhguo/SoCodec
cd SoCodec
# Download the Chinese HuBERT checkpoint, the SoCodec checkpoint, and the mel vocoder
mkdir -p ckpts && cd ckpts
wget https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt
wget https://huggingface.co/hhguo/SoCodec/resolve/main/socodec_16384x4_120ms_16khz_chinese.safetensors
wget https://huggingface.co/hhguo/SoCodec/resolve/main/mel_vocoder_80dim_10ms_16khz.safetensors
# Return to the repository root before running example.py
cd ..
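
An interrupted download can leave a missing or empty file in ckpts/, so it may help to check the directory before running inference. The following is a minimal sketch: the file names come from the download URLs above, while the helper name and the assumption that all three files are needed in ckpts/ are ours, not something the repository documents.

```python
from pathlib import Path

# File names taken from the download URLs in the installation steps.
EXPECTED_CHECKPOINTS = (
    "chinese-hubert-large-fairseq-ckpt.pt",
    "socodec_16384x4_120ms_16khz_chinese.safetensors",
    "mel_vocoder_80dim_10ms_16khz.safetensors",
)

def missing_checkpoints(ckpt_dir="ckpts"):
    """Return the expected checkpoint files that are absent or empty."""
    root = Path(ckpt_dir)
    missing = []
    for name in EXPECTED_CHECKPOINTS:
        path = root / name
        # A zero-byte file usually means the download was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_checkpoints()` from the repository root returns an empty list when all three files are in place.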

Usage

# For analysis-synthesis
python example.py -i ground_truth.wav -o synthesis.wav
# For speech analysis
python example.py -i ground_truth.wav -o features.pt
# For token-to-audio synthesis
python example.py -i features.pt -o synthesis.wav
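
To encode a whole folder of recordings, the speech-analysis command above can be wrapped in a small loop. This sketch only relies on the `-i` and `-o` flags shown in the usage lines. The function names, the `dry_run` option, and the directory layout are hypothetical. It starts one process per file, so the model is presumably reloaded each time, which is simple but not fast.

```python
import subprocess
import sys
from pathlib import Path

def build_command(input_path, output_path, script="example.py"):
    """Build the argument list for one example.py call using its -i/-o flags."""
    return [sys.executable, script, "-i", str(input_path), "-o", str(output_path)]

def encode_directory(wav_dir, out_dir, dry_run=False):
    """Run speech analysis on every .wav in wav_dir, writing one .pt file each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        cmd = build_command(wav, out_dir / (wav.stem + ".pt"))
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if example.py exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

Run it from the repository root, for example `encode_directory("wavs", "features")`, where both directory names are placeholders. Passing `dry_run=True` returns the commands without executing them.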

Pretrained Models

We provide the pretrained models on Hugging Face Collections.

| Model Name | Frame Shift | Codebook Size | Number of Streams | Dataset |
| --- | --- | --- | --- | --- |
| socodec_16384x4_120ms_16khz_chinese | 120 ms | 16384 | 4 | WenetSpeech4TTS |

We also provide a pretrained vocoder that converts the mel spectrogram produced by SoCodec into a waveform.

| Model Name | Sampling Rate | Mel Bins | fmax | Upsampling Ratio | Dataset |
| --- | --- | --- | --- | --- | --- |
| mel_vocoder_80dim_10ms_16khz | 16 kHz | 80 | 8000 Hz | 160 | WenetSpeech4TTS |
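
The vocoder figures are mutually consistent, which a few lines of arithmetic make explicit. An upsampling ratio of 160 at 16 kHz corresponds to the 10 ms mel frame shift in the model name, and fmax of 8000 Hz is the Nyquist frequency at that sampling rate. That one 120 ms codec frame spans exactly 12 mel frames is our inference from these numbers, not something the repository states.

```python
SAMPLE_RATE_HZ = 16_000     # from the model name (16khz)
UPSAMPLING_RATIO = 160      # waveform samples generated per mel frame
CODEC_FRAME_SHIFT_MS = 120  # SoCodec frame shift

# Duration covered by one mel frame.
mel_frame_shift_ms = 1000 * UPSAMPLING_RATIO / SAMPLE_RATE_HZ
# Assumed number of mel frames spanned by one codec frame.
mel_frames_per_codec_frame = CODEC_FRAME_SHIFT_MS / mel_frame_shift_ms
# Waveform samples covered by one codec frame.
samples_per_codec_frame = SAMPLE_RATE_HZ * CODEC_FRAME_SHIFT_MS // 1000

print(mel_frame_shift_ms)          # 10.0
print(mel_frames_per_codec_frame)  # 12.0
print(samples_per_codec_frame)     # 1920
```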

References