This repository contains inference scripts for SoCodec, an ultra-low-bitrate speech codec designed for speech language models. SoCodec was introduced in the paper "SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model based Text-To-Speech Synthesis".
Paper
Demo Site
Model Weights
With SoCodec, you can compress audio into discrete codes at an ultra-low bitrate of 0.47 kbps with a 120ms frame shift, which keeps token sequences short.
It can be used as a drop-in replacement for EnCodec or other multi-stream codecs in speech language modeling applications.
The released checkpoint currently supports only Chinese. Training of a multilingual version is in progress.
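The 0.47 kbps figure can be reproduced from the released checkpoint's configuration (4 streams, codebook size 16384, 120ms frame shift). The sketch below assumes each stream emits one token per frame from its own 16384-entry codebook; the function name is illustrative and not part of this repository.

```python
import math


def codec_bitrate_bps(num_streams: int, codebook_size: int, frame_shift_s: float) -> float:
    """Bits per second, assuming one token per stream per frame."""
    bits_per_token = math.log2(codebook_size)  # 16384 entries -> 14 bits
    frames_per_second = 1.0 / frame_shift_s    # 120 ms -> about 8.33 frames/s
    return num_streams * bits_per_token * frames_per_second


# Values read off the checkpoint name socodec_16384x4_120ms_16khz_chinese.
bitrate = codec_bitrate_bps(num_streams=4, codebook_size=16384, frame_shift_s=0.120)
print(f"{bitrate / 1000:.2f} kbps")  # prints "0.47 kbps"
```

At about 8.33 frames per second and 4 tokens per frame, one second of speech corresponds to roughly 33 tokens.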
Clone the repository and download the required checkpoints into ckpts/:
git clone https://github.com/hhguo/SoCodec
cd SoCodec
mkdir ckpts && cd ckpts
wget https://huggingface.co/TencentGameMate/chinese-hubert-large/resolve/main/chinese-hubert-large-fairseq-ckpt.pt
wget https://huggingface.co/hhguo/SoCodec/resolve/main/socodec_16384x4_120ms_16khz_chinese.safetensors
wget https://huggingface.co/hhguo/SoCodec/resolve/main/mel_vocoder_80dim_10ms_16khz.safetensors
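Interrupted downloads can leave missing or empty files, so it may help to check the checkpoint directory before running inference. This is a minimal sketch: the file names come from the download URLs above, while the helper name and the assumption that it runs from the repository root are illustrative.

```python
from pathlib import Path

# File names taken from the three download URLs.
EXPECTED_CHECKPOINTS = (
    "chinese-hubert-large-fairseq-ckpt.pt",
    "socodec_16384x4_120ms_16khz_chinese.safetensors",
    "mel_vocoder_80dim_10ms_16khz.safetensors",
)


def missing_checkpoints(ckpt_dir="ckpts"):
    """Return the expected checkpoint files that are absent or empty in ckpt_dir."""
    root = Path(ckpt_dir)
    return [
        name
        for name in EXPECTED_CHECKPOINTS
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_checkpoints()
    print("all checkpoints present" if not missing else f"missing: {missing}")
```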
# For analysis-synthesis
python example.py -i ground_truth.wav -o synthesis.wav
# For speech analysis
python example.py -i ground_truth.wav -o features.pt
# For token-to-audio synthesis
python example.py -i features.pt -o synthesis.wav
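To extract features for a whole directory, the speech-analysis command above can be wrapped in a short script. This sketch only relies on the `-i` and `-o` flags shown above and assumes `example.py` selects its mode from the file extensions, as the three commands suggest. The helper names and directory names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_analysis_commands(wav_dir, out_dir, script="example.py"):
    """Build one `example.py -i X.wav -o X.pt` command per WAV file in wav_dir."""
    out_root = Path(out_dir)
    out_root.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
    commands = []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        target = out_root / (wav.stem + ".pt")
        commands.append([sys.executable, script, "-i", str(wav), "-o", str(target)])
    return commands


def run_all(commands):
    """Run the commands one after another, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

For example, `run_all(build_analysis_commands("wavs", "features"))`, run from the repository root, would write one `.pt` file per input WAV into `features/`.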
We provide the pretrained models on Hugging Face Collections.
Model Name | Frame Shift | Codebook Size | Number of Streams | Dataset |
---|---|---|---|---|
socodec_16384x4_120ms_16khz_chinese | 120ms | 16384 | 4 | WenetSpeech4TTS |
We also provide a pretrained vocoder that converts the Mel spectrogram produced by SoCodec into a waveform.
Model Name | Frame Shift | Sample Rate | Mel Bins | fmax (Hz) | Upsampling Ratio | Dataset |
---|---|---|---|---|---|---|
mel_vocoder_80dim_10ms_16khz | 10ms | 16 kHz | 80 | 8000 | 160 | WenetSpeech4TTS |
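The upsampling ratio of 160 is the number of waveform samples per Mel frame: a 10ms frame shift at 16 kHz. The sketch below works through that arithmetic and relates it to the codec's 120ms frames. It assumes the SoCodec decoder emits Mel frames at the vocoder's 10ms shift, which the model names suggest but which is not stated explicitly here.

```python
SAMPLE_RATE_HZ = 16_000      # "16khz" in the model names
MEL_FRAME_SHIFT_MS = 10      # "10ms" in the vocoder name
CODEC_FRAME_SHIFT_MS = 120   # "120ms" in the codec name

# Waveform samples generated per Mel frame (the vocoder's upsampling ratio).
samples_per_mel_frame = SAMPLE_RATE_HZ * MEL_FRAME_SHIFT_MS // 1000

# Mel frames, and hence waveform samples, covered by one codec frame.
mel_frames_per_codec_frame = CODEC_FRAME_SHIFT_MS // MEL_FRAME_SHIFT_MS
samples_per_codec_frame = samples_per_mel_frame * mel_frames_per_codec_frame

print(samples_per_mel_frame, mel_frames_per_codec_frame, samples_per_codec_frame)
# prints: 160 12 1920
```

The fmax of 8000 Hz equals the Nyquist frequency of 16 kHz audio, so the Mel filterbank spans the full representable band.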