RepCodec: A Speech Representation Codec for Speech Tokenization
RepCodec is a speech tokenization method for converting a speech waveform into a sequence of discrete semantic tokens. The main idea is to train a representation codec which learns a vector quantization codebook through reconstructing the input speech representations from speech encoders like HuBERT or data2vec. Extensive experiments show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation. Also, RepCodec generalizes well across various speech encoders and languages.
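As a rough illustration of the vector quantization step mentioned above, the sketch below maps each representation frame to the index of its nearest codebook entry. It is a generic nearest-neighbor lookup in pure Python, not RepCodec's actual implementation (which places a learned encoder and projector in front of the codebook); the names `quantize`, `frames`, and `codebook` are hypothetical.

```python
def quantize(frames, codebook):
    """Return, for each frame, the index of the closest codebook vector.

    frames:   list of feature vectors (lists of floats)
    codebook: list of code vectors with the same dimensionality
    """
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every code vector.
        dists = [
            sum((a - b) ** 2 for a, b in zip(frame, code))
            for code in codebook
        ]
        # The token is the position of the nearest code vector.
        tokens.append(dists.index(min(dists)))
    return tokens


if __name__ == "__main__":
    toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
    toy_frames = [[0.1, 0.0], [0.9, 1.2]]
    print(quantize(toy_frames, toy_codebook))
```

In this toy setting the two frames fall closest to the first and second code vectors respectively, so each frame becomes a single integer token.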
Feature Type | Speech Data | RepCodec Model |
---|---|---|
HuBERT base layer 9 | Librispeech train-clean-100 | hubert_base_l9 |
HuBERT large layer 18 | Librispeech train-clean-100 | hubert_large_l18 |
data2vec base layer 6 | Librispeech train-clean-100 | data2vec_base_l6 |
data2vec large layer 18 | Librispeech train-clean-100 | data2vec_large_l18 |
Whisper medium layer 24 | Librispeech train-clean-100 | whisper_medium_l24 |
Whisper large-v2 layer 32 | Librispeech train-clean-100 | whisper_large_l32 |
First, install RepCodec:
git clone https://github.com/mct10/RepCodec.git
cd RepCodec
pip install .
We used Python 3.9.18 and PyTorch 1.12.1 to test the usage, but the code should be compatible with other recent Python and PyTorch versions.
We adapt the dump_hubert_feature.py script from fairseq to support dumping representations from data2vec, HuBERT, or Whisper encoders.
If you use our script (examples/dump_feature.py), please also install the following packages:
pip install npy_append_array soundfile
Additionally, if you want to dump representations from
data2vec or HuBERT: please follow fairseq's instructions to install the latest fairseq.
Whisper: please follow Whisper's instructions to install the latest Whisper.
Then, you can follow the given examples to dump representations:
# Example 1: dump from HuBERT base layer 9
# (for data2vec, simply change "model_type" to data2vec and "ckpt_path" to the path of data2vec model)
layer=9
python3 examples/dump_feature.py \
--model_type hubert \
--tsv_path /path/to/tsv/file \
--ckpt_path /path/to/HuBERT/model \
--layer ${layer} \
--feat_dir /dir/to/save/representations
# Example 2: dump from Whisper medium layer 24
layer=24
python3 examples/dump_feature.py \
--model_type whisper \
--tsv_path /path/to/tsv/file \
--whisper_root /directory/to/save/whisper/model \
--whisper_name medium \
--layer ${layer} \
--feat_dir /dir/to/save/representations
Explanations about the args:
model_type: choose from data2vec, hubert, and whisper.
tsv_path: path to the tsv file, which should have the following format:
/dir/to/dataset
path_of_utterance_1 number_of_frames
path_of_utterance_2 number_of_frames
You can follow this script to generate the tsv file.
For example, by running
python wav2vec_manifest.py \
/dir/to/LibriSpeech/dev-clean \
--dest /dir/to/manifest \
--ext flac \
--valid-percent 0
you can obtain dev-clean.tsv in /dir/to/manifest for LibriSpeech. (By default, the output file is named train.tsv, so remember to rename it.)
It should be similar to:
/dir/to/LibriSpeech/dev-clean
2277/149896/2277-149896-0026.flac 78720
2277/149896/2277-149896-0005.flac 89600
2277/149896/2277-149896-0033.flac 45520
ckpt_path: required for data2vec and HuBERT. You need to download the model from the data2vec website or the HuBERT website yourself. --ckpt_path is the path to the data2vec/HuBERT model.
whisper_root and whisper_name: BOTH --whisper_root and --whisper_name are required for Whisper. If there is no corresponding model in --whisper_root, the script will download it for you.
layer: the Transformer encoder layer from which the representations are extracted. It is 1-based. For example, if layer=9, the outputs of the 9th Transformer encoder layer are dumped. Range: [1, number of Transformer encoder layers].
feat_dir: the output representations will be saved to ${feat_dir}/0_1.npy and ${feat_dir}/0_1.len.
For other useful functionalities (e.g., sharding), please check the argument list in examples/dump_feature.py.
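To make the relationship between the tsv file and the dumped files concrete, here is a minimal sketch that pairs each utterance with its row range in the dumped array. It assumes the fairseq-style dump convention, which is not spelled out above: the .len file holds one frame count per line, in tsv order, and the .npy file concatenates all utterances along the first axis. The helper name `utterance_slices` is hypothetical.

```python
from pathlib import Path


def utterance_slices(tsv_path, len_path):
    """Return (root_dir, [(relative_path, start_row, end_row), ...]).

    Assumes one frame count per line in the .len file, listed in the
    same order as the utterances in the tsv file.
    """
    tsv_lines = Path(tsv_path).read_text().splitlines()
    root = tsv_lines[0]
    entries = [line for line in tsv_lines[1:] if line.strip()]
    lengths = [int(n) for n in Path(len_path).read_text().split()]
    if len(entries) != len(lengths):
        raise ValueError(
            f"{len(entries)} utterances in tsv but {len(lengths)} lengths"
        )

    slices = []
    start = 0
    for entry, n_frames in zip(entries, lengths):
        # Each tsv entry is "<relative path><whitespace><number_of_frames>".
        rel_path = entry.rsplit(None, 1)[0]
        slices.append((rel_path, start, start + n_frames))
        start += n_frames
    return root, slices
```

Under the same assumption, the features of one utterance would be rows `start:end` of the array returned by `numpy.load("0_1.npy", mmap_mode="r")`.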
We expect ${feat_dir}/0_1.npy and ${feat_dir}/0_1.len to be present in the provided directory /dir/to/representations.
Also, the tsv file should be the same as the one used in Representation Preparation.
repcodec /dir/to/representations \
--model /path/to/repcodec/model \
--tsv_path /path/to/tsv/file \
[--model_config_path /path/to/train/config] \
[--use_gpu] \
[--out_dir /path/to/output]
If you trained the model yourself following Training New RepCodec Models, please provide the training config file using --model_config_path.
If you use one of the models we provide, you can omit it.
This command will tokenize the representations and the output discrete tokens will be saved to ${out_dir}/tokens.
The tokens are in the same order as the provided tsv file.
An example of the output file:
/dir/to/LibriSpeech/dev-clean
2277/149896/2277-149896-0026.flac 696 696 198 198 198 498 ...
2277/149896/2277-149896-0005.flac 696 696 198 198 198 907 ...
2277/149896/2277-149896-0033.flac 696 696 198 198 198 696 ...
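The output format above can be read back with a few lines of Python. The following is a minimal sketch, assuming that utterance paths contain no whitespace and that every remaining field on a line is an integer token; the helper name `read_tokens` is hypothetical and not part of the repcodec package.

```python
def read_tokens(path):
    """Parse a token file into (root_dir, {relative_path: [token, ...]})."""
    with open(path) as fp:
        lines = fp.read().splitlines()

    # The first line is the dataset root, as in the tsv file.
    root = lines[0]
    tokens = {}
    for line in lines[1:]:
        if not line.strip():
            continue  # skip blank lines
        # First field is the utterance path, the rest are token IDs.
        rel_path, *ids = line.split()
        tokens[rel_path] = [int(i) for i in ids]
    return root, tokens
```

Because Python dictionaries preserve insertion order, iterating over the returned mapping visits utterances in the same order as the file.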
Under examples/tokens, we provide some token files as references. They are obtained from the LibriSpeech dev-clean subset using the 6 types of representations and the corresponding RepCodec models.
Your results should be very similar to ours.
import torch
import yaml
from repcodec.RepCodec import RepCodec
# for feature types of HuBERT base & data2vec base, please use repcodec_dim768.yaml;
# for feature types of HuBERT large & data2vec large & Whisper medium, please use repcodec_dim1024.yaml;
# for feature types of Whisper large-v2, please use repcodec_dim1280.yaml
config = "repcodec/configs/repcodec_dim768.yaml"
with open(config) as fp:
    conf = yaml.load(fp, Loader=yaml.FullLoader)
model = RepCodec(**conf)
model.load_state_dict(torch.load("./hubert_base_l9.pkl", map_location="cpu")["model"]["repcodec"])
model.quantizer.initial()
model.eval()
# input shape: (batch size, hidden dim, sequence length)
random_features = torch.randn(size=(1, 768, 100))
with torch.no_grad():
    x = model.encoder(random_features)
    z = model.projector(x)
    _, idx = model.quantizer.codebook.forward_index(z.transpose(2, 1))
    tokens = idx.cpu().data.numpy().tolist()[0]
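The comments in the snippet above pair each feature type with a config file. One way to avoid picking the wrong one is a small lookup table keyed by the model names from the table of released models. This is only a convenience sketch: `CONFIG_BY_MODEL` and `config_path_for` are hypothetical names, and the default directory simply mirrors the path used in the example.

```python
# Model name -> config file, following the comments in the usage example.
CONFIG_BY_MODEL = {
    "hubert_base_l9": "repcodec_dim768.yaml",
    "data2vec_base_l6": "repcodec_dim768.yaml",
    "hubert_large_l18": "repcodec_dim1024.yaml",
    "data2vec_large_l18": "repcodec_dim1024.yaml",
    "whisper_medium_l24": "repcodec_dim1024.yaml",
    "whisper_large_l32": "repcodec_dim1280.yaml",
}


def config_path_for(model_name, config_dir="repcodec/configs"):
    """Return the config path matching a released RepCodec model name."""
    try:
        file_name = CONFIG_BY_MODEL[model_name]
    except KeyError:
        known = ", ".join(sorted(CONFIG_BY_MODEL))
        raise ValueError(
            f"unknown model {model_name!r}; expected one of: {known}"
        ) from None
    return f"{config_dir}/{file_name}"
```

For instance, `config_path_for("hubert_base_l9")` yields the same path that is hard-coded in the example.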
We use a config file to set up all the training configurations, e.g., data, model architecture, optimizer, scheduler. We provide an example here.
Please first install required packages following Installation and prepare the representations following Representation Preparation.
The input data directory is expected to have the following structure:
/dir/to/representations/
    train_set_name/
        0_1.npy
        0_1.len
    valid_set_name/
        0_1.npy
        0_1.len
    test_set_name/
        0_1.npy
        0_1.len
The names of subsets should be the same as the fields in the config file.
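Before launching training, it can help to confirm that the layout described above is in place. The sketch below lists any expected file that is missing; the helper name `missing_files` is hypothetical, and the subset names passed to it should be whatever your config file uses.

```python
from pathlib import Path

# File names expected inside every subset directory.
EXPECTED_FILES = ("0_1.npy", "0_1.len")


def missing_files(data_root, subsets):
    """Return the expected representation files that do not exist."""
    missing = []
    for subset in subsets:
        for name in EXPECTED_FILES:
            path = Path(data_root) / subset / name
            if not path.is_file():
                missing.append(str(path))
    return missing
```

An empty list means every subset directory contains both files, for example when calling `missing_files("/dir/to/representations", ["train_set_name", "valid_set_name", "test_set_name"])`.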
Then, you can run training by
python train.py \
-c /path/to/config/file \
--tag $tag \
--exp_root exp
tag is the name of the output folder.
All outputs will be saved to exp_root/tag/.
Feature Type | Speech Data | Vocoder | f0 quantizer |
---|---|---|---|
HuBERT large layer 18 | VCTK | hubert_large_l18 | vctk_v0_vq |
We train our vocoders following facebookresearch/speech-resynthesis.
Please install necessary packages and follow detailed instructions there.
We provide only an example for the VCTK dataset here. All commands should be run under the speech-resynthesis directory.
Data preparation
Please download VCTK here, run preprocessing, and train a RepCodec model on it.
Then you can prepare the data as the format of this file. Note that you can keep the key "hubert" unchanged and simply replace the unit sequences with RepCodec unit sequences.
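The unit replacement described above can be sketched as follows. This assumes, without confirmation from the text here, that each line of the speech-resynthesis data file is a Python dict literal whose "hubert" field holds space-separated unit IDs; please check the referenced example file before relying on it. The helper name `swap_units` is hypothetical.

```python
import ast


def swap_units(manifest_line, repcodec_tokens):
    """Replace the unit sequence under the "hubert" key of one data line.

    manifest_line:   one line of the data file (assumed dict literal)
    repcodec_tokens: iterable of integer RepCodec tokens for the utterance
    """
    sample = ast.literal_eval(manifest_line.strip())
    if "hubert" not in sample:
        raise KeyError('expected a "hubert" field in the data line')
    # Keep the key name unchanged; only the unit sequence is replaced.
    sample["hubert"] = " ".join(str(t) for t in repcodec_tokens)
    return str(sample)
```

All other fields of the line are carried over untouched, so the rest of the speech-resynthesis pipeline sees the same structure with RepCodec units in place of the original ones.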
Train
First, you need to train an F0 quantizer model by running
python -m torch.distributed.launch --nproc_per_node ${NUM_GPU} train_f0_vq.py \
--checkpoint_path checkpoints/vctk_f0_vq \
--config configs/VCTK/f0_vqvae.json
Then, you can train a vocoder by
python -m torch.distributed.launch --nproc_per_node ${NUM_GPU} train.py \
--checkpoint_path checkpoints/vctk_repcodec_hubert_large \
--config configs/VCTK/repcodec_hubert_large_l18.json
The config file is the same as this except we use a num_embeddings of 1024.
You may want to change the paths of input_training_file and input_validation_file in the config file as well.
The data format is the one mentioned above.
If you use the f0 quantizer we provide, you also need to change f0_quantizer_path.
Inference
You can run inference by
python inference.py \
--checkpoint_file checkpoints/vctk_repcodec_hubert_large/g_00400000 \
-n 5000 \
--vc \
--input_code_file datasets/VCTK/repcodec_hubert_large_l18/test.txt \
--output_dir generations_multispkr
The format of input_code_file is also the one mentioned above.
Our implementation is based on facebookresearch/AudioDec. We thank them for open-sourcing their code!
If you find our work useful, please cite the following article.
@misc{huang2023repcodec,
title={RepCodec: A Speech Representation Codec for Speech Tokenization},
author={Zhichao Huang and Chutong Meng and Tom Ko},
year={2023},
eprint={2309.00169},
archivePrefix={arXiv},
primaryClass={eess.AS}
}