Thanks to Microsoft's implementation of PriorGrad, which serves as the base of this implementation.
This repository is a work-in-progress and does not produce good outputs yet. Stay tuned!
To begin, create a Python environment using your method of choice.
Then, run the following to install the requirements:
pip install -r requirements.txt
I use accelerate for data-parallel training. Even if you only wish to train on a single device, run the following command:
accelerate config
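If you prefer to skip the interactive prompts for a simple single-device setup, recent versions of accelerate also provide a non-interactive variant that writes out a default configuration:
accelerate config default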
If you want to run with PyTorch 2.0 support, run the following:
pip install --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu{cuda version}
replacing {cuda version} with your installed CUDA version (e.g., 118).
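For example, with CUDA 11.8 installed, this becomes:
pip install --pre torch --force-reinstall --index-url https://download.pytorch.org/whl/nightly/cu118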
Download LJSpeech and extract it to dataset/ljspeech/.
Create the train, valid, and test splits by creating text files in dataset/ljspeech/ named train.txt, valid.txt, and test.txt, each containing line-separated audio file paths.
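For example, here is a minimal sketch for generating these files. It assumes the standard LJSpeech layout (wav files under dataset/ljspeech/wavs/) and uses an arbitrary 95/2.5/2.5 random split; neither assumption is mandated by this repository.

```python
# make_splits.py -- a minimal sketch, not part of this repository.
# Assumes the standard LJSpeech layout: wavs under dataset/ljspeech/wavs/.
import random
from pathlib import Path

root = Path("dataset/ljspeech")
paths = sorted(str(p) for p in (root / "wavs").glob("*.wav"))
random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(paths)

# Illustrative 95/2.5/2.5 split; adjust the ratios to taste.
n = len(paths)
splits = {
    "train.txt": paths[: int(0.95 * n)],
    "valid.txt": paths[int(0.95 * n) : int(0.975 * n)],
    "test.txt": paths[int(0.975 * n) :],
}
for name, split in splits.items():
    (root / name).write_text("\n".join(split) + "\n")
```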
Run training using the following command:
accelerate launch train.py --config-path {optional-path-to-config-yaml}
Checkpoints will be stored in exp/{date}_{time}.
Once training is complete, run the following to run the inference loop on a file of your choice:
accelerate launch inference.py {input-wav-file} {output-wav-file} --resume-dir {checkpoint-dir}
This script simply takes an input file, computes a mel-spectrogram, and attempts to reconstruct the waveform using the mel-spectrogram alone.
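Conceptually, the analysis half of that loop looks something like the following sketch (using torchaudio; the STFT parameters shown are illustrative placeholders, as the real values come from the training config):

```python
# A minimal sketch of the analysis step, not the actual inference.py code.
import torchaudio

# Load the input waveform (LJSpeech audio is mono, 22.05 kHz).
wav, sr = torchaudio.load("input.wav")

# Compute a mel-spectrogram; n_fft, hop_length, and n_mels below are
# illustrative -- the real values are set by the training config.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
)
mel = mel_transform(wav)  # shape: (channels, n_mels, frames)

# The vocoder then runs the reverse diffusion loop conditioned on `mel`
# to reconstruct the waveform, never touching the input audio directly.
```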
TODO: explain how to configure using yaml and command line
SpecGrad: Diffusion probabilistic model based neural vocoder with adaptive noise spectral shaping
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani
@article{koizumi2022specgrad,
  title={SpecGrad: Diffusion probabilistic model based neural vocoder with adaptive noise spectral shaping},
  author={Koizumi, Yuma and Zen, Heiga and Yatabe, Kohei and Chen, Nanxin and Bacchiani, Michiel},
  journal={arXiv preprint arXiv:2203.16749},
  year={2022}
}