maum-ai / phaseaug

ICASSP 2023 Accepted
https://maum-ai.github.io/phaseaug/
BSD 3-Clause "New" or "Revised" License
gan speech-synthesis vocoder

PhaseAug

PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping
Junhyeok Lee, Seungu Han, Hyunjae Cho, Wonbin Jung @ MINDsLab Inc., SNU, KAIST


Abstract : Previous generative adversarial network (GAN)-based neural vocoders are trained to reconstruct the exact ground truth waveform from the paired mel-spectrogram and do not consider the one-to-many relationship of speech synthesis. This conventional training causes overfitting for both the discriminators and the generator, leading to the periodicity artifacts in the generated audio signal. In this work, we present PhaseAug, the first differentiable augmentation for speech synthesis that rotates the phase of each frequency bin to simulate one-to-many mapping. With our proposed method, we outperform baselines without any architecture modification. Code and audio samples will be available at https://github.com/maum-ai/phaseaug.

Accepted to ICASSP 2023



## Use PhaseAug

For more complex applications, the authors recommend reading the code in [PITS](https://github.com/anonymous-pits/pits).
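
As a rough illustration of the idea stated in the abstract (rotating the phase of each frequency bin), the sketch below uses only the Python standard library. It is not the PhaseAug implementation: the helper names `dft`, `idft`, and `rotate_phase`, the whole-signal DFT (instead of a framed STFT), the fixed angle `phi`, and the linear per-bin angle schedule are all assumptions made for this example. It only shows that phase rotation changes the waveform while leaving the magnitude spectrum untouched.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive O(N^2) inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def rotate_phase(x, phi):
    """Rotate the phase of each positive-frequency bin of a real signal.

    Assumption for this sketch: bin k is rotated by phi * k / (n / 2).
    Negative-frequency bins receive the conjugate value so the result stays
    real; the DC bin (and the Nyquist bin for even n) are left unchanged.
    """
    n = len(x)
    spec = dft(x)
    out = list(spec)
    for k in range(1, (n + 1) // 2):
        rotated = spec[k] * cmath.exp(1j * phi * k / (n / 2))
        out[k] = rotated
        out[n - k] = rotated.conjugate()
    # Imaginary parts are only floating-point noise for a Hermitian spectrum.
    return [v.real for v in idft(out)]


# Toy signal with two sinusoidal components (hypothetical test input).
n = 16
x = [math.sin(2 * math.pi * 2 * t / n) + 0.5 * math.cos(2 * math.pi * 5 * t / n)
     for t in range(n)]
phi = 0.7  # fixed here for reproducibility; an augmentation would sample it randomly
y = rotate_phase(x, phi)
```

Here `x` and `y` are different waveforms with identical magnitude spectra, which is the sense in which one spectral representation can correspond to many waveforms.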

## Requirements
- [PyTorch>=1.7.0](https://pytorch.org/), required by [alias-free-torch](https://github.com/junjun3518/alias-free-torch)
- PyTorch>=2.0.0 is also supported.
- The requirements are listed in [requirements.txt](./requirements.txt).
- We also provide a Docker setup in the [Dockerfile](./Dockerfile). Build the image from the repository root:

      docker build -t=phaseaug --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g) --build-arg USER_NAME=$USER .

- Cloned [official HiFi-GAN repo](https://github.com/jik876/hifi-gan).
- Downloaded [LJ Speech Dataset](https://keithito.com/LJ-Speech-Dataset/).
- (optional) [MelGAN](https://github.com/descriptinc/melgan-neurips) generator

## Training
0. Clone this repository and copy the Python files into the hifi-gan folder.
```bash
git clone --recursive https://github.com/maum-ai/phaseaug
cp ./phaseaug/*.py ./phaseaug/hifi-gan/
cd ./phaseaug/hifi-gan
```
1. Modify the dataset paths in train.py.
```python
parser.add_argument('--input_wavs_dir',
                    default='path/LJSpeech-1.1/wavs_22k')
parser.add_argument('--input_mels_dir',
                    default='path/LJSpeech-1.1/wavs_22k')
```
2. Run train.py. `{0.01/0.1/1.}` means choose one of the three values for `--data_ratio`.
```bash
python train.py --config config_v1.json --aug --filter --data_ratio {0.01/0.1/1.} --name phaseaug_hifigan
python train.py --config config_v1_melgan.json --aug --filter --data_ratio {0.01/0.1/1.} --name phaseaug_melgan
```
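
The `{0.01/0.1/1.}` placeholder in the commands above stands for a single value per run. As a convenience, the following sketch expands both commands into one concrete command line per data ratio. It uses only the flags, config files, and run names shown above; the helper `build_commands` and the constants `RUNS` and `DATA_RATIOS` are hypothetical names introduced for this example and are not part of the repository.

```python
import shlex

# (config file, run name) pairs copied from the training commands above.
RUNS = [
    ("config_v1.json", "phaseaug_hifigan"),
    ("config_v1_melgan.json", "phaseaug_melgan"),
]

# The three values listed in the {0.01/0.1/1.} placeholder.
DATA_RATIOS = ["0.01", "0.1", "1."]


def build_commands(runs=RUNS, ratios=DATA_RATIOS):
    """Return one shell-safe command string per (run, data ratio) pair."""
    commands = []
    for config, name in runs:
        for ratio in ratios:
            argv = [
                "python", "train.py",
                "--config", config,
                "--aug", "--filter",
                "--data_ratio", ratio,
                "--name", name,
            ]
            commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for command in build_commands():
        print(command)
```

Note that this sketch reuses the same `--name` for every ratio, exactly as in the commands above; whether runs with the same name share an output directory is not stated here, so check that before launching several ratios back to back.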

## References

This implementation uses code from the following repositories:

This README and the webpage for the audio samples are inspired by:

## Citation & Contact

If this repository is useful for your research, please consider citing it!

@INPROCEEDINGS{phaseaug,
  author={Lee, Junhyeok and Han, Seungu and Cho, Hyunjae and Jung, Wonbin},
  booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={PhaseAug: A Differentiable Augmentation for Speech Synthesis to Simulate One-to-Many Mapping},
  year={2023},
  volume={},
  number={},
  pages={1-5},
  doi={10.1109/ICASSP49357.2023.10096374}
}

The BibTeX entry has been updated to the ICASSP 2023 version. Please note that the page numbers are provisional.

If you have questions or other inquiries, please contact Junhyeok Lee at jun3518@icloud.com.