kan-bayashi / ParallelWaveGAN

Unofficial Parallel WaveGAN (+ MelGAN & Multi-band MelGAN & HiFi-GAN & StyleMelGAN) with Pytorch
https://kan-bayashi.github.io/ParallelWaveGAN/
MIT License

Training StyleMelGan on custom dataset #329

Closed: skol101 closed this issue 2 years ago

skol101 commented 2 years ago

Hello again :)

Are mono lab files required to do the training, or can that step be skipped using this script: https://gist.github.com/kan-bayashi/eceafcd35a2351f5f6bf89a1ccb956e9 ?

kan-bayashi commented 2 years ago

The use of .lab files is a special setting only for VCTK, since it contains a lot of silence. (In PWG, long silence hurts performance.) Usually, we do not need to use it. You can follow this page to make a recipe: https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs#how-to-make-the-recipe-for-your-own-dateset

skol101 commented 2 years ago

What if I use VCTK plus a custom dataset with silence removed? Btw, I managed to create mono labels using merlin, but when running further along the script I got this error in the train_nodev_all preprocessing log:

 File "/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/numpy/core/_methods.py", line 40, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
Accounting: time=3 threads=1
kan-bayashi commented 2 years ago

In such a case, the script may be suitable; you can skip the lab file creation for your trimmed audio: https://gist.github.com/kan-bayashi/eceafcd35a2351f5f6bf89a1ccb956e9 Note that only wav.scp and segments are needed for the vocoder training.
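For reference, wav.scp and segments follow the Kaldi-style data-directory format. A minimal sketch that writes both files (the utterance IDs, paths, and timestamps below are made-up examples, not from this thread):

```python
# Sketch: write Kaldi-style wav.scp and segments files for vocoder training.
# All IDs, paths, and timestamps here are hypothetical examples.

# wav.scp maps utterance ID -> audio file path.
wav_scp_entries = {
    "spk1_utt1": "/data/wavs/spk1_utt1.wav",
    "spk1_utt2": "/data/wavs/spk1_utt2.wav",
}

# segments maps utterance ID -> (recording ID, start sec, end sec),
# so edge silence can be excluded without rewriting the audio files.
segments_entries = {
    "spk1_utt1": ("spk1_utt1", 0.32, 3.85),
    "spk1_utt2": ("spk1_utt2", 0.10, 2.47),
}

with open("wav.scp", "w") as f:
    for utt_id, path in wav_scp_entries.items():
        f.write(f"{utt_id} {path}\n")

with open("segments", "w") as f:
    for utt_id, (rec_id, start, end) in segments_entries.items():
        f.write(f"{utt_id} {rec_id} {start:.2f} {end:.2f}\n")
```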

Btw, I managed to create mono labels using merlin, but when running further along the script I got the error in the train_nodev_all preprocessing log:

    n_fft=2048 is too small for input signal of length=121

Not sure, but your audio seems too short; maybe you can simply filter those files out.
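Filtering out too-short files can be done with a quick duration check. A sketch using only the Python standard library (the file names and the 2048-sample threshold, chosen to match n_fft=2048 in the warning, are illustrative):

```python
import wave

def write_wav(path, n_samples, sr=24000):
    """Write a silent 16-bit mono WAV with n_samples samples (demo only)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(b"\x00\x00" * n_samples)

def too_short(path, min_samples=2048):
    """True if the file has fewer samples than the FFT size (n_fft=2048 here)."""
    with wave.open(path, "rb") as w:
        return w.getnframes() < min_samples

write_wav("ok.wav", 48000)   # 2 s at 24 kHz: long enough
write_wav("short.wav", 121)  # matches the length=121 in the warning above
kept = [p for p in ["ok.wav", "short.wav"] if not too_short(p)]
print(kept)  # only ok.wav survives
```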

kan-bayashi commented 2 years ago

Wait, I remembered that mixing a directory with segments and one without segments is not supported in this repo's VCTK data prep script:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/vctk/voc1/run.sh#L73

Other options are:
a. Reuse the dump directories created in espnet: https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs#run-training-using-espnet2-tts-recipe-within-5-minutes
b. Use the template recipe as is (it does not use lab files) and perform trimming in the feature extraction stage via: https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_multi_spk/voc1/conf/parallel_wavegan.v1.yaml#L18

Since the vocoder training uses randomly cropped segments as the batch, the trimming accuracy is not so important; you can trim aggressively.
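To illustrate why trim accuracy matters little: training draws a random fixed-length window from each trimmed utterance, so rough boundaries only change which windows can be sampled. A toy sketch (the function name and crop length are made up, not the repo's code):

```python
import numpy as np

def random_crop(audio, crop_len, rng):
    """Pick a random fixed-length window from a (longer) trimmed utterance."""
    start = rng.integers(0, len(audio) - crop_len + 1)
    return audio[start:start + crop_len]

rng = np.random.default_rng(0)
utt = np.zeros(24000, dtype=np.float32)        # 1 s of dummy audio at 24 kHz
crop = random_crop(utt, crop_len=8192, rng=rng)
print(crop.shape)  # (8192,)
```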

skol101 commented 2 years ago

I actually posted the warning above, but it looks like the error is this:

 29%|██▊      | 4/14 [00:00<00:00, 33.51it/s]
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/vc/bin/parallel-wavegan-preprocess", line 8, in <module>
    sys.exit(main())
  File "/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/parallel_wavegan/bin/preprocess.py", line 182, in main
    np.abs(audio).max() <= 1.0
  File "/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/numpy/core/_methods.py", line 40, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity

kan-bayashi commented 2 years ago

You should check the audio array; it seems it was zero-length. E.g.:

    soxi /path/to/doubtful_audio_file
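The failing line in the traceback is the sanity check np.abs(audio).max() <= 1.0, which cannot reduce over an empty array. A minimal reproduction, plus the kind of pre-check one could add (a sketch, not the repo's actual code):

```python
import numpy as np

audio = np.array([])              # what an empty/zero-length file loads as
try:
    np.abs(audio).max()           # the failing check in preprocess.py
    raised = False
except ValueError:
    raised = True                 # "zero-size array to reduction operation ..."

# A defensive pre-check before preprocessing (sketch):
is_empty = audio.size == 0
print(raised, is_empty)
```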

skol101 commented 2 years ago

Under the same conda env I've just run the yesno recipe. The script is training!

Successfully finished feature extraction of dev set.
Successfully finished feature extraction of eval set.
Successfully finished feature extraction of train_nodev set.
Successfully finished feature extraction.
Statistics computation start. See the progress via dump/train_nodev/compute_statistics.log.
2022-02-10 17:49:10,876 (compute_statistics:113) INFO: The number of files = 40.
....
[decode]: 100%|██████████| 10/10 [00:00<00:00, 16.29it/s, RTF=0.00135]
2022-02-10 17:49:27,638 (decode:176) INFO: Finished generation of 10 utterances (RTF = 0.013).
Successfully finished decoding of dev set.
Successfully finished decoding of eval set.
Successfully finished decoding.
Finished.

Just in case, I will download VCTK-Corpus.tar.gz from udialogue.org instead of using my own local copy. We shall see if it makes any difference.

Indeed! The issue was that I was using the silence-removed VCTK dataset, but the labels were created for the non-silence-removed dataset.

skol101 commented 2 years ago

Should I still run training with trim_silence: true? Also, a related question about resuming training: I've updated train_max_steps, but I'm not sure from which step the discriminator should start.

discriminator_train_start_steps: 1600000 # Number of steps to start to train discriminator.
train_max_steps: 2000000                # Number of training steps.
kan-bayashi commented 2 years ago

Should I still run training with trim_silence: true?

If you use silence removed audio, you do not need it.

Also, a related question about resuming training: I've updated train_max_steps, but I'm not sure from which step the discriminator should start.

If you just want to make the training longer, you do not need to touch discriminator_train_start_steps. Please use the default value of 100000 steps, which is the generator pretraining phase.

skol101 commented 2 years ago

If you use silence removed audio, you do not need it.

Cheers, the issue was that with silence-removed audio the training didn't start. I assume you trained the VCTK StyleMelGAN on the VCTK corpus with silence, and the trim_silence option was set to false?

If you just want to make the training longer, you do not need to touch discriminator_train_start_steps. Please use the default value of 100000 steps, which is the generator pretraining phase.

I see, so it doesn't apply during finetuning.

kan-bayashi commented 2 years ago

Cheers, the issue was that with silence-removed audio the training didn't start. I assume you trained the VCTK StyleMelGAN on the VCTK corpus with silence, and the trim_silence option was set to false?

I trained the VCTK model with segments (created from the .lab files), so the silence-trimmed audio is loaded in feature extraction, and therefore I did not use the trim_silence option.

skol101 commented 2 years ago

Must segment training be manually specified?

kan-bayashi commented 2 years ago

Specified here: https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/vctk/voc1/run.sh#L91-L96

kan-bayashi commented 2 years ago

You should replace the melspectrogram extraction part:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/parallel_wavegan/bin/preprocess.py#L25-L78
You can include mean and std normalization there. Then, remove the normalization part:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_single_spk/voc1/run.sh#L94-L129

Change ${dumpdir}/hogehoge/norm -> ${dumpdir}/hogehoge/raw:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_single_spk/voc1/run.sh#L154-L155
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_single_spk/voc1/run.sh#L176

Remove the stats copy line:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_single_spk/voc1/run.sh#L144
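Putting the first step together, here is a rough sketch of what an extraction function with mean/std normalization folded in could look like (extract_normalized_mel, mel_fn, mean, and std are hypothetical names for illustration, not the repo's API; the dummy extractor stands in for a real melspectrogram function):

```python
import numpy as np

def extract_normalized_mel(audio, mel_fn, mean, std):
    """Sketch: fold mean/std normalization into extraction, so the separate
    normalization stage (and the copied stats file) is no longer needed."""
    mel = mel_fn(audio)            # expected shape: (T', num_mels)
    return (mel - mean) / std

# Dummy extractor for demonstration: 80 mel bins, hop size 256.
def dummy_mel_fn(audio, num_mels=80, hop=256):
    frames = len(audio) // hop
    return np.ones((frames, num_mels), dtype=np.float32)

audio = np.zeros(2560, dtype=np.float32)
mel = extract_normalized_mel(audio, dummy_mel_fn, mean=0.5, std=0.5)
print(mel.shape)  # (10, 80)
```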

skol101 commented 2 years ago

Cheers for this. Almost there! I did everything as you suggested, including replacing the melspectrogram extraction part with preprocess(wave).

When training starts at iteration 0, I get this error:

  File "ParallelWaveGAN/parallel_wavegan/bin/train.py", line 601, in __call__
    c_batch = torch.tensor(c_batch, dtype=torch.float).transpose(2, 1)  # (B, C, T')
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 2)

kan-bayashi commented 2 years ago

Please check the shape of the output of the replaced melspectrogram function. It needs to match: https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/parallel_wavegan/bin/preprocess.py#L54
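The linked line returns features frames-first, i.e. shape (T', num_mels); the IndexError above usually means the replaced extractor returned a 1-D array or a mel-first layout. A small sanity check (num_mels=80 is an assumed config value, and the mel-first array is a deliberately wrong example):

```python
import numpy as np

num_mels = 80
mel = np.zeros((num_mels, 123), dtype=np.float32)  # (num_mels, T') by mistake
if mel.ndim == 2 and mel.shape[0] == num_mels and mel.shape[1] != num_mels:
    mel = mel.T                                    # to the expected (T', num_mels)
print(mel.shape)  # (123, 80)
```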

skol101 commented 2 years ago

Thank you! That really helped!