The use of the .lab file is a special setting only for VCTK, since it contains much silence. (In PWG, longer silence affects the performance.) Usually, we do not need to use it. You can follow this page to make a recipe: https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs#how-to-make-the-recipe-for-your-own-dateset
What if I use VCTK + a custom dataset with silence removed? Btw, I managed to create mono labels using Merlin, but when running further along the script I got this error in the train_nodev_all preprocessing log:
File "/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/numpy/core/_methods.py", line 40, in _amax
return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
Accounting: time=3 threads=1
In such a case, that script may be suitable; you can skip the .lab file creation for your trimmed audio. https://gist.github.com/kan-bayashi/eceafcd35a2351f5f6bf89a1ccb956e9 Note that only wav.scp and segments are needed for the vocoder training.
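For illustration, here is a minimal sketch of preparing those two files, assuming a flat directory of already-recorded wavs and using librosa's trim to find the non-silent region (the paths and the top_db threshold are illustrative assumptions, not part of the repo):

```python
# Hedged sketch (not from the repo): build Kaldi-style wav.scp and segments
# files for a flat directory of wavs, trimming silence with librosa.
import glob
import os

import librosa

wav_dir = "downloads/my_corpus/wavs"  # assumed corpus location
data_dir = "data/train_nodev_all"
os.makedirs(data_dir, exist_ok=True)

with open(f"{data_dir}/wav.scp", "w") as f_scp, \
        open(f"{data_dir}/segments", "w") as f_seg:
    for path in sorted(glob.glob(f"{wav_dir}/*.wav")):
        utt_id = os.path.splitext(os.path.basename(path))[0]
        audio, sr = librosa.load(path, sr=None)
        # locate the non-silent region; the threshold is a guess to tune
        _, (start, end) = librosa.effects.trim(audio, top_db=40)
        f_scp.write(f"{utt_id} {path}\n")
        # segments format: <utt-id> <recording-id> <start-sec> <end-sec>
        f_seg.write(f"{utt_id} {utt_id} {start / sr:.3f} {end / sr:.3f}\n")
```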
Btw, I managed to create mono labels using Merlin, but when running further along the script I got this warning in the train_nodev_all preprocessing log:
n_fft=2048 is too small for input signal of length=121
warnings.warn(
Not sure, but your audio seems too short; maybe you can simply filter those files out.
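A quick way to find such files is to scan the durations, e.g. with soundfile; a minimal sketch, assuming the same flat wav directory as above and using the n_fft from the warning as the cutoff:

```python
# Hedged sketch: list wavs shorter than one FFT window so they can be
# excluded from the data directory. Paths are placeholders.
import glob

import soundfile as sf

min_samples = 2048  # the n_fft from the warning above
for path in sorted(glob.glob("downloads/my_corpus/wavs/*.wav")):
    n = sf.info(path).frames
    if n < min_samples:
        print(f"too short ({n} samples): {path}")
```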
Wait, I remembered that mixing a directory with segments and one without segments is not supported in this repo's VCTK data prep script. https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/vctk/voc1/run.sh#L73
Other options are:
a. Reuse the dump directories created in ESPnet: https://github.com/kan-bayashi/ParallelWaveGAN/tree/master/egs#run-training-using-espnet2-tts-recipe-within-5-minutes
b. Use the template recipe as is (it does not use the .lab file) and perform trimming in the feature extraction stage via https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_multi_spk/voc1/conf/parallel_wavegan.v1.yaml#L18
Since the vocoder training uses randomly cropped segments as the batch, the trimming accuracy is not so important; you can do aggressive trimming.
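For context, the trim_silence path in preprocessing is based on librosa's trim, so "aggressive" here essentially means a lower dB threshold. A minimal sketch with illustrative values (the filename and thresholds are assumptions):

```python
# Sketch: how the trim threshold changes aggressiveness. Frames quieter than
# (peak - top_db) dB are treated as silence, so a lower top_db trims more.
import librosa

audio, sr = librosa.load("downloads/my_corpus/wavs/utt001.wav", sr=None)
gentle, _ = librosa.effects.trim(audio, top_db=60)      # conservative
aggressive, _ = librosa.effects.trim(audio, top_db=30)  # trims more silence
print(len(audio), len(gentle), len(aggressive))
```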
I actually posted the warning above, but it looks like the error is this:
29%|██▊ | 4/14 [00:00<00:00, 33.51it/s]
Traceback (most recent call last):
  File "/home/sk/anaconda3/envs/vc/bin/parallel-wavegan-preprocess", line 8, in <module>
    sys.exit(main())
  File "/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/parallel_wavegan/bin/preprocess.py", line 182, in main
    np.abs(audio).max() <= 1.0
  File "/home/sk/anaconda3/envs/vc/lib/python3.8/site-packages/numpy/core/_methods.py", line 40, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity
You should check the audio array. It seems it was zero-length.
E.g., soxi /path/to/doubtful_audio_file
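A programmatic equivalent of that soxi check, as a small sketch (the path is the same placeholder as above):

```python
# Sketch: inspect a doubtful file's length with soundfile; a zero frame
# count here would explain the zero-size array error.
import soundfile as sf

info = sf.info("/path/to/doubtful_audio_file")
print(info.frames, info.samplerate, info.duration)
```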
Under the same conda env I've just run the yesno recipe. The script is training!
Successfully finished feature extraction of dev set.
Successfully finished feature extraction of eval set.
Successfully finished feature extraction of train_nodev set.
Successfully finished feature extraction.
Statistics computation start. See the progress via dump/train_nodev/compute_statistics.log.
2022-02-10 17:49:10,876 (compute_statistics:113) INFO: The number of files = 40.
....
[decode]: 100%|██████████| 10/10 [00:00<00:00, 16.29it/s, RTF=0.00135]
2022-02-10 17:49:27,638 (decode:176) INFO: Finished generation of 10 utterances (RTF = 0.013).
Successfully finished decoding of dev set.
Successfully finished decoding of eval set.
Successfully finished decoding.
Finished.
Just in case, I will download VCTK-Corpus.tar.gz from udialogue.org instead of using my own local copy. We'll see if it makes any difference.
Indeed! The issue was that I was using the silence-removed VCTK dataset, but the labels were created for the non-silence-removed dataset.
Should I still run training with trim_silence: true? Also, a related question when resuming training: I've updated train_max_steps, but I'm not sure when the discriminator should start.
discriminator_train_start_steps: 1600000 # Number of steps to start to train discriminator.
train_max_steps: 2000000 # Number of training steps.
Should I still run training with trim_silence: true?
If you use silence-removed audio, you do not need it.
Also, a related question when resuming training: I've updated train_max_steps, but I'm not sure when the discriminator should start.
If you want to just make the training longer, you do not need to touch discriminator_train_start_steps. Please use the default value of 100000 steps, which is the generator pretraining phase.
If you use silence-removed audio, you do not need it.
Cheers, the issue was that with silence-removed audio the training didn't start. I assume you trained the VCTK StyleMelGAN model on the VCTK corpus with silence, and the 'trim_silence' option was set to false?
If you want to just make the training longer, you do not need to touch discriminator_train_start_steps. Please use the default value of 100000 steps, which is the generator pretraining phase.
I see, so it doesn't apply during finetuning.
Cheers, the issue was that with silence-removed audio the training didn't start. I assume you trained the VCTK StyleMelGAN model on the VCTK corpus with silence, and the 'trim_silence' option was set to false?
I trained the VCTK model with segments (created from the .lab files), so the silence-trimmed audio is loaded in feature extraction, and therefore I did not use the trim_silence option.
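For context, this is roughly what a segments entry implies at load time; a hedged sketch, not the repo's exact loading code, and the file names are illustrative:

```python
# Sketch: a Kaldi-style segments line gives start/end times in seconds that
# are used to slice the recording before feature extraction.
import soundfile as sf

# example segments line: "<utt-id> <recording-id> <start-sec> <end-sec>"
utt_id, rec_id, start, end = "p225_001 p225_001 0.320 3.850".split()
audio, sr = sf.read("downloads/VCTK-Corpus/wav48/p225/p225_001.wav")
trimmed = audio[int(float(start) * sr): int(float(end) * sr)]
```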
Does segment-based training have to be manually specified?
You should replace the melspectrogram extraction part:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/parallel_wavegan/bin/preprocess.py#L25-L78
You can include mean and std normalization there. Then, remove the normalization part:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_single_spk/voc1/run.sh#L94-L129
Change ${dumpdir}/hogehoge/norm -> ${dumpdir}/hogehoge/raw in these lines:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_single_spk/voc1/run.sh#L154-L155
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_single_spk/voc1/run.sh#L176
And remove the stat copy line:
https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/egs/template_single_spk/voc1/run.sh#L144
A sketch of such a replacement function is given below.
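To make the suggestion concrete, here is a hedged sketch of what a replacement extraction function with folded-in normalization could look like. The function name, default parameters, and the stats file layout (a stacked [mean, std] array) are assumptions, not the repo's API:

```python
# Minimal sketch of a log-mel extraction that normalizes in one step, so the
# recipe's separate normalization stage can be dropped.
import librosa
import numpy as np


def logmelfilterbank_with_norm(
    audio,
    sampling_rate,
    fft_size=1024,
    hop_size=256,
    num_mels=80,
    stats="data/stats.npy",  # assumed file holding stacked [mean, std]
    eps=1e-10,
):
    """Extract a log-mel spectrogram and mean/std-normalize it."""
    x = librosa.stft(audio, n_fft=fft_size, hop_length=hop_size, window="hann")
    spc = np.abs(x).T  # (#frames, #bins)
    mel_basis = librosa.filters.mel(sr=sampling_rate, n_fft=fft_size, n_mels=num_mels)
    mel = np.log10(np.maximum(eps, np.dot(spc, mel_basis.T)))
    mean, std = np.load(stats)  # each of shape (num_mels,)
    return (mel - mean) / std  # still (#frames, #num_mels)
```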
Cheers for this. Almost there! I did everything as you suggested, including replacing the melspectrogram extraction part with preprocess(wave).
When training starts at iteration 0, I get this error:
File "ParallelWaveGAN/parallel_wavegan/bin/train.py", line 601, in __call__
    c_batch = torch.tensor(c_batch, dtype=torch.float).transpose(2, 1)  # (B, C, T')
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 2)
Please check the shape of the outputs of the replaced melspectrogram function. It needs to be like this: https://github.com/kan-bayashi/ParallelWaveGAN/blob/5bef5a0610e5e0d6153b601bbf91c78582260c8f/parallel_wavegan/bin/preprocess.py#L54
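This kind of IndexError typically means the stacked features did not have the expected (B, T', C) layout, e.g. because the extraction returned something other than a 2-D (#frames, #num_mels) array. A quick sanity check, reusing the sketch function above (dummy input; same assumptions as before, including the stats file):

```python
# Sketch: verify the replacement returns 2-D (#frames, #num_mels) features.
import numpy as np

audio = np.random.randn(24000).astype(np.float32)  # 1 s of dummy audio
feats = logmelfilterbank_with_norm(audio, 24000)
assert feats.ndim == 2 and feats.shape[1] == 80, feats.shape
```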
Thank you! That really helped!
Hello again :)
Are mono .lab files required to do the training, or can that step be skipped using this script? https://gist.github.com/kan-bayashi/eceafcd35a2351f5f6bf89a1ccb956e9