Hiya, thanks for your great work, it really does work well.
I was trying to train my own model and hit a few small hitches, which I'm reporting here in the hope of saving future users of the code some time:
When following the instructions to resample the VCTK files, it took me a while to work out that you need to pass the full sample rate in Hz (e.g. 16000), not the sample rate in kHz (16) as given in the example command lines. A small update to the README would be nice for future users (e.g. --target_sr 4 -> --target_sr 4000).
Converting the audio from .flac (the VCTK default) to .wav is also a necessary step before running the scripts (though note that sox itself has no issue reading .flac files).
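For anyone else doing the conversion, here is a minimal sketch of how it could be batched. It only relies on the plain `sox in.flac out.wav` form; the function names and the directory layout (a source tree mirrored into a destination tree) are my own assumptions, not something the repo provides.

```python
from pathlib import Path
import subprocess


def flac_to_wav_commands(src_root, dst_root):
    """Build one `sox in.flac out.wav` command per .flac file under src_root.

    Hypothetical helper: the output tree mirrors the input tree.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    commands = []
    for flac in sorted(src_root.rglob("*.flac")):
        wav = dst_root / flac.relative_to(src_root).with_suffix(".wav")
        commands.append(["sox", str(flac), str(wav)])
    return commands


def convert_all(src_root, dst_root):
    """Run the commands; requires sox on PATH."""
    for cmd in flac_to_wav_commands(src_root, dst_root):
        # Create the speaker directory before sox writes into it.
        Path(cmd[2]).parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)
```

Calling `convert_all("VCTK/flac", "VCTK/wav")` (both paths are placeholders) would then write one .wav next to each speaker's directory name.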
Creating configuration & dataset .yaml files: I copied the provided 4-16 files and modified appropriately.
One of the files, p271/p271_069_mic1.wav, has exactly 96001 samples, so for the 12-48 kHz task its length is rounded up to 2 sections in the high-res dataset, but after downsampling it is rounded down to 24000 samples, which is only one section. Unfortunately, this mismatch breaks the training code. I fixed it by manually trimming the file by one sample (sox p271_069_mic1.wav p271_trimmed.wav trim 0 -1s) and regenerating the .egs files.
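To make the mismatch concrete, here is a small sketch of the arithmetic. It assumes 2-second sections with a 2-second stride and a last section that is padded rather than dropped; that is my guess at how the sections are counted (it reproduces the 2-vs-1 split I saw), and `num_sections` is a hypothetical helper, not a function from the repo.

```python
import math


def num_sections(num_samples, sample_rate, segment_s=2.0, stride_s=2.0):
    """Count fixed-length sections, assuming the last partial one is padded."""
    segment = int(segment_s * sample_rate)
    stride = int(stride_s * sample_rate)
    if num_samples <= segment:
        return 1
    return math.ceil((num_samples - segment) / stride) + 1


hr_samples = 96001                         # p271_069_mic1.wav at 48 kHz
lr_samples = hr_samples * 12000 // 48000   # 24000 after downsampling, rounded down

print(num_sections(hr_samples, 48000))     # 2: one extra sample spills into a second section
print(num_sections(lr_samples, 12000))     # 1: exactly one 2-second section
print(num_sections(hr_samples - 1, 48000)) # 1: trimming one sample restores the match
```

Under these assumptions, any file whose high-res length is one sample over a multiple of the section length would trigger the same mismatch, so it may be worth checking for it when the .egs files are generated.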
I was left with a few questions:
The paper says the setting for 12-48 was nfft=1024 and hop=256, batch size 8. The models you provide in the drive say 512/256 and 512/128. What should I believe?
Does the code support resuming a training run?
Thanks again for the paper and the code - the results are good and the code is as well.