Closed: george-roussos closed this issue 3 years ago.
Thanks @george-roussos Maybe I'll try with the default Mozilla config values (except the data path) just to ensure that my setup and dependencies are working correctly, not to create a usable model.
I never got this error, but I have only trained using the fork here: https://github.com/freds0/wavegrad. Then I took the spectrogram from TTS and passed it there.
Can you check whether you are using the original LR and other values?
Hey @george-roussos. I've decided to give the repo from freds0 a try. Our taco2 model uses a hop length of 256, but in wavegrad's params.py hop_samples is set to 300, with a hint not to change this value. Is this a problem?
You linked to the original repository.
freds0's fork has the needed adaptations to work with Mozilla's TTS: https://github.com/freds0/wavegrad/blob/master/src/wavegrad/params.py (only the stats_path is missing, I guess).
Sorry @SanjaESC for the confusion. My fault, too late at night and too many open browser tabs ;-). Training is using the right repo. What confuses me is that Tensorboard isn't producing audio samples (the only one shown has a length of 0 seconds). Will audio samples be produced at later training steps?
I think that is normal. When I used it, it would only produce those 1-second audio samples.
Okay, thanks. I'll keep the training running. Just one question: should I continue my questions/discussions on WaveGrad training progress on the "thorsten" dataset over at Mozilla Discourse, so as not to overtax this issue thread?
I'm going crazy. I tried several runs with freds0's training, double-checked all config params (and adjusted them in freds0's repo where needed), and still have no success.
I tried these steps:
Forked the freds0 repo
Adjusted the following values in the code (to match our taco2 model config):
mozilla_tts_audio.py: do_trim_silence=True (was False)
mozilla_tts_audio.py: do_sound_norm=True (was False)
params.py: preemphasis=0.0 (was 0.98)
params.py: max_norm=1.0 (was 4.0)
params.py: do_trim_silence=True (was False)
Built WaveGrad with "pip install ."
"pip list" shows wavegrad installed (venv environment):
wavegrad 0.1.2
Ran preprocessing: "python -m wavegrad.preprocess /path/to/dir/containing/wavs"
This created a spectrogram (*.npy) file for every wav file in my wav dataset directory
python -m wavegrad /path/to/model/dir_for_this_training /path/to/dir/containing/wavs
Training is running (slowly but steadily) and checkpoints are written
Now I want to test whether our taco2 model is compatible with the WaveGrad training:
And this gives following error:
RuntimeError: Given groups=1, weight of size [768, 80, 3], expected input[1, 112, 80] to have 80 channels, but got 112 channels instead
Is this just a formatting error in the array order, or a configuration mismatch? I have no more ideas and I'm about to give up on WaveGrad.
Hi, sorry, this sounds frustrating. How do you extract the spectrogram from synthesize.py? I do it like this:
It always worked okay for me. My params.py looked like this:
params = AttrDict(
# Training params
batch_size=48,
learning_rate=2e-4,
max_grad_norm=1.0,
# upsample factors
#factors=[5, 5, 3, 2, 2], # 5*5*3*2*2=300 (hop_length=300)
factors=[4, 4, 4, 2, 2], # 4*4*4*2*2=256 (this must equal hop_length)
# Audio params
num_mels=80,
fft_size=1024,
sample_rate=22050,
win_length=1024,
hop_length=256, # if you change that change factors
frame_length_ms=None,
frame_shift_ms=None,
preemphasis=0.98,
min_level_db=-100,
ref_level_db=20,
power=1.5,
griffin_lim_iters=60,
stft_pad_mode="reflect",
signal_norm=True,
symmetric_norm=True,
max_norm=4.0,
clip_norm=True,
mel_fmin=0.0,
mel_fmax=8000.0,
spec_gain=20.0,
do_trim_silence=False,
trim_db=60,
# Data params
crop_mel_frames=24,
# Model params
noise_schedule=np.linspace(1e-6, 0.01, 1000).tolist(),
)
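As the comments in the config note, the product of the upsample factors has to equal hop_length. A quick sanity check (my own sketch, not part of params.py) catches mismatches like the 300-vs-256 issue early:

```python
from functools import reduce

# values from the taco2 audio config and params.py above
hop_length = 256
factors = [4, 4, 4, 2, 2]

product = reduce(lambda a, b: a * b, factors, 1)
assert product == hop_length, f"factors multiply to {product}, expected {hop_length}"
print(product)  # 256
```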
Some observations: I did not perform mean-var normalization, so I am not sure whether do_sound_norm should be set to True or not, even if you are doing mean-var training. Just try adding your stats_path and leave do_sound_norm set to False if it ends up not working. But all of the above assumes WaveGrad is not working correctly. If it is working, then this is not relevant and your configuration is correct.
Is this just a format definition error in array order or a configurational mismatch?
I guess you are trying to load the .npy file created during preprocessing? I think it wasn't intended for inference, just for training; that's why you get the mismatch. You could use a transpose to match the expected order.
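For illustration (this helper is my own, not from the repo): if the saved array has shape (frames, n_mels), e.g. the (112, 80) from the error above, a transpose puts the 80 mel bins on the channel axis the conv layer expects:

```python
import numpy as np

NUM_MELS = 80

def to_model_order(spec):
    """Return the spectrogram with mel bins on the first axis, as the model expects."""
    if spec.ndim == 2 and spec.shape[0] != NUM_MELS and spec.shape[1] == NUM_MELS:
        return spec.T  # was saved as (frames, n_mels); flip to (n_mels, frames)
    return spec

# example: an array shaped like the (112, 80) input from the RuntimeError
spec = np.zeros((112, 80), dtype=np.float32)
print(to_model_order(spec).shape)  # (80, 112)
```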
But the easiest way is to test it on spectrograms synthesized directly by TTS. @george-roussos linked an example above showing how you can save one to an .npy file. Further above there is also example code to run it directly: https://github.com/mozilla/TTS/issues/518#issuecomment-696117318
@george-roussos :
Hi, sorry, this sounds frustrating.
Yes, a little, but I knew what I was getting into, so I'll keep trying. My code modifications look similar, as I received a comparable code snippet from @domcross too. I started working on a PR yesterday for saving spectrograms during synthesize.py. I'll check your ideas as soon as possible, thanks.
@SanjaESC :
I guess you are trying to load the .npy file created during preprocess?
No, I saved the spectrograms in synthesize.py, called from the TTS server speaking a random phrase (no ground truth). So hopefully this is the right spectrogram for further processing through WaveGrad.
Even if it is complex, I don't plan to give up, so thanks for your great support, guys :-).
Could you share the code you are using to save it? The problem lies in the format you are exporting the spectrogram in.
around line 180 in synthesizer.py (dev branch):
else:
# use GL
if self.use_cuda:
postnet_output = postnet_output[0].cpu()
# Save spectrogram for further processing in wavegrad
# =========================
spec_filename = "spec-" + sen.replace(" ", "_")
np.save("/tmp/" + spec_filename + ".npy", postnet_output)
# =========================
else:
postnet_output = postnet_output[0]
postnet_output = postnet_output.numpy()
wav = inv_spectrogram(postnet_output, self.ap, self.tts_config)
btw, the tts function definition is the following:
def tts(self, text, speaker_id=None):
Try this:
# use GL
if self.use_cuda:
# Save spectrogram for further processing in wavegrad
# =========================
spec_filename = "spec-" + sen.replace(" ", "_")
np.save("/tmp/" + spec_filename + ".npy", postnet_output[0].T)
# =========================
postnet_output = postnet_output[0].cpu()
else:
postnet_output = postnet_output[0]
postnet_output = postnet_output.numpy()
wav = inv_spectrogram(postnet_output, self.ap, self.tts_config)
Thanks @SanjaESC. This leads to the following error:
can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
I played around with adding .cpu(), etc. (found some tips while googling that error), without success. Would it help to try another vocoder than GL?
np.save("/tmp/" + spec_filename + ".npy", postnet_output[0].T.cpu())
should also work, I guess.
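For reference, a self-contained sketch of the pattern (the tensor here is a zero-filled stand-in for postnet_output[0], and the file path is made up): transpose first, then move the tensor to host memory before numpy ever touches it, which avoids the "can't convert cuda:0 device type tensor" error.

```python
import os
import tempfile

import numpy as np
import torch

# stand-in for postnet_output[0]: a (frames, n_mels) tensor, possibly on the GPU
postnet = torch.zeros(112, 80)

# transpose to (n_mels, frames) and copy to the CPU before saving
arr = postnet.T.cpu().numpy()

path = os.path.join(tempfile.gettempdir(), "spec-example.npy")
np.save(path, arr)
print(arr.shape)  # (80, 112)
```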
Should i continue my questions/discussions on WaveGrad training progress on "thorsten" dataset in Mozilla discourse to not overtax this issue thread?
I guess at this point it would be a good idea. ^^ You are welcome to write me a PM on Discourse.
One simple trick to run WaveGrad even faster:
- Generate a rough waveform with Griffin-Lim
- Give it to WaveGrad and use it instead of random noise as input.
It is able to generate way higher quality in 6 iterations compared to the normal run. And you can even generate good quality in 3 iterations.
Might the same be true if the output of, say, Multi-Band MelGAN were fed to WaveGrad?
I took code snippets from @SanjaESC and @domcross and put together this PR (https://github.com/mozilla/TTS/pull/580). Maybe it makes it easier for future beginners (like me) to work with the models' raw spectrograms. Thank you guys for your great support.
One simple trick to run WaveGrad even faster.
- Generate a rough waveform with Griffin-Lim
- Give it to WaveGrad and use it instead of random noise as input.
It is able to generate way higher quality in 6 iterations compared to the normal run. And you can even generate good quality in 3 iterations.
Does the model need to be retrained, since the input is no longer random noise but a waveform?
Retraining is better, but it should also work otherwise.
With kind permission from @SanjaESC I've sent a PR to improve robustness when synthesizing with WaveGrad: https://github.com/mozilla/TTS/pull/602
@nmstoker you've had the same issue I ran into too (https://github.com/mozilla/TTS/issues/518#issuecomment-727254570):
AttributeError: 'NoneType' object has no attribute 'to'
I have no idea if you still encounter this problem, but it should be addressed with this PR. Thanks @SanjaESC for your 🥇 first-class support and permission to pack your knowhow into a PR.
One simple trick to run WaveGrad even faster.
- Generate a rough waveform with Griffin-Lim
- Give it to WaveGrad and use it instead of random noise as input.
It is able to generate way higher quality in 6 iterations compared to the normal run. And you can even generate good quality in 3 iterations.
@erogol hello, I did the experiment the way you suggested, but I find that the waveform from Griffin-Lim is shorter than the noise, and the missing length is hop_length. How do you deal with the Griffin-Lim waveform?
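I am not sure how the original experiment handled this, but one plausible workaround is to zero-pad the Griffin-Lim waveform up to num_frames * hop_length so it matches the length of the noise tensor (a sketch with made-up sizes):

```python
import numpy as np

hop_length = 256
num_frames = 112                      # frames in the conditioning spectrogram
target_len = num_frames * hop_length  # length the noise tensor would have

# pretend the GL output came up one hop short, as described above
gl_wav = np.zeros(target_len - hop_length, dtype=np.float32)

if len(gl_wav) < target_len:
    # zero-pad at the end so it can replace the noise input directly
    gl_wav = np.pad(gl_wav, (0, target_len - len(gl_wav)))

print(len(gl_wav))  # 28672
```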
I am trying to use it with the model from https://discourse.mozilla.org/t/creating-a-github-page-for-hosting-community-trained-models/70889/11 and getting the following error.
(TTS) irfan@gini:~/PycharmProjects/TTS$ python TTS/bin/synthesize.py --text "Text for TTS" --model_path /home/irfan/Downloads/ek/ek1_model.pth.tar --config_path /home/irfan/Downloads/ek/model_config.json --out_path speech.wav --vocoder_path /home/irfan/Downloads/ek/ek1_vocoder.pth.tar --vocoder_config_path /home/irfan/Downloads/ek/vocoder_config.json --use_cuda True
Traceback (most recent call last):
File "TTS/bin/synthesize.py", line 13, in <module>
from TTS.tts.utils.generic_utils import setup_model
File "/home/irfan/PycharmProjects/TTS/TTS/tts/utils/generic_utils.py", line 7, in <module>
from TTS.utils.generic_utils import check_argument
File "/home/irfan/PycharmProjects/TTS/TTS/utils/generic_utils.py", line 6, in <module>
from contextlib import nullcontext
ImportError: cannot import name 'nullcontext'
How can I fix this?
I'm using Python 3.6.9 on Ubuntu 18
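contextlib.nullcontext only exists from Python 3.7 onward, so the import fails on 3.6.9. Upgrading Python is the clean fix; otherwise a minimal backport (shown as a sketch, not part of the repo) can stand in:

```python
from contextlib import contextmanager

@contextmanager
def nullcontext(enter_result=None):
    """Minimal backport of contextlib.nullcontext (added in Python 3.7)."""
    yield enter_result

# usage: behaves like a no-op context manager
with nullcontext(42) as value:
    print(value)  # 42
```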
retrain is better but it should also work otherwise.
This is a great idea. I'm sorry, I am not really experienced with these diffusion models, so I would like to ask for advice: how exactly would you train using Griffin-Lim as the noise input?
Would compute_y_n(self, y_0) take an additional noise parameter carrying the Griffin-Lim waveform, instead of using noise = torch.randn_like(y_0)? Would this be sufficient? If so, how should the inference method work? Will z = torch.randn_like(y_n) always be the Griffin-Lim waveform? I am a bit lost.
Thanks for your help.
It is the latter. Basically, you don't start off with random noise but with the GL output to diffuse.
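As a hypothetical sketch of that idea (model_step here is a stand-in for whatever performs a single reverse-diffusion step, not a real WaveGrad API, and the 256 hop is assumed):

```python
import torch

def infer(model_step, spec, num_steps, init=None):
    """Run the reverse diffusion, starting from `init` (e.g. a Griffin-Lim
    waveform) instead of pure Gaussian noise when `init` is given.
    `model_step(y, spec, t)` is a hypothetical single denoising step."""
    # default: classic start from random noise, one sample per spectrogram hop
    y = init if init is not None else torch.randn(spec.shape[-1] * 256)
    for t in reversed(range(num_steps)):
        y = model_step(y, spec, t)
    return y
```

The only change versus the normal run is the initial value of y; the loop itself stays the same.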
BTW I don't maintain this code anymore. Come to :frog: TTS
This is not an issue and is more of a discussion. I read the WaveGrad paper today (which may be found here) and listened to the samples here, which sound very good. There seems to be an open-source implementation already here, with great progress. Has anyone read the paper or used this implementation?