FlashlightET opened this issue 3 years ago
The issue here is due to the model/synthesis quality. A typical alignment graph should look more like this.
If you cannot see a clear line forming, then the alignment is so poor that it won't be able to produce audible results. Please post a message in the #help-wanted channel of our Discord if you'd like assistance identifying why the model/synthesis quality is poor.
To follow up on this issue, here are some common reasons why your voice may be poor: https://benaandrew.github.io/Voice-Cloning-App/training/#verifying-quality
Thank you for developing this application, and sorry for the bother--I posted my question on Discord but did not get any responses. I ran into the same issue: the alignment graphs look fine for both the training and synthesis steps, but the generated audio is silent. I'm using the local .exe for the project, and every step was done on my computer locally.
According to your instructions on verifying the quality, I have:
Also, I'm pretty sure my training has run for at least 1000 epochs. For the vocoder, I tried both the one provided on your documentation page and LJ_FT_T2_V3 from the HiFi-GAN GitHub page, but both gave silent audio despite a normal-looking alignment graph.
Could you suggest what the issue might be here? Really appreciate it!
What GPU are you using? I remember some GPUs had weird issues with other ML software.
I have 2 GPUs on my computer--one is Intel(R) Iris(R) Plus Graphics and the other is an NVIDIA GeForce GTX 1660 Ti with Max-Q Design. I followed your video tutorial and installed the driver for the NVIDIA GPU, so I always assumed the NVIDIA GPU is the one the application is using.
I think it's an issue with the 1660/1660 Ti: https://github.com/pytorch/pytorch/issues/58123. You could try another version of cuDNN, or try running the app from source to see if removing .half() fixes it, as described here: https://github.com/NVIDIA/tacotron2/issues/475
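To illustrate why the .half() calls matter: .half() casts tensors to float16, whose representable range tops out around 65504, so large intermediate values overflow to inf (and the linked reports suggest the 1660/1660 Ti's fp16 kernels can additionally produce NaN/garbage). A minimal sketch of the overflow behavior, using NumPy's float16 as a stand-in for PyTorch's half precision:

```python
import numpy as np

# float16 (what .half() casts to) maxes out around 65504;
# larger magnitudes overflow to inf, which downstream ops turn
# into NaN -- and NaN samples come out as silent/empty audio.
x = np.float32([1000.0, 70000.0])

print(x.astype(np.float16))  # 70000 overflows to inf in fp16
print(x.astype(np.float32))  # staying in fp32 keeps the value intact
```

This is why keeping inference in float32 (i.e. removing the .half() casts from the inference path) can sidestep the problem, at the cost of some speed and GPU memory.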
Thanks for the suggestions! I will definitely give it a shot. Does this mean my previous trained model is completely useless now and I should start fresh after making the changes as you suggested?
I think the model itself should be fine, since only inference uses .half() and it still produces a normal graph.
I'm getting silence with a normal graph on a 1660 Ti too.
Neither the app nor Colab produces audio, but both produce graphs. File from the program: https://cdn.discordapp.com/attachments/879773649714962553/884898833803399188/out.wav File from Colab (Michael Rosen model): https://cdn.discordapp.com/attachments/879773649714962553/884898840522670140/download.wav The Colab does seem to be generating actual audio, though, since it shows a spectrogram; it just doesn't want to write it out as a wav file?
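If the notebook really is producing a waveform but failing at the file-writing step, the write can be done with Python's standard-library wave module alone. A minimal sketch, assuming the vocoder's output is a sequence of floats in [-1, 1] (here faked with a sine tone, since I don't have the actual model output; the filename "out.wav" is just an example):

```python
import math
import struct
import wave

sample_rate = 22050  # common Tacotron2/HiFi-GAN output rate

# Stand-in for the vocoder output: 1 second of a 440 Hz tone.
audio = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate)
         for t in range(sample_rate)]

with wave.open("out.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 16-bit PCM
    f.setframerate(sample_rate)
    # Clamp each float sample and scale it to int16 before packing.
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in audio
    )
    f.writeframes(pcm)
```

If the resulting file is all zeros even with a known-good waveform fed in, the problem is upstream of the writer (e.g. the NaN/fp16 issue discussed above) rather than in the wav-saving code.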