Closed kavita-gsphk closed 4 weeks ago
I've encountered the same problem with our own dataset. In my view, the loss sticking at 0.69 is strongly correlated with the dataset: the training loss of wav2lip 288 or 384 can converge on the LRS2 dataset. I got a final loss of about 0.32 at step 130856, but as you know LRS2 is a low-resolution dataset, so the result is not good.
As the author mentioned, the dataset needs to be processed carefully in the following steps:
You may need to process your dataset carefully. But the SyncNet loss is really hard to get to converge; the paper "SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant Learning" gives an explanation of why SyncNet is unstable at high resolution. Hope this can help you.
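A quick sanity check on that 0.69 plateau: with a binary cross-entropy objective, a model that has learned nothing predicts chance (p = 0.5) for every audio-video pair, which scores exactly ln 2 ≈ 0.693. So a flat 0.69 means SyncNet is still at random guessing, not that it has converged:

```python
import math

# BCE loss on a positive pair when the model predicts chance (p = 0.5):
# loss = -ln(p) = ln(2), the plateau value reported in this thread.
p = 0.5
bce = -math.log(p)
print(round(bce, 2))  # → 0.69
```

Any value meaningfully below 0.69 therefore indicates the network has started to discriminate in-sync from out-of-sync pairs.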
Thank you so much for the response. I will look into it.
By the way, AVSpeech may be quite dirty. When you process the data, you need to carefully handle clips that contain more than one person or where the audio is not from the visible speaker (you need the ground truth to actually be true, or it's nonsense).
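A minimal sketch of that kind of clip filter. The function name, inputs, and threshold below are illustrative assumptions, not from any repo: per-frame face counts would come from your face detector, and the sync confidence from a pretrained SyncNet-style scorer.

```python
def keep_clip(faces_per_frame, sync_confidence, min_conf=3.0):
    """Decide whether an AVSpeech clip is clean enough to train on.

    faces_per_frame: list of detected-face counts, one per video frame
                     (hypothetical detector output).
    sync_confidence: audio-visual sync score from a pretrained scorer
                     (hypothetical; higher means the audible speaker is
                     more likely the visible one).
    """
    # Reject any clip where a frame has zero faces or more than one
    # person on screen: the ground-truth pairing becomes ambiguous.
    if any(n != 1 for n in faces_per_frame):
        return False
    # Reject clips where the audio may not belong to the visible speaker.
    return sync_confidence >= min_conf
```

The exact threshold is a judgment call; the point is that both checks are needed before treating a clip's (audio, face) pair as a true positive.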
My loss has finally started decreasing from 0.69, although it's progressing very slowly. Thank you for all your help.
I noticed you attempted to implement the SIDGAN paper's approach in your repository. Were you able to achieve any results with that?
Hi, happy to hear the loss is decreasing. As for SIDGAN, the implementation of the core part, APS, is in the repo, but the orientation of the filter is limited to the vertical direction, just as the paper mentions. So you would need to implement a horizontal version of the APS filter, or you can get in touch with the authors for the details. I ultimately failed to add this part to wav2lip. Sorry I'm unable to provide more effective help.
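For reference, here is a rough pure-Python sketch of the idea behind adaptive polyphase sampling (APS), including the horizontal extension discussed above. This is my own simplified reading of shift-invariant stride-2 downsampling (pick the phase with the larger energy instead of always keeping even indices), not the repo's actual implementation:

```python
def l2(rows):
    """l2 norm of a 2-D list of numbers."""
    return sum(v * v for row in rows for v in row) ** 0.5

def aps_rows(x):
    """Stride-2 row downsampling that keeps the row phase (even-indexed
    or odd-indexed rows) with the larger l2 norm. A 1-pixel vertical
    shift of the input then selects the same samples, which is the
    shift-invariance property APS provides."""
    even, odd = x[0::2], x[1::2]
    return even if l2(even) >= l2(odd) else odd

def transpose(x):
    return [list(col) for col in zip(*x)]

def aps_downsample_2d(x):
    # Vertical phase selection, then the "horizontal version" obtained
    # by transposing, reusing the row filter, and transposing back.
    return transpose(aps_rows(transpose(aps_rows(x))))
```

With a plain stride-2 slice, a single bright pixel at an odd index would vanish after a 1-pixel shift; here the norm comparison picks whichever phase carries the energy, so the sample survives the shift.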
It seems like you have trained a modified version of wav2lip_288*288. It would be a great help if you could help me with the problem below.
I am training SyncNet on the AVSpeech dataset with train_syncnet_sam.py from the above-mentioned repo. My training loss is stuck at 0.69 even after 500k steps. The learning rate and batch size are 5e-5 and 64, respectively. I have tried different learning rates, but it didn't work. How can I solve this problem, based on your experience? For preprocessing, I followed all the steps suggested here except the video-split part. My videos' average length is 7.1 s (videos are in the range 0-15 s) and the total length of the training dataset is roughly 30.5 hr.
Thank you so much!