lililuya / Wav2Lip288


Need Help #3

Closed · kavita-gsphk closed this issue 4 weeks ago

kavita-gsphk commented 5 months ago

It seems you have trained a modified 288×288 version of Wav2Lip. It would be a great help if you could help me with the problem below.

I am training SyncNet on the AVSpeech dataset with train_syncnet_sam.py from the above-mentioned repo. My training loss is stuck at 0.69 even after 500k steps. The learning rate and batch size are 5e-5 and 64, respectively.

[Screenshot: training loss curve, 2024-05-13]

I have tried different learning rates, but it didn't help. How can I solve this problem, based on your experience?

For preprocessing, I followed all the steps suggested here except the video-split part. My videos' average length is 7.1 s (they range from 0 to 15 s), and the total length of the training dataset is roughly 30.5 hours.
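
(For reference, the split step I skipped would be something along these lines; the segment length and folder layout below are just placeholders, not the repo's actual script.)

```python
import subprocess
from pathlib import Path

SEGMENT_SECONDS = 5          # placeholder segment length
SRC = Path("raw_videos")     # hypothetical input folder
DST = Path("split_videos")   # hypothetical output folder
DST.mkdir(exist_ok=True)

for video in SRC.glob("*.mp4"):
    out_pattern = DST / f"{video.stem}_%03d.mp4"
    # ffmpeg's segment muxer cuts the file into fixed-length pieces; with
    # stream copy the cuts snap to the nearest keyframe, which is fine for
    # building short training clips.
    subprocess.run([
        "ffmpeg", "-y", "-i", str(video),
        "-c", "copy", "-f", "segment",
        "-segment_time", str(SEGMENT_SECONDS),
        "-reset_timestamps", "1",
        str(out_pattern),
    ], check=True)
```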

Thank you so much!

lililuya commented 5 months ago

I've encountered the same problem on my own dataset. In my view, the loss sticking at 0.69 is strongly correlated with the dataset; the SyncNet loss for wav2lip 288 or 384 can converge on the LRS2 dataset. I got a final loss of about 0.32 at step 130856, but as you know LRS2 is a low-resolution dataset, so the result is not good. As the author mentioned, the dataset needs to be carefully processed.

You may need to process your dataset carefully. Even then, the SyncNet loss is genuinely hard to get to converge; the paper "SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant Learning" gives an explanation of why SyncNet is unstable at high resolution. Hope this helps.
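
For context on the 0.69 number: the SyncNet in the Wav2Lip family is trained with binary cross-entropy on the cosine similarity of the audio and face embeddings, and a model that cannot tell in-sync from out-of-sync pairs settles at a constant prediction of 0.5, i.e. a loss of ln 2 ≈ 0.693. A minimal PyTorch sketch of that loss (variable names are illustrative; the clamp only keeps the sketch self-contained, since with ReLU-activated encoders the similarity is already non-negative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCELoss()

def syncnet_bce_loss(audio_emb, video_emb, y):
    """BCE on the cosine similarity of audio and face embeddings.

    audio_emb, video_emb: (B, D) embeddings; y: (B, 1) with 1.0 for in-sync
    pairs and 0.0 for deliberately mis-paired (out-of-sync) samples.
    """
    d = F.cosine_similarity(audio_emb, video_emb)      # (B,) similarity scores
    return logloss(d.clamp(min=0.0).unsqueeze(1), y)

# A model that has learned nothing predicts ~0.5 for every pair, and
# BCE at a constant 0.5 is ln 2 regardless of the labels:
d = torch.full((64, 1), 0.5)
y = torch.randint(0, 2, (64, 1)).float()
print(logloss(d, y))   # tensor(0.6931) -- the 0.69 plateau
```

So a flat 0.69 means the two towers are not learning anything discriminative yet, which usually points at the data rather than the optimizer.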

kavita-gsphk commented 5 months ago

Thank you so much for the response. I will look into it.

lililuya commented 5 months ago

> Thank you so much for the response. I will look into it.

By the way, AVSpeech may be quite dirty. When you process the data, you need to carefully handle clips that contain more than one person or where the audio does not come from the visible speaker (the ground-truth pairs need to be truly in sync, or training is meaningless).

  1. Because of the fully convolutional structure and the GAN-based method, such dirty data has a huge negative impact on training SyncNet. Maybe you can train first on a cleaner dataset like LRS2 or VoxCeleb2.
  2. You can also check for the cases mentioned above: just randomly sample some videos and feed them into syncnet_python to get the confidence and AV offset (see the sketch after this list).
  3. I've searched for some methods to correct the offset; I lost the link, so I paste a screenshot here instead. [image]
  4. I used AVSpeech to train SyncNet before, so I have some records; please don't mind if they are useless. [image]
  5. Finally, you can also refer to the issues in the Wav2Lip 288 or 384 repos for answers.
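
On points 2 and 3, here is a rough sketch of turning the confidence/offset check into an automatic filter-and-fix pass. It assumes you have already run syncnet_python on each clip and dumped its reported AV offset (in frames) and confidence into a CSV; the file layout, column names, thresholds, and the sign convention of the offset are all placeholders to adapt to your own pipeline:

```python
import csv
import subprocess
from pathlib import Path

FPS = 25           # Wav2Lip-style pipelines usually resample video to 25 fps
MIN_CONF = 5.0     # placeholder confidence threshold; tune on your own data
MAX_OFFSET = 5     # frames; larger offsets usually mean the wrong speaker's audio

def shift_audio(video_in: Path, offset_frames: int, video_out: Path) -> None:
    """Re-mux a clip with the audio shifted by the detected AV offset.

    The sign convention is an assumption: here a positive offset delays the
    audio by offset/FPS seconds; flip it if your numbers come out the other way.
    """
    delay = offset_frames / FPS
    subprocess.run([
        "ffmpeg", "-y",
        "-i", str(video_in),                                 # input 0: video as-is
        "-itsoffset", f"{delay:.3f}", "-i", str(video_in),   # input 1: timestamps shifted
        "-map", "0:v", "-map", "1:a",
        "-c:v", "copy", "-c:a", "aac",
        str(video_out),
    ], check=True)

# results.csv is assumed to hold one row per clip: path, av_offset (frames), confidence.
kept, resynced, dropped = [], [], []
with open("results.csv") as f:
    for row in csv.DictReader(f):
        clip = Path(row["path"])
        conf = float(row["confidence"])
        offset = int(row["av_offset"])
        if conf < MIN_CONF or abs(offset) > MAX_OFFSET:
            dropped.append(clip)        # likely multi-speaker or off-screen audio
        elif offset != 0:
            shift_audio(clip, offset, clip.with_name(clip.stem + "_synced.mp4"))
            resynced.append(clip)
        else:
            kept.append(clip)

print(f"kept {len(kept)}, re-synced {len(resynced)}, dropped {len(dropped)}")
```

Dropping low-confidence clips outright is usually safer than trying to re-sync them, since low confidence often just means the audio belongs to someone off-screen.
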
kavita-gsphk commented 5 months ago

My loss has finally started decreasing from 0.69, although it's progressing very slowly. Thank you for all your help.

I noticed you attempted to implement the shift-invariant learning from the SIDGAN paper in your repository. Were you able to achieve any results with it?

lililuya commented 5 months ago

Hi, happy to hear the loss is decreasing. As for SIDGAN, the implementation of the core part, APS, is in the repo, but the orientation of the filter is limited to the vertical direction, just as the paper mentions: [image: figure from the paper]. So you would need to implement a horizontal version of the APS filter, or get in touch with the authors for the details. I ultimately failed to integrate this part into Wav2Lip. Sorry I can't provide more effective help.
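
For anyone who finds this later: APS (adaptive polyphase sampling) replaces fixed strided downsampling with keeping the polyphase component that has the largest norm, and the same idea can be parameterised by axis, so a horizontal variant is mostly a matter of which dimension you subsample. A rough sketch of the idea, not the code in this repo; it also picks one shift for the whole batch for brevity, whereas a faithful implementation selects per sample:

```python
import torch

def aps_downsample_1d(x: torch.Tensor, dim: int, stride: int = 2, p: int = 2) -> torch.Tensor:
    """Adaptive polyphase downsampling along one axis.

    Instead of always keeping indices 0, 2, 4, ..., keep the polyphase
    component (grid shift) with the largest l-p norm, which is what makes
    the downsampling invariant to shifts along that axis.
    x: (B, C, H, W); dim=2 gives the vertical filter, dim=3 a horizontal one.
    """
    components = [
        x.index_select(dim, torch.arange(s, x.size(dim), stride, device=x.device))
        for s in range(stride)
    ]
    # Keep the shift whose component carries the most energy.
    norms = torch.stack([c.norm(p=p) for c in components])
    return components[int(norms.argmax())]

x = torch.randn(1, 3, 8, 8)
print(aps_downsample_1d(x, dim=2).shape)  # torch.Size([1, 3, 4, 8]) -- rows halved
print(aps_downsample_1d(x, dim=3).shape)  # torch.Size([1, 3, 8, 4]) -- columns halved
```

A full 2D version would typically consider all stride×stride polyphase components jointly rather than chaining two 1D passes, but the 1D form above matches the vertical-only filter described in the comment.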