primepake / wav2lip_288x288


Use Chinese dataset to train expert lip-sync discriminator #81

Closed Amber-Believe closed 7 months ago

Amber-Believe commented 7 months ago

I am using a Chinese dataset to train the expert lip-sync discriminator, and the training loss stays at 0.69. Have you run into this situation? How can it be resolved?
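
A training loss parked at 0.69 is no coincidence: 0.69 ≈ ln 2, which is the binary cross-entropy of a discriminator that outputs 0.5 for every in-sync/off-sync pair, i.e. chance level. A quick sanity check in plain Python:

```python
import math

# BCE when the model predicts p = 0.5 regardless of the label y:
#   -[y*ln(0.5) + (1 - y)*ln(0.5)] = -ln(0.5) = ln 2
print(math.log(2))  # 0.6931..., the value the loss is stuck at
```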

ghost commented 7 months ago

did you filter your data?

Amber-Believe commented 7 months ago

did you filter your data?

What do you mean by filtering? So far I have processed the data to 25 fps video, with audio at a 16 kHz sampling rate.
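
For reference, that conversion is typically done with ffmpeg; here is a minimal sketch using subprocess (the folder names and container defaults are assumptions, not this repo's official preprocessing):

```python
import subprocess
from pathlib import Path

SRC = Path("raw_videos")    # assumed input folder
DST = Path("videos_25fps")  # assumed output folder
DST.mkdir(exist_ok=True)

for video in SRC.glob("*.mp4"):
    # Re-encode to a constant 25 fps and resample audio to 16 kHz mono.
    subprocess.run([
        "ffmpeg", "-y", "-i", str(video),
        "-r", "25",      # output frame rate
        "-ar", "16000",  # audio sample rate
        "-ac", "1",      # mono
        str(DST / video.name),
    ], check=True)
```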

Amber-Believe commented 7 months ago

did you filter your data?

The current data is Chinese, covering 50 different people and 7,700 videos; the English data contains 40 people and 5,500 videos. Is the problem the small amount of data, or something else? Our data has not been processed with syncnet_python. We also found that the syncnet_v2.model checkpoint is not available; can you provide it? That's where things stand right now.

einsqing commented 7 months ago

did you filter your data?

The current data is Chinese, covering 50 different people and 7,700 videos; the English data contains 40 people and 5,500 videos. Is the problem the small amount of data, or something else? Our data has not been processed with syncnet_python. We also found that the syncnet_v2.model checkpoint is not available; can you provide it? That's where things stand right now.

The syncnet_v2 model is an English model; it does not work for Chinese. You need to train a Chinese one.
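
On filtering: the usual recipe is to run every clip through syncnet_python (https://github.com/joonson/syncnet_python) and drop clips with a non-zero audio-video offset or low sync confidence. A rough sketch of that loop; the script names come from that repo, but the confidence threshold and the output parsing are assumptions you would need to adapt:

```python
import subprocess
from pathlib import Path

MIN_CONFIDENCE = 5.0  # assumed cutoff; tune it on your own data

def sync_confidence(videofile: Path) -> float:
    ref = videofile.stem
    # Both scripts are from joonson/syncnet_python, run once per clip.
    subprocess.run(["python", "run_pipeline.py",
                    "--videofile", str(videofile), "--reference", ref],
                   check=True)
    out = subprocess.run(["python", "run_syncnet.py",
                          "--videofile", str(videofile), "--reference", ref],
                         check=True, capture_output=True, text=True)
    # Assumed output format: a line such as "Confidence: 7.123".
    for line in out.stdout.splitlines():
        if "Confidence" in line:
            return float(line.split()[-1])
    return 0.0

videos = sorted(Path("videos_25fps").glob("*.mp4"))
kept = [v for v in videos if sync_confidence(v) >= MIN_CONFIDENCE]
print(f"kept {len(kept)} / {len(videos)} clips")
```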

ghost commented 7 months ago

How about your config? lr, bs?

Amber-Believe commented 7 months ago

How about your config? lr, bs?

All settings are the defaults, e.g. lr=1e-3 when training the expert lip-sync discriminator (hparams.txt).

ghost commented 7 months ago

1e-3 is too large, you can choose 1e-4 or 1e-5

Amber-Believe commented 7 months ago

1e-3 is too large, you can choose 1e-4 or 1e-5

Okay, I'll try.

Amber-Believe commented 7 months ago

1e-3 is too large, you can choose 1e-4 or 1e-5

Thank you very much for the advice. After adjusting the lr to 1e-5, the loss started to decrease. What is an appropriate learning rate for wav2lip training? 1e-4?

ghost commented 7 months ago

1e-4 is good
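
Pulling the thread's advice together: in the upstream Wav2Lip codebase both learning rates live in hparams.py. Assuming this fork keeps the same layout (the field names below come from upstream Wav2Lip and are an assumption here), the settings would look like:

```python
# hparams.py excerpt (field names from upstream Wav2Lip, assumed unchanged here)
hparams = HParams(
    # ... other fields ...
    syncnet_lr=1e-5,             # expert lip-sync discriminator; 1e-3 stalled at 0.69
    initial_learning_rate=1e-4,  # wav2lip generator training, per the advice above
)
```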

Amber-Believe commented 7 months ago

Thank you!

Nyquist0 commented 6 months ago

Hi @Amber-Believe. May I ask what dataset are you using? Is it CMLR? LRS-1000? Or a private one?

ChengsongLu commented 6 months ago

Hi @Amber-Believe. May I ask what dataset are you using? Is it CMLR? LRS-1000? Or a private one?

@Amber-Believe BTW, is the eval loss below 0.3 after changing the lr from 1e-3 to 1e-5? And did you do anything else to achieve that?

Nyquist0 commented 6 months ago

Hi @primepake I was re-directed to this page from https://github.com/primepake/wav2lip_288x288/issues/97

But I still have not figured out my question. The dataset I am using is LRS2, because the official wav2lip algorithm uses it for training, so I am assuming it is already filtered. I also randomly checked some audio and video files from the dataset: the wav files have a 16 kHz sample rate and the video files are 25 fps.

I would also like to ask when you expect the syncnet training to converge. Will it stay stuck at 0.69 for a long time? (110k steps for me currently.)

Looking forward to your reply. Thanks.
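
A quick way to verify the whole dataset rather than a random sample is to probe every file with ffprobe (the flags are standard; the folder layout is an assumption):

```python
import subprocess
from pathlib import Path

def probe(path: Path, stream: str, entry: str) -> str:
    # ffprobe prints only the requested field thanks to the -of options.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", stream,
         "-show_entries", f"stream={entry}",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        check=True, capture_output=True, text=True)
    return out.stdout.strip()

root = Path("LRS2")  # assumed dataset root
for video in sorted(root.glob("**/*.mp4")):
    fps = probe(video, "v:0", "r_frame_rate")  # expect "25/1"
    if fps != "25/1":
        print("unexpected fps:", video, fps)
for wav in sorted(root.glob("**/*.wav")):
    sr = probe(wav, "a:0", "sample_rate")      # expect "16000"
    if sr != "16000":
        print("unexpected sample rate:", wav, sr)
```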

ghost commented 6 months ago

Again, how about your lr? bs? number of GPUs?

Nyquist0 commented 6 months ago

LR is 1e-4, BS is 64, on 1 RTX A6000 GPU.

Nyquist0 commented 6 months ago

@primepake Greetings! I am trying to follow the pipeline you proposed here.

May I ask how you pre-processed the video data? Are you using the preprocessing code from the official wav2lip repo?
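
For reference, the upstream Wav2Lip README drives preprocessing through its preprocess.py script; assuming this fork follows the same convention, the invocation would be roughly (the paths are placeholders):

```python
import subprocess

# Upstream Wav2Lip preprocessing entry point (per its README), assumed to
# apply to this fork as well; writes face crops and audio per clip.
subprocess.run([
    "python", "preprocess.py",
    "--data_root", "data_root/main",
    "--preprocessed_root", "lrs2_preprocessed/",
], check=True)
```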

MarwanAj commented 6 months ago

did you filter your data?

The current data is Chinese, covering 50 different people and 7,700 videos; the English data contains 40 people and 5,500 videos. Is the problem the small amount of data, or something else? Our data has not been processed with syncnet_python. We also found that the syncnet_v2.model checkpoint is not available; can you provide it? That's where things stand right now.

What is the average length of the videos in these datasets?