primepake / wav2lip_288x288

MIT License

Overfitting when training syncnet #98

Open ChengsongLu opened 6 months ago

ChengsongLu commented 6 months ago

Thanks for sharing such a great project!

I'm having an issue training syncnet with my own dataset.

[image: loss_epoch]

As shown above, the network suffers from severe overfitting. Have you encountered a similar situation? If so, how did you resolve it?

Thanks!

ChengsongLu commented 6 months ago

FYI.

I am using bs=128, lr=1e-4, and around 20K clipped videos (3 s each) with around 200 different speakers.

I have read in some issues in this repo that sync_correlate could help, but I don't think it will in this situation. If the problem were the dataset itself, the training loss should not be going down either, right?

ghost commented 6 months ago

This is a weird issue. Can you check your val dataset?

ChengsongLu commented 6 months ago

My val set is just a random split of the whole dataset. It contains faces that haven't appeared in the training set.

ghost commented 6 months ago

I think you should carefully check your val dataset; maybe it is out of sync, or the distribution of the val data is very different from the train data.

ChengsongLu commented 6 months ago

To reflect the generalization of the model, shouldn't the distribution of the val data be different from the train data? And I think that's what my current split does, because the val data uses faces that don't appear at all in the train set.

BTW, when you divide the data into training and validation sets, do the same faces appear in both sets?
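
For context, by "different faces" I mean a speaker-disjoint split. A minimal sketch of such a split, assuming each clip is labeled with a speaker ID (the labeling scheme is an assumption, not this repo's data format):

```python
import random
from collections import defaultdict

def speaker_disjoint_split(clips, val_ratio=0.1, seed=42):
    """Split (clip_path, speaker_id) pairs so no speaker is in both sets."""
    by_speaker = defaultdict(list)
    for path, speaker in clips:
        by_speaker[speaker].append(path)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_val = max(1, int(len(speakers) * val_ratio))
    val = [p for s in speakers[:n_val] for p in by_speaker[s]]
    train = [p for s in speakers[n_val:] for p in by_speaker[s]]
    return train, val
```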

ghost commented 6 months ago

I've trained on a big dataset, around 60 TB, and sharing no faces between splits was not a problem there; it depends on how big your data is. You have to balance your data. If the dataset has few faces, the model can only learn those faces, and splitting it by face leaves it nothing to generalize from. To conclude, if you really want your model to be good and general, you must have a large enough amount of data with diverse faces, or at least something similar to LRS2.

ChengsongLu commented 6 months ago

Thanks for providing the information. I think the main reason is that the number of faces in my dataset is too low; 200 faces may not be enough to train a general model.

Can you share some of your dataset stats? For example, total duration, number of videos after clipping, etc.

Thanks a lot.

ChengsongLu commented 6 months ago

Did you do the sync correction on the whole 60TB dataset? And did you do that before or after video clipping?

Nyquist0 commented 6 months ago

I met the same problem. The val dataset is a random split from the whole dataset, so it should be in the same domain as the training dataset. But unlike your case, my val loss is decreasing, though very slowly compared with the training loss.

Currently, the training loss is around 0.36 while the val loss is around 0.56, after ~500k steps. I am using the CMLR dataset for training.

I am considering adding some dropout to avoid the overfitting, but training is really slow and it takes days to see the results.

ChengsongLu commented 6 months ago

[image: train_sync]

I split the data into 3 subsets: the valid set is in the same domain as the train set, and the test set is out of domain (i.e., it contains faces that never appear in the train or valid sets).

In the last epoch, train loss = 0.22340, val loss = 0.28611, test loss = 1.53218

Nyquist0 commented 6 months ago

Looking at your training progress, I think that is expected. It comes down to the domain gap between the training and evaluation datasets: the bigger the gap, the bigger the difference between their loss values.

ChengsongLu commented 6 months ago

Yep, I have too few faces to make the model general.

Nyquist0 commented 6 months ago

May I ask what dataset you are using? Is it a private dataset?

BTW, I found an oversight in my training: the eval dataset is not a random split from the training data, so there is still some domain gap between the train and eval sets. I think both of us should consider the following (a small sketch of point 3 follows the list):

  1. Increasing the dataset size (diversity).
  2. Or decreasing the network size.
  3. Using some other anti-overfitting methods.
  4. Using a pretrained model to reduce the difficulty of training the network.
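
For point 3, a minimal PyTorch sketch of two common anti-overfitting knobs: dropout in the model and weight decay in the optimizer. The TinySyncHead module below is a stand-in for illustration, not this repo's SyncNet:

```python
import torch
import torch.nn as nn

class TinySyncHead(nn.Module):
    """Illustrative embedding head with dropout; not the repo's SyncNet."""
    def __init__(self, in_dim=512, emb_dim=256, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, emb_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),  # randomly zeros activations during training
            nn.Linear(emb_dim, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

model = TinySyncHead()
# AdamW applies decoupled L2 weight decay, a standard regularizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
```
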
Nyquist0 commented 6 months ago

I would like to ask another question. @primepake

Although the training loss is decreasing, the out_of_sync_distance does not seem very large. Is this expected?

2024-01-08 02:15:09,310 - train - INFO - Step 482945 | out_of_sync_distance: 0.44087046 | Loss: 0.36827247 | Elapsed: 0.39497
2024-01-08 02:15:09,906 - train - INFO - Step 482946 | out_of_sync_distance: 0.45174262 | Loss: 0.36818710 | Elapsed: 0.41893
2024-01-08 02:15:10,527 - train - INFO - Step 482947 | out_of_sync_distance: 0.49066639 | Loss: 0.36837881 | Elapsed: 0.44272
2024-01-08 02:15:11,140 - train - INFO - Step 482948 | out_of_sync_distance: 0.53464395 | Loss: 0.36847549 | Elapsed: 0.41779
2024-01-08 02:15:11,728 - train - INFO - Step 482949 | out_of_sync_distance: 0.50022531 | Loss: 0.36851910 | Elapsed: 0.41053
2024-01-08 02:15:12,313 - train - INFO - Step 482950 | out_of_sync_distance: 0.49324411 | Loss: 0.36885640 | Elapsed: 0.40762
2024-01-08 02:15:12,937 - train - INFO - Step 482951 | out_of_sync_distance: 0.51483035 | Loss: 0.36897288 | Elapsed: 0.42812
2024-01-08 02:15:13,590 - train - INFO - Step 482952 | out_of_sync_distance: 0.55414402 | Loss: 0.36898995 | Elapsed: 0.47068
2024-01-08 02:15:14,288 - train - INFO - Step 482953 | out_of_sync_distance: 0.49771458 | Loss: 0.36888830 | Elapsed: 0.48726
2024-01-08 02:15:14,880 - train - INFO - Step 482954 | out_of_sync_distance: 0.47983459 | Loss: 0.36886459 | Elapsed: 0.45544
2024-01-08 02:15:15,488 - train - INFO - Step 482955 | out_of_sync_distance: 0.44912469 | Loss: 0.36901807 | Elapsed: 0.46548
2024-01-08 02:15:16,288 - train - INFO - Step 482956 | out_of_sync_distance: 0.54473352 | Loss: 0.36891696 | Elapsed: 0.65906
2024-01-08 02:15:16,839 - train - INFO - Step 482957 | out_of_sync_distance: 0.54054236 | Loss: 0.36872990 | Elapsed: 0.41629
2024-01-08 02:15:17,423 - train - INFO - Step 482958 | out_of_sync_distance: 0.49879405 | Loss: 0.36882858 | Elapsed: 0.40410
2024-01-08 02:15:18,001 - train - INFO - Step 482959 | out_of_sync_distance: 0.43533319 | Loss: 0.36882154 | Elapsed: 0.39902
2024-01-08 02:15:18,630 - train - INFO - Step 482960 | out_of_sync_distance: 0.44490403 | Loss: 0.36876711 | Elapsed: 0.43631
2024-01-08 02:15:19,206 - train - INFO - Step 482961 | out_of_sync_distance: 0.46191770 | Loss: 0.36872503 | Elapsed: 0.38236
2024-01-08 02:15:19,787 - train - INFO - Step 482962 | out_of_sync_distance: 0.46926844 | Loss: 0.36858617 | Elapsed: 0.40335
2024-01-08 02:15:20,390 - train - INFO - Step 482963 | out_of_sync_distance: 0.55271339 | Loss: 0.36879719 | Elapsed: 0.45285
2024-01-08 02:15:20,892 - train - INFO - Step 482964 | out_of_sync_distance: 0.44827297 | Loss: 0.36875343 | Elapsed: 0.37925
2024-01-08 02:15:21,429 - train - INFO - Step 482965 | out_of_sync_distance: 0.55192006 | Loss: 0.36897230 | Elapsed: 0.40055
2024-01-08 02:15:22,027 - train - INFO - Step 482966 | out_of_sync_distance: 0.47178933 | Loss: 0.36908765 | Elapsed: 0.42047
2024-01-08 02:15:22,537 - train - INFO - Step 482967 | out_of_sync_distance: 0.48556980 | Loss: 0.36902181 | Elapsed: 0.33365
2024-01-08 02:15:23,144 - train - INFO - Step 482968 | out_of_sync_distance: 0.50138468 | Loss: 0.36907162 | Elapsed: 0.44610
2024-01-08 02:15:23,694 - train - INFO - Step 482969 | out_of_sync_distance: 0.47587526 | Loss: 0.36915298 | Elapsed: 0.38607
2024-01-08 02:15:24,289 - train - INFO - Step 482970 | out_of_sync_distance: 0.49232754 | Loss: 0.36918712 | Elapsed: 0.39554
2024-01-08 02:15:24,840 - train - INFO - Step 482971 | out_of_sync_distance: 0.53795362 | Loss: 0.36918593 | Elapsed: 0.41125
2024-01-08 02:15:25,403 - train - INFO - Step 482972 | out_of_sync_distance: 0.52411824 | Loss: 0.36933559 | Elapsed: 0.39844
2024-01-08 02:15:25,988 - train - INFO - Step 482973 | out_of_sync_distance: 0.52927971 | Loss: 0.36940204 | Elapsed: 0.40476
2024-01-08 02:15:26,526 - train - INFO - Step 482974 | out_of_sync_distance: 0.44926518 | Loss: 0.36911347 | Elapsed: 0.40759
2024-01-08 02:15:27,143 - train - INFO - Step 482975 | out_of_sync_distance: 0.45972806 | Loss: 0.36935876 | Elapsed: 0.44226
2024-01-08 02:15:27,717 - train - INFO - Step 482976 | out_of_sync_distance: 0.50330514 | Loss: 0.36934577 | Elapsed: 0.38913
2024-01-08 02:15:28,334 - train - INFO - Step 482977 | out_of_sync_distance: 0.49505723 | Loss: 0.36923314 | Elapsed: 0.42973
2024-01-08 02:15:28,888 - train - INFO - Step 482978 | out_of_sync_distance: 0.47780257 | Loss: 0.36924535 | Elapsed: 0.38769
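
For context on the numbers above: the original Wav2Lip syncnet trains with a BCE loss on the cosine similarity between audio and video embeddings, roughly as sketched below. How out_of_sync_distance is derived from that similarity in this repo is my assumption, so check the training script:

```python
import torch.nn as nn
import torch.nn.functional as F

bce = nn.BCELoss()

def cosine_bce_loss(audio_emb, video_emb, y):
    """y is 1.0 for in-sync pairs and 0.0 for off-sync pairs.

    In Wav2Lip the embeddings are non-negative (post-ReLU) and
    L2-normalized, so the cosine similarity lands in [0, 1] and can
    be fed to BCELoss directly.
    """
    d = F.cosine_similarity(audio_emb, video_emb)  # shape: (batch,)
    return bce(d.unsqueeze(1), y)
```
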
ChengsongLu commented 6 months ago

I am using a custom dataset containing around 20K clipped videos (3 s each) with around 200 different speakers.

  1. Increasing the training data is definitely helpful if we can.
  2. I have rewritten the network with only 5,591,616 parameters (1/3 of the original; see the parameter-count sketch after this list), but the result is similar.
  3. Neither Dropout nor WeightDecay helped much in my experience.
  4. Using a pretrained model may help, if you can find one (in Chinese).
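
For reference, the parameter count above can be reproduced with the standard PyTorch idiom (this helper is mine, not code from this repo):

```python
def count_parameters(model) -> int:
    """Total number of trainable parameters in a PyTorch nn.Module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```
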
Nyquist0 commented 6 months ago

I think you could use DeepSpeech or wav2vec for the audio encoder; there should be a Chinese pretrained model. As for the video encoder, I guess language is not that important. I am considering trying that myself.

So if you have some experimental results, we could discuss them here.
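
A minimal sketch of pulling features from a pretrained wav2vec 2.0 model via torchaudio; the English base bundle here is just a placeholder, and a Chinese-pretrained checkpoint would be swapped in:

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 (English base); swap in a Chinese checkpoint if available.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, bundle.sample_rate)  # 1 second of dummy audio
with torch.no_grad():
    features, _ = model.extract_features(waveform)

# `features` is a list of per-layer tensors of shape (batch, frames, 768)
print(features[-1].shape)
```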

ChengsongLu commented 6 months ago

I think you could use DeepSpeech or wav2vec for the audio encoder; there should be a Chinese pretrained model. As for the video encoder, I guess language is not that important. I am considering trying that myself.

So if you have some experimental results, we could discuss them here.

@Nyquist0 Hey, any luck with the pretrained audio/video encoders?

Nyquist0 commented 6 months ago

Writing code... I chose the audio encoder from AVHubert. You?

ChengsongLu commented 6 months ago

Writing code... I chose the audio encoder from AVHubert. You?

Not good. I think the key to making this model work is the amount and diversity of data. What I am doing now is collecting more data before the next training run.

Anyway, I am also curious whether the pretrained encoders could help accelerate the whole pipeline, i.e., make it easier to get a powerful lip-sync detection system.

huangxin168 commented 6 months ago


Your dataset is only about 16 hours; that's far from enough. Try adding more data...

ChengsongLu commented 6 months ago

Your dataset is only about 16 hours; that's far from enough. Try adding more data...

Are you planning to share or sell some of your data?

yangppy commented 5 months ago

I'd like to ask: how do you perform sync correction on the dataset?

ChengsongLu commented 4 months ago

Writing code... I chose the audio encoder from AVHubert. You?

Do the pretrained audio/video encoders work well?

Nyquist0 commented 4 months ago

Do the pretrained audio/video encoders work well?

Hi @ChengsongLu, sorry for the late reply. It doesn't help much. I think what matters in this task is the video data, because the generation network needs not only to render a realistic image but also to get the lips right. So I am assuming the audio part, i.e., the audio encoder, is not that important.

And I agree with you that the diversity of the dataset is really important. I am using a collected dataset of 15 hours, and it is carefully cleaned.

In that case, I found the syncnet could easily converge to 0.3 after days of training at a resolution of 384. I am training the generation net now. Hope it works well.

ChengsongLu commented 4 months ago

Thanks for the information.

Did you use the pretrained audio encoder in the syncnet training that converged to 0.3? And how did you split the dataset; are you using out-of-domain data as the validation set?

The lowest loss on my OOD dataset is only about 0.5, and I think I have more than 15 hours of data in the training set. Could I ask how many IDs (different people) you have in the whole dataset?

wvinzh commented 4 months ago

I met the same problem. I trained the syncnet on our collected data, and the training loss can converge to ~0.3, but the eval loss stays at 0.5-0.6. When training for more steps, the eval loss gets higher. The train and val sets are split by person IDs, so they do not overlap. I also tried two public datasets, HDTF and vox, and the situation is the same. I also tried a pretrained audio encoder, but it did not help much. So what do you think is the main reason for this situation? @ChengsongLu @primepake

ChengsongLu commented 4 months ago

I met the same problem. I trained the syncnet on our collected data, and the training loss can converge to ~0.3, but the eval loss stays at 0.5-0.6. When training for more steps, the eval loss gets higher. The train and val sets are split by person IDs, so they do not overlap. I also tried two public datasets, HDTF and vox, and the situation is the same. I also tried a pretrained audio encoder, but it did not help much. So what do you think is the main reason for this situation? @ChengsongLu @primepake

Unfortunately, I haven't found a solution to this problem yet either, even though I have used about 2000 IDs for training.

sylyt62 commented 4 months ago

Hey @ChengsongLu, I got similar results to yours. I trained it on the CMLR dataset on a 4090 for 7 days.

[image: 20240305-151738]

jinqinn commented 3 months ago

Any update?

SakuraMaiii commented 3 months ago

Any update?

ChengsongLu commented 3 months ago

A demo of my current progress:

https://github.com/primepake/wav2lip_288x288/assets/61783323/b04fc795-243f-4bcb-83e7-c1225ae4a104

PolarRobin commented 3 months ago

@ChengsongLu Your video doesn't load for me. Does it work for anybody else?

ChengsongLu commented 3 months ago

@ChengsongLu Your video doesn't load for me. Does it work for anybody else?

You might need to download to view the video. I can't get it to load on this page either.

PolarRobin commented 3 months ago

@ChengsongLu Your video doesn't load for me. Does it work for anybody else?

You might need to download to view the video. I can't get it to load on this page either.

How do I download it, though? When I right-click, it only shows me "download audio" :smile: And it actually doesn't even play on my end :cry:

ChengsongLu commented 3 months ago

2 demos here:

https://drive.google.com/drive/folders/1opiFp6YDX-2HCU2h0ORpJyuG9D6HHEhy?usp=sharing

PolarRobin commented 3 months ago

:+1: great results!

SakuraMaiii commented 3 months ago

2 demos here:

https://drive.google.com/drive/folders/1opiFp6YDX-2HCU2h0ORpJyuG9D6HHEhy?usp=sharing

May I ask how you reduced the syncnet loss? By increasing the dataset, or by aligning audio and video? Looking forward to your reply.

ChengsongLu commented 3 months ago

2 demos here: https://drive.google.com/drive/folders/1opiFp6YDX-2HCU2h0ORpJyuG9D6HHEhy?usp=sharing

May I ask how you reduced the syncnet loss? By increasing the dataset, or by aligning audio and video? Looking forward to your reply.

I didn't do the video/audio alignment, because I found it unhelpful while costing a lot of extra time (but that holds only if the audio-video offset in your data is not visible to the naked eye).

Also, the loss on my OOD dataset is still only around 0.5; I haven't fixed the overfitting problem.

ChengsongLu commented 3 months ago

[images: loss, sim_neg, sim_pos]

Here are my loss curve and the similarity curves for positive and negative samples; I sample a batch of pos and neg pairs at each step.

From the curves you can see that the main driver of the overfitting is the positive-pair similarity (the y-axes of the 2nd and 3rd plots are similarity, not loss).

I tried giving the positive pairs a larger weight (roughly as in the sketch below) and things did ease up, but the problem was still not solved. I just ended up using the 300th-epoch checkpoint for the second stage (wav2lip) of training.
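
A minimal sketch of what such positive-pair weighting could look like, assuming the usual cosine-similarity BCE setup; the weight value is illustrative, not my exact change:

```python
import torch.nn.functional as F

def weighted_cosine_bce(audio_emb, video_emb, y, pos_weight=2.0):
    """BCE on cosine similarity with positive pairs up-weighted.

    y is 1.0 for in-sync pairs, 0.0 for off-sync pairs; pos_weight > 1
    makes mistakes on positive pairs cost more.
    """
    d = F.cosine_similarity(audio_emb, video_emb).clamp(1e-7, 1 - 1e-7)
    per_pair = F.binary_cross_entropy(d.unsqueeze(1), y, reduction="none")
    weights = 1.0 + (pos_weight - 1.0) * y  # 1.0 for neg, pos_weight for pos
    return (per_pair * weights).mean()
```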

drakitLiu commented 2 months ago

2 demos here:

https://drive.google.com/drive/folders/1opiFp6YDX-2HCU2h0ORpJyuG9D6HHEhy?usp=sharing

Thanks for sharing! And may I have your WeChat ID?

jibingyangsf commented 1 month ago

May I ask what dataset you are using? Is it a private dataset?

BTW, I found an oversight in my training: the eval dataset is not a random split from the training data, so there is still some domain gap between the train and eval sets. I think both of us should consider the following:

  1. Increasing the dataset size (diversity).
  2. Or decreasing the network size.
  3. Using some other anti-overfitting methods.
  4. Using a pretrained model to reduce the difficulty of training the network.

Can we add each other on WeChat? I have a few questions I would like to ask you.

jibingyangsf commented 1 month ago

I didn't do the video/audio alignment, because I found it unhelpful while costing a lot of extra time (but that holds only if the audio-video offset in your data is not visible to the naked eye).

Also, the loss on my OOD dataset is still only around 0.5; I haven't fixed the overfitting problem.

Can we add each other on WeChat? I have a few questions I would like to ask you. Did you train with this source code as-is, without any modifications?