ChengsongLu opened this issue 6 months ago
FYI.
I am using bs=128, lr=1e-4, and around 20K clipped videos (3s each) with around 200 different speakers.
I have read some issues in this repo saying that sync_correlate could help, but I don't think it will in this situation. If the problem were the dataset itself, the training loss should not be going down either, right?
This is a weird issue. Can you check your val dataset?
My val set is just a random split of the whole dataset. It contains faces that haven't appeared in the training set.
I think you should carefully check your val dataset; maybe it is out of sync, or the distribution of the val data is very different from the train data.
To reflect the generalization of the model, shouldn't the distribution of the val data be different from the train data? And I think that is what my current split does, because the val data uses faces that don't appear at all in the train data.
BTW, when you divide the training set and validation set, are the same faces in both sets?
I've trained on a big dataset, around 60TB, and that was not a problem as long as the sets don't share faces. It depends on how big your data is; you have to balance it. If the dataset has few faces, the model can only learn those faces, and once you split by face it cannot generalize. To conclude, if you really want your model to be good and general, you must have a large enough amount of data with diverse faces, or at least something similar to LRS2.
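The face-disjoint split discussed above can be sketched as below; this is a minimal illustration, assuming each clip is tagged with a speaker ID (the `(speaker_id, path)` tuple format is an assumption, not the repo's actual data layout):

```python
import random
from collections import defaultdict

def split_by_speaker(clips, val_ratio=0.1, seed=42):
    """Split clips so no speaker appears in both train and val.

    `clips` is a list of (speaker_id, clip_path) tuples; the ID scheme
    is hypothetical -- adapt it to however your dataset names speakers.
    """
    by_speaker = defaultdict(list)
    for speaker_id, path in clips:
        by_speaker[speaker_id].append(path)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)          # deterministic shuffle
    n_val = max(1, int(len(speakers) * val_ratio))  # at least 1 held-out ID
    val_speakers = speakers[:n_val]

    train = [p for s in speakers[n_val:] for p in by_speaker[s]]
    val = [p for s in val_speakers for p in by_speaker[s]]
    return train, val
```

Splitting by speaker rather than by clip is what makes the val loss measure generalization to unseen faces, at the cost of a larger train/val gap when the number of IDs is small.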
Thanks for providing the information. I think the main reason is that the number of faces in my dataset is too low; 200 faces may not be enough to train a general model.
Can you share some of your dataset stats? For example, total duration, number of videos after clipping, etc.
Thanks a lot.
Did you do the sync correction on the whole 60TB dataset? And did you do that before or after clipping the videos?
I met the same problem. My val dataset is a random split of the whole dataset, so it should be from the same domain as the training dataset. But unlike your run, my val loss is decreasing, just really slowly compared with the training loss.
Currently, the training loss is around 0.36 and the val loss is around 0.56, after ~500K steps. I am using the CMLR dataset for training.
I am considering adding dropout to avoid the overfitting, but training is really slow and it takes days to see the results.
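For reference, the dropout idea above is usually implemented as "inverted dropout"; a minimal numpy sketch of the formulation (in practice you would use the framework's built-in layer, e.g. `torch.nn.Dropout`):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p and scale the
    survivors by 1/(1-p), so the expected activation is unchanged.
    Acts as the identity at eval time."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)
```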
I split the data into 3 subsets: the valid set is in the same domain as the train set, and the test set is out of domain (i.e., it contains faces that never appear in the train or valid sets).
In the last epoch: train loss = 0.22340, val loss = 0.28611, test loss = 1.53218.
Referring to your training progress, I think that is expected: the bigger the domain gap between the training and evaluation datasets, the bigger the difference between their loss values.
Yep, I have too few faces to make the model general.
May I ask what dataset you are using? Is it a private dataset?
BTW, I found an oversight in my training: my eval dataset is not a random split of the training data, so there is still some domain gap between the train and eval sets. I think both of us should consider ways to mitigate this.
I would like to ask another question. @primepake
Although the training loss is decreasing, the out_of_sync_distance does not seem very large. Is this expected?
2024-01-08 02:15:09,310 - train - INFO - Step 482945 | out_of_sync_distance: 0.44087046 | Loss: 0.36827247 | Elapsed: 0.39497
2024-01-08 02:15:09,906 - train - INFO - Step 482946 | out_of_sync_distance: 0.45174262 | Loss: 0.36818710 | Elapsed: 0.41893
2024-01-08 02:15:10,527 - train - INFO - Step 482947 | out_of_sync_distance: 0.49066639 | Loss: 0.36837881 | Elapsed: 0.44272
2024-01-08 02:15:11,140 - train - INFO - Step 482948 | out_of_sync_distance: 0.53464395 | Loss: 0.36847549 | Elapsed: 0.41779
2024-01-08 02:15:11,728 - train - INFO - Step 482949 | out_of_sync_distance: 0.50022531 | Loss: 0.36851910 | Elapsed: 0.41053
2024-01-08 02:15:12,313 - train - INFO - Step 482950 | out_of_sync_distance: 0.49324411 | Loss: 0.36885640 | Elapsed: 0.40762
2024-01-08 02:15:12,937 - train - INFO - Step 482951 | out_of_sync_distance: 0.51483035 | Loss: 0.36897288 | Elapsed: 0.42812
2024-01-08 02:15:13,590 - train - INFO - Step 482952 | out_of_sync_distance: 0.55414402 | Loss: 0.36898995 | Elapsed: 0.47068
2024-01-08 02:15:14,288 - train - INFO - Step 482953 | out_of_sync_distance: 0.49771458 | Loss: 0.36888830 | Elapsed: 0.48726
2024-01-08 02:15:14,880 - train - INFO - Step 482954 | out_of_sync_distance: 0.47983459 | Loss: 0.36886459 | Elapsed: 0.45544
2024-01-08 02:15:15,488 - train - INFO - Step 482955 | out_of_sync_distance: 0.44912469 | Loss: 0.36901807 | Elapsed: 0.46548
2024-01-08 02:15:16,288 - train - INFO - Step 482956 | out_of_sync_distance: 0.54473352 | Loss: 0.36891696 | Elapsed: 0.65906
2024-01-08 02:15:16,839 - train - INFO - Step 482957 | out_of_sync_distance: 0.54054236 | Loss: 0.36872990 | Elapsed: 0.41629
2024-01-08 02:15:17,423 - train - INFO - Step 482958 | out_of_sync_distance: 0.49879405 | Loss: 0.36882858 | Elapsed: 0.40410
2024-01-08 02:15:18,001 - train - INFO - Step 482959 | out_of_sync_distance: 0.43533319 | Loss: 0.36882154 | Elapsed: 0.39902
2024-01-08 02:15:18,630 - train - INFO - Step 482960 | out_of_sync_distance: 0.44490403 | Loss: 0.36876711 | Elapsed: 0.43631
2024-01-08 02:15:19,206 - train - INFO - Step 482961 | out_of_sync_distance: 0.46191770 | Loss: 0.36872503 | Elapsed: 0.38236
2024-01-08 02:15:19,787 - train - INFO - Step 482962 | out_of_sync_distance: 0.46926844 | Loss: 0.36858617 | Elapsed: 0.40335
2024-01-08 02:15:20,390 - train - INFO - Step 482963 | out_of_sync_distance: 0.55271339 | Loss: 0.36879719 | Elapsed: 0.45285
2024-01-08 02:15:20,892 - train - INFO - Step 482964 | out_of_sync_distance: 0.44827297 | Loss: 0.36875343 | Elapsed: 0.37925
2024-01-08 02:15:21,429 - train - INFO - Step 482965 | out_of_sync_distance: 0.55192006 | Loss: 0.36897230 | Elapsed: 0.40055
2024-01-08 02:15:22,027 - train - INFO - Step 482966 | out_of_sync_distance: 0.47178933 | Loss: 0.36908765 | Elapsed: 0.42047
2024-01-08 02:15:22,537 - train - INFO - Step 482967 | out_of_sync_distance: 0.48556980 | Loss: 0.36902181 | Elapsed: 0.33365
2024-01-08 02:15:23,144 - train - INFO - Step 482968 | out_of_sync_distance: 0.50138468 | Loss: 0.36907162 | Elapsed: 0.44610
2024-01-08 02:15:23,694 - train - INFO - Step 482969 | out_of_sync_distance: 0.47587526 | Loss: 0.36915298 | Elapsed: 0.38607
2024-01-08 02:15:24,289 - train - INFO - Step 482970 | out_of_sync_distance: 0.49232754 | Loss: 0.36918712 | Elapsed: 0.39554
2024-01-08 02:15:24,840 - train - INFO - Step 482971 | out_of_sync_distance: 0.53795362 | Loss: 0.36918593 | Elapsed: 0.41125
2024-01-08 02:15:25,403 - train - INFO - Step 482972 | out_of_sync_distance: 0.52411824 | Loss: 0.36933559 | Elapsed: 0.39844
2024-01-08 02:15:25,988 - train - INFO - Step 482973 | out_of_sync_distance: 0.52927971 | Loss: 0.36940204 | Elapsed: 0.40476
2024-01-08 02:15:26,526 - train - INFO - Step 482974 | out_of_sync_distance: 0.44926518 | Loss: 0.36911347 | Elapsed: 0.40759
2024-01-08 02:15:27,143 - train - INFO - Step 482975 | out_of_sync_distance: 0.45972806 | Loss: 0.36935876 | Elapsed: 0.44226
2024-01-08 02:15:27,717 - train - INFO - Step 482976 | out_of_sync_distance: 0.50330514 | Loss: 0.36934577 | Elapsed: 0.38913
2024-01-08 02:15:28,334 - train - INFO - Step 482977 | out_of_sync_distance: 0.49505723 | Loss: 0.36923314 | Elapsed: 0.42973
2024-01-08 02:15:28,888 - train - INFO - Step 482978 | out_of_sync_distance: 0.47780257 | Loss: 0.36924535 | Elapsed: 0.38769
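The per-step values above are noisy; to see the trend, one can parse the log and track a running mean. A small sketch matching the log format shown above (the field names are taken from those lines):

```python
import re
from collections import deque

# Matches lines like:
#   ... Step 482945 | out_of_sync_distance: 0.44087046 | Loss: 0.36827247 ...
LOG_RE = re.compile(
    r"Step (\d+) \| out_of_sync_distance: ([\d.]+) \| Loss: ([\d.]+)"
)

def running_mean(lines, window=100):
    """Yield (step, mean out_of_sync_distance over the last `window` steps)."""
    buf = deque(maxlen=window)
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue  # skip lines that are not training-step entries
        step, dist = int(m.group(1)), float(m.group(2))
        buf.append(dist)
        yield step, sum(buf) / len(buf)
```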
I am using a custom dataset containing around 20K clipped videos (3s each) with around 200 different speakers.
I think you could use DeepSpeech or wav2vec for the audio encoder; there should be a Chinese pretrained model. As for the video encoder, I guess the language is not that important. I am considering trying that myself.
If you get some experiment results, we could discuss them here.
@Nyquist0 Hey, any luck with the pretrained audio/video encoders?
Still writing code... I chose the audio encoder from AV-HuBERT. You?
Not good. I think the key to making this model work is the amount and diversity of the data. What I am doing now is collecting more data before the next training run.
Anyway, I am also curious whether the pretrained encoders could help accelerate the whole pipeline, i.e., make it easier to get a powerful lip-sync detection system.
Your dataset is only about 16 hours; that's far from enough. Try adding more data...
Are you planning to share or sell some of your data?
I'd like to ask: how do you do sync correction on the dataset?
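(For context: sync correction is typically done by sliding the audio against the video and picking the offset with the highest audio-visual similarity, which is what SyncNet itself is scored on. A toy sketch of that search, assuming you already have per-frame audio and video embeddings from some pretrained model; `find_av_offset` and its inputs are illustrative names, not this repo's API:)

```python
import numpy as np

def find_av_offset(audio_emb, video_emb, max_shift=10):
    """Toy sync-offset search: for each candidate shift, compare the
    overlapping audio/video embedding frames by mean cosine similarity,
    and return the shift that matches best. Embeddings are (T, D) arrays.
    Returned shift s means audio frame t+s aligns with video frame t."""
    def mean_cos(a, b):
        a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
        return float((a * b).sum(axis=1).mean())

    best_shift, best_sim = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            a, v = audio_emb[s:], video_emb[:len(video_emb) - s]
        else:
            a, v = audio_emb[:s], video_emb[-s:]
        n = min(len(a), len(v))
        if n == 0:
            continue
        sim = mean_cos(a[:n], v[:n])
        if sim > best_sim:
            best_shift, best_sim = s, sim
    return best_shift
```

Once the best offset is found, the audio (or video) is trimmed by that many frames so each clip starts in sync.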
Do the pretrained audio/video encoders work well?
Hi @ChengsongLu, sorry for the late reply. It doesn't help much. I think what matters in this task is the video data, because the generation network not only needs to render a realistic image but also to get the lips right. So I am assuming the audio part, e.g. the audio encoder, is not really important.
And I agree with you that the diversity of the dataset is really important. I am using a collected dataset of about 15 hours, and it is carefully cleaned.
In that case, I found the syncnet could easily converge to 0.3 after a few days of training at a resolution of 384. I am training the generation net now. Hope it works well.
Thanks for the information.
Did you use the pretrained audio encoder in the syncnet training that converged to 0.3? And how did you split the dataset? Are you using out-of-domain data as the validation set?
The lowest loss on my OOD dataset is only about 0.5, and I think I have more than 15 hours of data in the training set. Could I ask how many IDs (different people) you have in the whole dataset?
I met the same problem. I trained the syncnet on our collected data and the training loss can converge to ~0.3, but the eval loss is 0.5~0.6, and with more training steps the eval loss gets even higher. The train and val sets are split by person IDs, so they do not overlap. I also tried two public datasets, HDTF and vox, and the situation is the same. A pretrained audio encoder did not help much either. So what do you think is the main reason for this? @ChengsongLu @primepake
Unfortunately, I haven't found a solution to this problem yet either, even though I have used about 2000 IDs for training.
Hey @ChengsongLu, I got similar results to yours. I trained with the CMLR dataset on a 4090 for 7 days.
Any update?
A demo of my current progress:
https://github.com/primepake/wav2lip_288x288/assets/61783323/b04fc795-243f-4bcb-83e7-c1225ae4a104
@ChengsongLu Your video doesn't load for me. Does it work for anybody else?
You might need to download the video to view it. I can't get it to load on this page either.
How do I download it, though? When I right-click, it only shows me "download audio" :smile: And it actually doesn't even play on my end :cry:
:+1: great results!
2 demos here:
https://drive.google.com/drive/folders/1opiFp6YDX-2HCU2h0ORpJyuG9D6HHEhy?usp=sharing
May I ask how you reduced the syncnet loss? By increasing the dataset, or by aligning the audio and video? Looking forward to your reply.
I didn't do the video/audio alignment, because I found it unhelpful and it cost a lot of extra time (but only if your data's audio-video offset is not visible to the naked eye).
Also, the loss on my OOD dataset is only around 0.5; I haven't fixed the overfitting problem.
Here are my loss curve and the positive- and negative-sample similarity curves. I sample a batch of positive and negative samples at each step.
From the curves you can see that the main driver of overfitting is the positive similarity. (The y-axes of the 2nd and 3rd plots are similarity, not loss.)
I tried giving the positives a larger weight and things did ease up, but the problem was still not solved. In the end I just used the checkpoint from the 300th epoch for the second stage (wav2lip) of training.
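The positive-weighting idea above can be sketched as a weighted BCE over audio-visual similarities. This is an illustration, not the repo's exact loss; it assumes the similarities have already been mapped into [0, 1], and `pos_weight` is the knob being tuned:

```python
import numpy as np

def weighted_sync_bce(sim, labels, pos_weight=2.0, eps=1e-7):
    """BCE over similarities in [0, 1], with positive (in-sync) pairs
    up-weighted by `pos_weight` to push their similarity higher.
    `labels` is 1 for in-sync pairs, 0 for out-of-sync pairs."""
    p = np.clip(sim, eps, 1.0 - eps)          # avoid log(0)
    w = np.where(labels == 1, pos_weight, 1.0)
    loss = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float((w * loss).mean())
```

With `pos_weight=1.0` this reduces to the plain BCE; larger values penalize low positive similarity more strongly, matching the "larger weight on positives" experiment described above.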
2 demos here: https://drive.google.com/drive/folders/1opiFp6YDX-2HCU2h0ORpJyuG9D6HHEhy?usp=sharing
Thanks for sharing! And may I have your WeChat ID?
May I ask what dataset you are using? Is it a private dataset?
BTW, I found an oversight in my training: my eval dataset is not a random split of the training data, so there is still some domain gap between the train and eval sets. I think both of us should consider:
- Increasing the dataset size (diversity).
- Or decreasing the network size.
- Using some other anti-overfitting methods.
- Using a pretrained model to reduce the difficulty of training the network.
Can we add each other on WeChat? I have a few questions I would like to ask you.
Can we add each other on WeChat? I have a few questions I would like to ask you. Did you use this source code as-is, without any modifications?
Thanks for sharing such a great project!
I'm having an issue training syncnet with my own dataset.
As shown above, the network suffers from severe overfitting. Have you encountered a similar situation? If so, how did you resolve it?
Thanks!