primepake / wav2lip_288x288

MIT License
Why does my train loss increase after introducing sync loss? #140

Open Marskly opened 2 months ago

Marskly commented 2 months ago

[image] After introducing the sync loss at step 250000, the L1 loss, Vgg loss, and Percep loss are all increasing. Is that because the sync loss is too big, so that it dominates the updates to the model's weights?
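One way to reason about this question is to look at how much each term contributes to the total. The sketch below assumes a weighted sum in the style of the original Wav2Lip training script, where the L1 weight is whatever remains after the sync and discriminator weights; the exact formula in this repository may differ (it also logs a Vgg term, which is left out here). The function name and all loss values are hypothetical and only illustrate the arithmetic.

```python
def weighted_total(l1, sync, percep, sync_wt, disc_wt):
    """Combine losses as a weighted sum (assumed form, not taken from this repo)."""
    l1_wt = 1.0 - sync_wt - disc_wt  # assumption: L1 gets the remaining weight
    parts = {
        "l1": l1_wt * l1,
        "sync": sync_wt * sync,
        "percep": disc_wt * percep,
    }
    return sum(parts.values()), parts


if __name__ == "__main__":
    # Hypothetical values: only the weights change between the two calls.
    before, _ = weighted_total(l1=0.09, sync=4.0, percep=0.7, sync_wt=0.0, disc_wt=0.0)
    after, parts = weighted_total(l1=0.09, sync=4.0, percep=0.7, sync_wt=0.03, disc_wt=0.0)
    print(f"total before sync loss: {before:.4f}")
    print(f"total after sync loss:  {after:.4f}")
    print({name: round(value, 4) for name, value in parts.items()})
```

With numbers like these, even a small sync weight makes the sync term larger than the L1 term, so the optimizer can trade some reconstruction quality for better sync. Whether that is what happens in this run depends on the real sync loss values.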

see2run commented 2 months ago

Hey, can you share what you did, from dataset preparation to running the script train_syncnet_sam.py? I've been trying, and the output just gets stuck like this without any progress:

(w2l_cek) vian:~/wav2lip_288x288$ python3 train_syncnet_sam.py
use_cuda: True
total trainable params 65054464
Training From Scratch !!!
Starting Epoch: 0

Marskly commented 1 month ago

Maybe your CPU is loading data too slowly. Monitor your CPU utilization and GPU memory usage while the script runs, and try a smaller batch size.
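A rough way to check whether data loading is the bottleneck is to time how long each batch takes to arrive. The sketch below is not part of the repository; `measure_load_time` is a hypothetical helper that accepts any iterable, so a real data loader could be passed in place of the stand-in generator.

```python
import time


def measure_load_time(loader, num_batches=20):
    """Return the average seconds spent waiting for each batch from `loader`."""
    waits = []
    batches = iter(loader)
    for _ in range(num_batches):
        start = time.perf_counter()
        try:
            next(batches)  # blocks until the next batch is ready
        except StopIteration:
            break
        waits.append(time.perf_counter() - start)
    return sum(waits) / len(waits) if waits else 0.0


if __name__ == "__main__":
    # Stand-in for a real data loader: each "batch" takes about 10 ms to produce.
    def slow_batches():
        for i in range(5):
            time.sleep(0.01)
            yield i

    print(f"avg wait per batch: {measure_load_time(slow_batches()):.4f}s")
```

If the average wait is large compared with the time of one training step, the loader is likely the limiting factor, and a smaller batch size or more loader workers may help.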

Liming-belief commented 1 month ago

Hello, I have encountered the same problem as you. Have you resolved it, @Marskly?

see2run commented 1 month ago

Okay, I have solved it, thank you. Now, when training, the output looks like this:

Step 259 | L1: 0.08976 | Vgg: 0.3718 | SW: 0.03 | Sync: 0.0 | DW: 0.0 | Percep: 0.0 | Fake: 0.0, Real: 0.0 | Load: 0.01096, Train: 1.225

Here Percep, Fake, and Real are always 0.0. Can you provide any suggestions? I am training with 1725 videos.
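To see at a glance which terms stay at zero across many steps, the log lines can be parsed. This is a minimal sketch that assumes every line follows the `Name: value` layout shown above; `parse_log_line` and `zero_terms` are hypothetical helpers, not functions from the repository.

```python
import re

# Matches pairs such as "L1: 0.08976" or "Train: 1.225".
_PAIR = re.compile(r"([A-Za-z0-9]+):\s*([0-9]*\.?[0-9]+)")


def parse_log_line(line):
    """Map each 'Name: value' pair in a training log line to a float."""
    return {name: float(value) for name, value in _PAIR.findall(line)}


def zero_terms(line):
    """Return the names whose logged value is exactly 0."""
    return [name for name, value in parse_log_line(line).items() if value == 0.0]


if __name__ == "__main__":
    line = (
        "Step 259 | L1: 0.08976 | Vgg: 0.3718 | SW: 0.03 | Sync: 0.0 | DW: 0.0 | "
        "Percep: 0.0 | Fake: 0.0, Real: 0.0 | Load: 0.01096, Train: 1.225"
    )
    print(zero_terms(line))
```

On the line above this reports Sync and DW as zero in addition to Percep, Fake, and Real. If DW is a weight on the discriminator terms (an assumption based only on the label), that setting may be worth checking in the training configuration.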