hnsywangxin opened this issue 5 days ago
The loss should be BCE instead of MSE loss. Also, can you provide the code?
Thanks for your reply. I used BCE loss, but the result is the same. I only changed syncnet.py; the other files are the same as in your repo, and my HuBERT features come from Meta's official HuBERT repo. My SyncNet is as follows:
```python
import torch.nn as nn
import torch.nn.functional as F

# Conv2d here is Wav2Lip's wrapper (models/conv.py), which supports the
# `residual` and `act` arguments; SameBlock1d, ResBlock1d and DownBlock1d
# are taken from DINet's models/Syncnet.py (links below).


class SyncNet_color_hubert(nn.Module):
    def __init__(self):
        super(SyncNet_color_hubert, self).__init__()

        self.face_encoder = nn.Sequential(
            Conv2d(15, 16, kernel_size=(7, 7), stride=1, padding=3, act="leaky"),  # 192, 384
            Conv2d(16, 32, kernel_size=5, stride=(1, 2), padding=1, act="leaky"),  # 192, 192
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(32, 64, kernel_size=3, stride=2, padding=1, act="leaky"),  # 96, 96
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(64, 128, kernel_size=3, stride=2, padding=1, act="leaky"),  # 48, 48
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(128, 256, kernel_size=3, stride=2, padding=1, act="leaky"),  # 24, 24
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(256, 512, kernel_size=3, stride=2, padding=1, act="leaky"),
            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),  # 12, 12

            Conv2d(512, 1024, kernel_size=3, stride=2, padding=1, act="leaky"),
            Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),  # 6, 6

            Conv2d(1024, 1024, kernel_size=3, stride=2, padding=1, act="leaky"),  # 3, 3
            Conv2d(1024, 1024, kernel_size=3, stride=1, padding=0, act="leaky"),
            Conv2d(1024, 1024, kernel_size=1, stride=1, padding=0, act="relu"))  # 1, 1

        self.audio_encoder = nn.Sequential(
            SameBlock1d(1024, 1024, kernel_size=7, padding=3),  # 10
            ResBlock1d(1024, 1024, 3, 1),
            # 9-5
            DownBlock1d(1024, 1024, 3, 1),  # 5
            ResBlock1d(1024, 1024, 3, 1),
            # 5-3
            DownBlock1d(1024, 1024, 3, 1),
            ResBlock1d(1024, 1024, 3, 1),
            # 3-2
            DownBlock1d(1024, 1024, 3, 1),
            SameBlock1d(1024, 1024, kernel_size=3, padding=1))

        self.global_avg1d = nn.AdaptiveAvgPool1d(1)

    def forward(self, audio_sequences, face_sequences):
        # audio_sequences: (B, T, dim) HuBERT features; permuted below to
        # (B, dim, T) so the Conv1d-based audio encoder sees 1024 channels.
        face_embedding = self.face_encoder(face_sequences)
        audio_sequences = audio_sequences.permute(0, 2, 1)
        audio_embedding = self.audio_encoder(audio_sequences)              # (8, 1024, 1)
        audio_embedding = self.global_avg1d(audio_embedding).unsqueeze(2)  # (8, 1024, 1, 1)
        audio_embedding = audio_embedding.view(audio_embedding.size(0), -1)
        face_embedding = face_embedding.view(face_embedding.size(0), -1)
        # audio_embedding = F.normalize(audio_embedding, p=2, dim=1)
        face_embedding = F.normalize(face_embedding, p=2, dim=1)
        return audio_embedding, face_embedding
```
ResBlock1d and DownBlock1d are taken from DINet: https://github.com/MRzzm/DINet/blob/3b57fb0a2482213327890fbb76baeafdaa412597/models/Syncnet.py#L3 and https://github.com/MRzzm/DINet/blob/3b57fb0a2482213327890fbb76baeafdaa412597/models/Syncnet.py#L55. Thanks again.
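For reference, the BCE-on-cosine-similarity loss suggested above can be sketched as follows (a minimal sketch in the style of Wav2Lip's SyncNet training loop; the shapes and the non-negative-embedding assumption are illustrative, not taken from the code in this thread):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch: the cosine similarity between the audio and face embeddings is
# treated as the probability that the pair is in sync, trained with BCE.
logloss = nn.BCELoss()

def cosine_loss(a, v, y):
    # a, v: (B, D) audio / face embeddings; y: (B, 1) in-sync labels in {0, 1}.
    # cosine_similarity L2-normalizes internally, so manual F.normalize is not
    # strictly required. BCELoss needs inputs in [0, 1], which holds only when
    # both embeddings are non-negative (e.g. each encoder ends in a ReLU).
    d = F.cosine_similarity(a, v)
    return logloss(d.unsqueeze(1), y)

a = F.relu(torch.randn(8, 1024))          # stand-in audio embeddings
v = F.relu(torch.randn(8, 1024))          # stand-in face embeddings
y = torch.randint(0, 2, (8, 1)).float()   # stand-in sync labels
loss = cosine_loss(a, v, y)
print(float(loss) >= 0.0)  # prints True
```

Note that this only stays in BCELoss's valid input range if both embeddings are non-negative; an audio encoder ending in a LeakyReLU-style block can emit negative similarities.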
I have replaced the mel spectrogram with HuBERT features to train Wav2Lip, and training runs, but when training SyncNet the loss keeps hovering around 0.69 and won't go down. With mel spectrograms the loss does decrease. I would like to ask for help to see what the problem might be.
1: The face encoding dimension of Wav2Lip is (8, 1024, 1, 1), where 8 is the batch size. However, the HuBERT feature dimension I use is (8, 1024, 10). The mel input dimension is (8, 1, 80, 16), which becomes (8, 1024, 1, 1) after convolution and trains normally. So I first use permute to rearrange the dimensions, then use Conv1d layers to reduce the last (time) dimension, finally obtaining (8, 1024, 1, 1). The code is as follows:
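The permute-then-Conv1d reduction described here can be illustrated with a minimal stand-alone sketch (the single Conv1d layer and the (B, T, 1024) input layout are my assumptions for illustration, not the actual audio_encoder):

```python
import torch
import torch.nn as nn

# Hypothetical walk-through: HuBERT frames arrive as (B, T, 1024), are moved
# to channels-first for Conv1d, then pooled down to a single vector shaped
# like the mel-based audio embedding (B, 1024, 1, 1).
feats = torch.randn(8, 10, 1024)      # 10 HuBERT frames per sample
feats = feats.permute(0, 2, 1)        # -> (8, 1024, 10), channels-first
reduce = nn.Conv1d(1024, 1024, kernel_size=3, stride=2, padding=1)  # 10 -> 5
pooled = nn.AdaptiveAvgPool1d(1)(reduce(feats))                     # -> (8, 1024, 1)
emb = pooled.unsqueeze(3)             # -> (8, 1024, 1, 1)
print(tuple(emb.shape))  # prints (8, 1024, 1, 1)
```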
2: audio_encoder code:
I also made the network deeper, but it still didn't work. The new network is as follows:
I also changed BCELoss to MSELoss, but the loss still does not converge! Can you help me? Thanks!
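One observation about the 0.69 figure: 0.69 ≈ ln 2 is exactly the BCE loss of a constant 0.5 prediction, so a loss stuck there means the network is giving chance-level output rather than converging slowly. A quick check (the 0/1 labels below are arbitrary):

```python
import math
import torch
import torch.nn as nn

# BCE of a constant 0.5 prediction against any 0/1 labels equals ln(2) ~ 0.6931
# -- the value the training loss is stuck at, i.e. chance-level predictions.
pred = torch.full((8, 1), 0.5)
target = torch.tensor([[0.], [1.], [1.], [0.], [1.], [0.], [0.], [1.]])
loss = nn.BCELoss()(pred, target)
print(round(loss.item(), 4), round(math.log(2), 4))  # 0.6931 0.6931
```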