primepake / wav2lip_288x288

MIT License

Use HuBERT features to train SyncNet, the loss does not converge. #150

Open hnsywangxin opened 5 days ago

hnsywangxin commented 5 days ago

I replaced the mel spectrogram with HuBERT features to train Wav2Lip, and training runs, but when training SyncNet the loss keeps hovering around 0.69 and won't go down. With mel spectrograms it does decrease. Could you help me figure out what the problem might be?

1: The face-encoder output of Wav2Lip has shape (8, 1024, 1, 1), where 8 is the batch size. However, the HuBERT features I use have shape (8, 1024, 10). The mel input has shape (8, 1, 80, 16) and becomes (8, 1024, 1, 1) after convolution, which trains normally. So I first use permute to rearrange the dimensions, then use Conv1d layers to reduce the last dimension, ending up with (8, 1024, 1, 1). The code is as follows:

[image: screenshot of the dimension-reduction code]
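Since the screenshot is not reproduced in text, here is a minimal sketch of the reduction described in step 1, assuming the raw HuBERT output is (batch, frames, dim) = (8, 10, 1024), matching the permute in the forward() posted below; the layer sizes are illustrative, not the exact ones used:

```python
import torch
import torch.nn as nn

# HuBERT features for one window: (batch, frames, dim) = (8, 10, 1024)
hubert_feat = torch.randn(8, 10, 1024)

# Put the feature dimension on the channel axis for Conv1d: (8, 1024, 10)
x = hubert_feat.permute(0, 2, 1)

# Reduce the temporal axis 10 -> 1 with strided Conv1d layers (hypothetical sizes)
reduce = nn.Sequential(
    nn.Conv1d(1024, 1024, kernel_size=3, stride=2, padding=1),  # 10 -> 5
    nn.ReLU(inplace=True),
    nn.Conv1d(1024, 1024, kernel_size=3, stride=2, padding=1),  # 5 -> 3
    nn.ReLU(inplace=True),
    nn.Conv1d(1024, 1024, kernel_size=3, stride=1, padding=0),  # 3 -> 1
)
y = reduce(x)        # (8, 1024, 1)
y = y.unsqueeze(-1)  # (8, 1024, 1, 1), matching the mel branch output
```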

2: The audio_encoder code:

[image: screenshot of the audio_encoder code]

I also modified the network to make it deeper, but it still didn't work. The new network is as follows:

[image: screenshot of the deeper network]

I also changed BCELoss to MSELoss, but the loss still does not converge. Can you help me? Thanks!

primepake commented 3 days ago

the loss should be BCE instead of MSE loss. also can you provide the code?
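For reference, SyncNet training in the public Wav2Lip training script applies BCE to the cosine similarity between the two embeddings, roughly like this (note that nn.BCELoss requires its input in [0, 1], which holds when both encoders end in ReLU so the embeddings are non-negative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logloss = nn.BCELoss()

def cosine_loss(a, v, y):
    # BCE over cosine similarity of audio (a) and face (v) embeddings;
    # y is 1 for in-sync pairs, 0 for out-of-sync pairs
    d = F.cosine_similarity(a, v)
    return logloss(d.unsqueeze(1), y)

# toy batch: non-negative embeddings keep the similarity in [0, 1]
a = torch.rand(8, 1024)
v = torch.rand(8, 1024)
y = torch.cat([torch.ones(4, 1), torch.zeros(4, 1)])
loss = cosine_loss(a, v, y)
```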

hnsywangxin commented 3 days ago

> the loss should be BCE instead of MSE loss. also can you provide the code?

Thanks for your reply. I used BCE loss, but the result is the same. I only changed syncnet.py; the other files are the same as in your repo, and my HuBERT features come from Meta's official HuBERT repo. My SyncNet is as follows:

import torch.nn as nn
import torch.nn.functional as F
# Conv2d is the repo's conv wrapper; SameBlock1d, ResBlock1d and DownBlock1d
# come from DINet's Syncnet.py (links below)

class SyncNet_color_hubert(nn.Module):
    def __init__(self):
        super(SyncNet_color_hubert, self).__init__()

        self.face_encoder = nn.Sequential(
            Conv2d(15, 16, kernel_size=(7, 7), stride=1, padding=3, act="leaky"),  # 192, 384

            Conv2d(16, 32, kernel_size=5, stride=(1, 2), padding=1, act="leaky"),  # 192, 192
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(32, 32, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(32, 64, kernel_size=3, stride=2, padding=1, act="leaky"),  # 96, 96
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(64, 64, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(64, 128, kernel_size=3, stride=2, padding=1, act="leaky"),  # 48, 48
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(128, 128, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(128, 256, kernel_size=3, stride=2, padding=1, act="leaky"),  # 24, 24
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(256, 256, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),

            Conv2d(256, 512, kernel_size=3, stride=2, padding=1, act="leaky"),
            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(512, 512, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),  # 12, 12

            Conv2d(512, 1024, kernel_size=3, stride=2, padding=1, act="leaky"),
            Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),
            Conv2d(1024, 1024, kernel_size=3, stride=1, padding=1, residual=True, act="leaky"),  # 6, 6

            Conv2d(1024, 1024, kernel_size=3, stride=2, padding=1, act="leaky"),  # 3, 3
            Conv2d(1024, 1024, kernel_size=3, stride=1, padding=0, act="leaky"),
            Conv2d(1024, 1024, kernel_size=1, stride=1, padding=0, act="relu"))  # 1, 1

        self.audio_encoder = nn.Sequential(
            SameBlock1d(1024, 1024, kernel_size=7, padding=3), # 10
            ResBlock1d(1024, 1024, 3, 1),
            # 9-5
            DownBlock1d(1024, 1024, 3, 1), # 5
            ResBlock1d(1024, 1024, 3, 1),
            # 5 -3
            DownBlock1d(1024, 1024, 3, 1),
            ResBlock1d(1024, 1024, 3, 1),
            # 3-2
            DownBlock1d(1024, 1024, 3, 1),
            SameBlock1d(1024, 1024, kernel_size=3, padding=1)
        )
        self.global_avg1d = nn.AdaptiveAvgPool1d(1)

    def forward(self, audio_sequences, face_sequences):  # audio_sequences: (B, T, dim), e.g. (8, 10, 1024)
        face_embedding = self.face_encoder(face_sequences)
        audio_sequences = audio_sequences.permute(0, 2, 1)
        audio_embedding = self.audio_encoder(audio_sequences)  # (B, 1024, T')
        audio_embedding = self.global_avg1d(audio_embedding).unsqueeze(2)
        audio_embedding = audio_embedding.view(audio_embedding.size(0), -1)
        face_embedding = face_embedding.view(face_embedding.size(0), -1)

        # audio_embedding = F.normalize(audio_embedding, p=2, dim=1)
        face_embedding = F.normalize(face_embedding, p=2, dim=1)

        return audio_embedding, face_embedding

ResBlock1d and DownBlock1d follow DINet: https://github.com/MRzzm/DINet/blob/3b57fb0a2482213327890fbb76baeafdaa412597/models/Syncnet.py#L3 and https://github.com/MRzzm/DINet/blob/3b57fb0a2482213327890fbb76baeafdaa412597/models/Syncnet.py#L55. Thanks again.
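As a side note, the shape bookkeeping in the posted forward() can be sanity-checked with plain Conv1d stand-ins (hypothetical layers, not the actual DINet blocks), verifying that a (B, T, 1024) HuBERT input ends up as a (B, 1024) embedding after permute, encoding, and pooling:

```python
import torch
import torch.nn as nn

# Plain Conv1d stand-ins mirroring the posted audio_encoder's length trace
# (10 -> 10 -> 5 -> 3 -> 2); not the DINet SameBlock1d/DownBlock1d blocks.
encoder = nn.Sequential(
    nn.Conv1d(1024, 1024, kernel_size=7, padding=3),            # 10 -> 10
    nn.Conv1d(1024, 1024, kernel_size=3, stride=2, padding=1),  # 10 -> 5
    nn.Conv1d(1024, 1024, kernel_size=3, stride=2, padding=1),  # 5 -> 3
    nn.Conv1d(1024, 1024, kernel_size=3, stride=2, padding=1),  # 3 -> 2
)
pool = nn.AdaptiveAvgPool1d(1)

audio = torch.randn(8, 10, 1024)   # (B, T, dim) from HuBERT
x = audio.permute(0, 2, 1)         # (B, 1024, T) for Conv1d
x = pool(encoder(x))               # (B, 1024, 1)
emb = x.view(x.size(0), -1)        # (B, 1024), same shape as the face embedding
```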