sstzal / DiffTalk

[CVPR2023] The implementation for "DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation"

RAM usage keeps increasing during one epoch #19

Open quqixun opened 1 year ago

quqixun commented 1 year ago

After preprocessing the HDTF dataset, I got 415 videos. 249 videos (60%) were randomly selected as the training set, and the remaining 166 (40%) were used as the test set. The first 1500 frames of each video were extracted for training with stride 2, giving 277,117 frames in the training set and 179,711 frames in the test set.
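
For reference, one possible reading of "first 1500 frames with stride 2", as a rough OpenCV sketch with placeholder paths (not the repo's actual preprocessing code):

import cv2

cap = cv2.VideoCapture('video.mp4')  # placeholder path
kept = frame_idx = 0
while kept < 1500:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % 2 == 0:  # stride 2: keep every other frame
        cv2.imwrite(f'images/0_{kept}.jpg', frame)  # hypothetical naming
        kept += 1
    frame_idx += 1
cap.release()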

My machine has 4 A100 GPUs with 40 GB of VRAM each, 377 GB of RAM, and 72 GB of swap. In training, the batch size is set to 16. During the first epoch, RAM usage keeps increasing. At step 2743, all RAM was occupied (even the swap space) and training stopped. Thus 2743 × 16 × 4 = 175,552 is the maximum number of frames that can be used for training on my machine, and that is without taking the test set into account. When I reduced both the training and test sets to 10,000 frames, training ran fine.

Questions @sstzal :

  • Did you meet the same problem in your training?
  • If so, how did you solve the problem?
  • Is it possible to release the weights of the diffusion model?

I guess the cause of this problem is that too many logs accumulate during training.
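
If it is an accumulation issue, a minimal sketch (my own, not from the repo) for watching the process RSS per training step with psutil:

import os
import psutil

def log_rss(step):
    # Print the resident set size of the current process in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print(f"step {step}: RSS = {rss_gb:.2f} GB")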

rjc7011855 commented 1 year ago

Hello, may I ask how you extracted the audio features?

quqixun commented 1 year ago

@rjc7011855 Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features and make the speech features have shape [8, 16, 29].

xz0305 commented 1 year ago

Hello, how do you get the landmarks, please?

quqixun commented 1 year ago

@xz0305 It is quite simple to get 68 landmarks using dlib. http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2

import cv2
import dlib
import numpy as np


class LandmarksExtractor(object):

    def __init__(self, model_path):
        # model_path points to the unpacked shape_predictor_68_face_landmarks.dat
        self.detector = dlib.get_frontal_face_detector()
        self.predictor = dlib.shape_predictor(model_path)

    def forward(self, image, is_rgb=True):
        # dlib expects an RGB image; convert if a BGR (OpenCV) image is passed.
        if not is_rgb:
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        landmarks = self.__predict(image)
        return landmarks

    def __predict(self, image):
        # Upsample the image once (second argument) to detect smaller faces,
        # then predict 68 landmarks on the first detected face.
        faces = self.detector(image, 1)
        assert len(faces) > 0, 'no face detected'
        landmarks = self.predictor(image, faces[0])
        return self.shape_to_np(landmarks)

    @staticmethod
    def shape_to_np(shape, dtype=int):
        # Convert a dlib full_object_detection into a (68, 2) numpy array.
        coords = np.zeros((68, 2), dtype=dtype)
        for i in range(68):
            coords[i] = (shape.part(i).x, shape.part(i).y)
        return coords
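
For reference, a hypothetical usage (paths and file names are placeholders):

extractor = LandmarksExtractor('shape_predictor_68_face_landmarks.dat')
image = cv2.imread('frame_0.jpg')                   # OpenCV loads images as BGR
landmarks = extractor.forward(image, is_rgb=False)  # (68, 2) integer array
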
xz0305 commented 1 year ago

Thank you very much!

rjc7011855 commented 1 year ago

Thank you very much!

979277 commented 1 year ago

@rjc7011855 Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features and try to make speech features in shape [8, 16, 29].

Hello, may I ask how your reproduction results turned out? Could we discuss them?

quqixun commented 1 year ago

@979277

I trained for some epochs; here are some results. difftalk_demo.zip

After preprocessing I have 400+ video clips in total. The author only gave the video names for the training set, not for the test set, so I just split the dataset randomly.

Because memory usage keeps increasing during training (see the problem description at the top), after several experiments I ended up using the first 1100 frames of each video (taking every other frame) for training and testing. The videos in difftalk_demo.zip are from the test set, generated with the first 720 consecutive frames. As you can see, it does work to some extent.

In later experiments I plan to reduce the number of videos and use all frames of each video: use data where multiple clips can be cut from the same video, take one clip as the test set and the other clips as the training set, then train again and see how it goes.

There is no validation set during training, only a test set, and the final results are also evaluated on the test set, so there may be a risk of data leakage. The author probably did the same.

xz0305 commented 1 year ago

How did you download the HDTF data? The videos I downloaded have no sound.

quqixun commented 1 year ago

@xz0305 You can use youtube-dl or yt-dlp to download the videos with the best quality in both the video and audio streams.
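
For example, with yt-dlp (the output template is just an illustration):

yt-dlp -f "bestvideo+bestaudio" --merge-output-format mp4 -o "%(id)s.%(ext)s" <video_url>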

xz0305 commented 1 year ago

@quqixun Hello, does this step save the audio of each frame as an .npy file? When I do it that way, the generated features all have length 0. Could you explain the exact process?

quqixun commented 1 year ago

@xz0305 What is saved are the audio features, extracted with https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features

Tinaa23 commented 1 year ago

@xz0305 What is saved are the audio features, extracted with https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features

Thank you for sharing this link. If a video contains 3000 frames, using this repo for audio feature extraction returns one .npy file with shape (3000, 16, 29). However, for the DiffTalk model we need a separate .npy file for each frame. Can you please share how we can do this? Thanks

quqixun commented 1 year ago

@Tinaa23

Make (3000, 16, 29) into (3000, 8, 16, 29):

  • 3000: number of frames
  • 8: sequence length for each frame
  • 16: window size
  • 29: number of features

See https://github.com/sstzal/DiffTalk/issues/10#issuecomment-1641661343 .

Or you can refer to the code at https://github.com/miu200521358/NeuralVoicePuppetryMMD/blob/master/Audio2ExpressionNet/Training%20Code/data/audio_dataset.py#L85 ; there are two ways to generate the sequence. A rough sketch of the idea follows.
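
A sketch of one approach (my own code, with hypothetical file names): each frame's (8, 16, 29) sequence stacks the features of 8 neighboring frames centered on frame i, with indices clamped at the clip boundaries.

import numpy as np

feats = np.load('audio_feats.npy')  # hypothetical file with shape (3000, 16, 29)
n_frames, seq_len = feats.shape[0], 8

for i in range(n_frames):
    # Window of 8 neighboring feature frames around frame i,
    # clamped at the start/end of the clip; assumes audio_smooth/ exists.
    idx = np.clip(np.arange(i - seq_len // 2, i + seq_len // 2), 0, n_frames - 1)
    np.save(f'audio_smooth/0_{i}.npy', feats[idx])  # shape (8, 16, 29)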

979277 commented 1 year ago

(Quoting @quqixun's results comment above.)

May I ask whether you ran a full evaluation? In my experiments this method does not seem to work well for some identities unseen in the training set.

zyhsuperman commented 1 year ago

Do the extracted video frames and audio frames correspond one-to-one? I processed the video at 25 fps and took the first 1000 frames, so the audio should correspond to 40 s; at a 16 kHz sampling rate that gives 2400 audio frames in total. How should this be handled?

Tinaa23 commented 11 months ago

(Quoting @quqixun's results comment above.)

Hi, I have a basic question and I hope you can help me with it. How can we specify the number of epochs in this code? The model only trains for one epoch on my machine.

sstzal commented 9 months ago

The audio processing follows AD-NeRF, using DeepSpeech as the audio feature extractor.

I did not observe ever-increasing memory usage in my experiments. If you can find the cause, feel free to point it out and fix it; the change can also be merged into this project.

The results in difftalk_demo.zip look decent. In practical applications we also add a post-processing step: specifically, we use [Real-time intermediate flow estimation for video frame interpolation] (RIFE) for frame interpolation to obtain smoother video.

kaiw7 commented 8 months ago

(Quoting the original issue description above.)

Hi, could I know whether your downloaded HDTF videos have an audio stream? Could you share the download link? Many thanks

Utkarsh-shift commented 8 months ago

@rjc7011855 Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features and try to make speech features in shape [8, 16, 29].

I am getting [x, 16, 29], where x is the number of frames, after running deepspeech_features.

Utkarsh-shift commented 8 months ago

Thanks, I got your answer in the comment above.

kaiw7 commented 7 months ago

Hi, could I ask how to download the dataset? I ran into some issues downloading it. Thank you very much

jinlingxueluo commented 6 months ago

(Quoting the dlib landmark extraction code above.)

I followed AD-NeRF's processing. Did you ever run into an error like RuntimeError: stack expects each tensor to be equal size, but got [4, 16, 29] at entry 0 and [8, 16, 29] at entry 1?

SCP2922 commented 1 month ago

Regarding the layout in the instructions:

data/HDTF
|——images
   |——0_0.jpg
   |——0_1.jpg
   |——...
   |——N_M.bin
|——landmarks
   |——0_0.lmd
   |——0_1.lmd
   |——...
   |——N_M.lms
|——audio_smooth
   |——0_0.npy
   |——0_1.npy
   |——...
   |——N_M.npy

Do 0_0.jpg and 0_1.jpg denote the first and second frames of a single video, or the first frame of each segment after a video is split? What information do N_M.bin and N_M.lms store? And what is the correspondence between the audio file 0_0.npy and 0_0.jpg: audio features within a single frame, or within a segment? I hope someone can help clarify this. Many thanks