pixeli99 / SVD_Xtend

Stable Video Diffusion Training Code and Extensions.

About the GPU settings used during training #31

Closed yunniw2001 closed 5 months ago

yunniw2001 commented 6 months ago

Hi, your code is very useful!

I'd like to know what GPU setup you used when training. Does it require 80GB of GPU memory, or can it run with less?

Thanks a lot!

Kiteretsu77 commented 5 months ago

Based on my testing, if you use the default temporal-layer-only fine-tuning method at 320x448 (no CFG), it only needs approximately 23GB of memory. Thanks!

yunniw2001 commented 5 months ago

> Based on my testing, if you use the default temporal-layer-only fine-tuning method at 320x448 (no CFG), it only needs approximately 23GB of memory. Thanks!

Hi, many thanks for your reply. I tested the code on my machine with 40GB of VRAM, with [height, width] set to [448, 320] as you mentioned, but it still reports a CUDA out-of-memory error. I checked the code, and it already enables gradients only for the temporal layers. So how were you able to fine-tune the temporal layers within 23GB of VRAM?

Thank you for your patience and help.

yunniw2001 commented 5 months ago

Ah, I got it. Now I can fine-tune SVD on one GPU with about 31GB of VRAM. Thanks for your code and information. All the best!

chenbinghui1 commented 5 months ago

@yunniw2001 Hi, what did you do to get it down to 31GB? I tested with 512x320 images and GPU memory usage was about 52GB.

yunniw2001 commented 5 months ago

I modified the code and added LoRA, then froze all the UNet parameters except the LoRA weights. This significantly reduces VRAM demand, but the results are much worse than fine-tuning the original attention weights. Also, during training, after calling the VAE and other modules I unload them to the CPU to save VRAM. I still highly recommend using a GPU with more VRAM; all my modifications were forced by the limits of my server.
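
A minimal sketch of that approach, assuming PEFT is used alongside diffusers; `unet`, `vae`, `image_encoder`, and `pixel_values` are placeholders rather than the repo's actual variable names:

```python
# Sketch only, not the repo's code: LoRA on the UNet attention projections,
# base weights frozen, frozen encoders pushed back to CPU after use.
import torch
from peft import LoraConfig, get_peft_model

# get_peft_model freezes every base weight; only the LoRA weights stay trainable.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
unet = get_peft_model(unet, lora_config)

trainable_params = [p for p in unet.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)

# Inside the training loop: run the frozen encoders once, then move them back
# to the CPU so they don't occupy VRAM while the UNet is being optimized.
latents = vae.encode(pixel_values.to(vae.device)).latent_dist.sample()
vae.to("cpu")
image_encoder.to("cpu")
torch.cuda.empty_cache()
```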

yunniw2001 commented 5 months ago

Ah, also I load the UNet weights in float16 and keep only the LoRA weights in float32, which further reduces VRAM demand. There may be a trade-off in fine-tuning quality, I'm not sure.
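
Continuing the sketch above, the dtype split could look like this (again an assumed pattern, not the repo's code):

```python
import torch

# Keep the frozen base UNet in float16 and upcast only the trainable LoRA
# weights back to float32, so gradients are computed in fp32 for the parts
# actually being trained.
unet.to(dtype=torch.float16)
for name, param in unet.named_parameters():
    if param.requires_grad:  # only the LoRA weights require grad at this point
        param.data = param.data.to(torch.float32)
```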

chenbinghui1 commented 5 months ago

@yunniw2001 Thanks for your quick reply. Yes, using LoRA does indeed save GPU memory, but I actually want to train the temporal modules, and I found that 80GB is not enough. I also tried moving the UNet to fp16, but the following error occurred:

```
Steps:   0%|          | 0/200000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/workspace/Diffusion/svd/ashui_train_svd.py", line 1214, in <module>
    main()
  File "/mnt/workspace/Diffusion/svd/ashui_train_svd.py", line 1068, in main
    optimizer.step()
  File "/usr/local/lib/python3.8/dist-packages/accelerate/optimizer.py", line 133, in step
    self.scaler.step(self.optimizer, closure)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 410, in step
    self.unscale_(optimizer)
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/usr/local/lib/python3.8/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
```

chenbinghui1 commented 5 months ago

@yunniw2001 Ignore the error above; in my case the forward pass alone already needs much more than 80GB of GPU memory. So I guess svd_xt was probably trained on more powerful hardware.

rickakkerman commented 4 months ago

> I modified the code and added LoRA, then froze all the UNet parameters except the LoRA weights. This significantly reduces VRAM demand, but the results are much worse than fine-tuning the original attention weights. Also, during training, after calling the VAE and other modules I unload them to the CPU to save VRAM. I still highly recommend using a GPU with more VRAM; all my modifications were forced by the limits of my server.

@yunniw2001 Hey! Would you be able to share the code in which you use LoRA for fine-tuning?

christopher-beckham commented 1 week ago

I'm not sure why the UNet is being stored in float32; it seems that line is commented out in the training script.

I can't speak to training stability if you do this, but at least to make training feasible you should cast the entire UNet to torch.float16 and re-cast the trainable parameters to torch.float32. A fine-tuning script that requires an 80GB GPU is quite overkill, IMO.
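
A rough sketch of that suggestion, where the `"temporal"` name filter is an assumption and should be adjusted to the actual module names. Keeping the trainable parameters in fp32 also avoids the `Attempting to unscale FP16 gradients.` error from GradScaler above:

```python
import torch

# Cast the whole UNet to fp16, then re-cast whatever you intend to train
# back to fp32 and mark only those parameters as trainable.
unet.to(dtype=torch.float16)

trainable_params = []
for name, param in unet.named_parameters():
    if "temporal" in name:  # assumed filter for the temporal blocks
        param.requires_grad_(True)
        param.data = param.data.to(torch.float32)
        trainable_params.append(param)
    else:
        param.requires_grad_(False)

optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)
```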

christopher-beckham commented 1 week ago

Another thing: the default number of frames used in the script is 25. Based on my tests, generating a 14-frame video from the base model stabilityai/stable-video-diffusion-img2vid-xt works just fine, so maybe fine-tuning on 14 frames is more reasonable?
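
For example, a quick 14-frame generation check with the standard diffusers pipeline (the conditioning image path is a placeholder):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

# "example.png" is a placeholder conditioning image.
image = load_image("example.png")
frames = pipe(image, num_frames=14, decode_chunk_size=4).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```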

danielvegamyhre commented 5 days ago

> Another thing: the default number of frames used in the script is 25. Based on my tests, generating a 14-frame video from the base model stabilityai/stable-video-diffusion-img2vid-xt works just fine, so maybe fine-tuning on 14 frames is more reasonable?

@christopher-beckham Can you share more details about the configuration you used to successfully fine-tune on 14 frames? How much GPU VRAM did it need? Did you cast the UNet module to fp16 and then recast the trainable params as fp32?