Based on my testing, if you use their default temporal-layer-only fine-tuning method with 320x448 (no CFG), it only needs approximately 23GB of memory. Thanks!
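For context, "temporal-layer-only" fine-tuning boils down to freezing everything except the temporal modules. A minimal sketch, assuming `unet` is the loaded SVD UNet and that its temporal modules can be selected by name (the name match and learning rate here are illustrative, not from the repo's script):

```python
import torch

# Freeze the whole UNet, then re-enable gradients only for the
# temporal modules. Matching on "temporal" in the parameter name is
# a heuristic -- check the actual parameter names of your SVD UNet.
unet.requires_grad_(False)
for name, param in unet.named_parameters():
    if "temporal" in name:
        param.requires_grad_(True)

# Only the temporal parameters go to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in unet.parameters() if p.requires_grad], lr=1e-5
)
```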
Hi, many thanks for your reply. I tested the code on my machine with 40GB of VRAM with [height, width] set to [448, 320] as you mentioned, but it still reports a CUDA out-of-memory error. I checked the code, and it only sets requires_grad on the temporal layers. So how were you able to fine-tune the temporal layers in 23GB of VRAM?
Thank you for your patience and help.
Ah, I got it. Now I can fine-tune SVD on one GPU with about 31GB of VRAM. Thanks for your code and information. All the best!
@yunniw2001 Hi, what did you do to get it down to 31GB? I tested with 512x320 images and the GPU cost is about 52GB.
I modified the code and added LoRA, then froze all the UNet parameters except the LoRA weights. This significantly reduces VRAM demand, but the performance is much worse than fine-tuning the original attention weights. Also, during training, after calling the VAE and other modules, I unload them to the CPU to save VRAM. But I highly recommend using a GPU with larger VRAM; all my modifications are due to the limits of my server...
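Roughly, the kind of change I mean (a minimal sketch, assuming a diffusers-style `unet`/`vae` and the peft library; the target module names and rank are illustrative, not my exact values):

```python
import torch
from peft import LoraConfig

# Freeze every base parameter, then inject trainable LoRA layers
# into the attention projections (the usual diffusers projection
# names; adjust for your checkpoint).
unet.requires_grad_(False)
unet.add_adapter(LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
))

# During a training step: encode the conditioning with the VAE,
# then move it back to the CPU so it doesn't occupy VRAM while the
# UNet runs.
vae.to("cuda")
latents = vae.encode(pixel_values).latent_dist.sample()
vae.to("cpu")
torch.cuda.empty_cache()
```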
Ah, I also load the UNet weights in float16 and only set the LoRA weights to float32; this also reduces VRAM demand. It may be a trade-off in fine-tuning quality, I'm not sure.
@yunniw2001 Thanks for your quick reply. Yes, using LoRA indeed saves GPU memory, but I actually want to train the temporal modules, and I found that 80GB is not enough. I also tried moving the UNet to fp16, but the following error occurred:
```
Steps:   0%|          | 0/200000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/mnt/workspace/Diffusion/svd/ashui_train_svd.py", line 1214, in <module>
```
@yunniw2001 Ignore the error above; the forward pass also requires much more than 80GB of GPU memory in my case. So I guess svd_xt was trained on more powerful hardware.
> I modified the code and added LoRA, then froze all the UNet parameters except the LoRA weights. This significantly reduces VRAM demand, but the performance is much worse than fine-tuning the original attention weights. [...]
@yunniw2001 Hey! Would you be able to share the code in which you use LoRA for fine-tuning?
I'm not sure why the unet is being stored in `float32`; it seems like that line is commented out in the training script. I can't speak to training stability if you do this, but at least in terms of making training feasible, you should cast the entire unet to `torch.float16` and re-cast the trainable parameters to `torch.float32`. A fine-tuning script which requires an 80GB GPU is quite overkill, IMO.
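In code, something like this (a sketch; it assumes the script has already set `requires_grad` on the layers you want to train):

```python
import torch

# Cast the entire unet down to float16...
unet.to(dtype=torch.float16)

# ...then re-cast only the trainable parameters back to float32 so
# the optimizer updates happen in full precision.
for param in unet.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)
```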
Another thing is that the default number of frames used in the script is 25. Based on my tests, generating a 14-frame video from the base model `stabilityai/stable-video-diffusion-img2vid-xt` works just fine, so maybe fine-tuning on 14 frames is more reasonable?
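For reference, a 14-frame sanity check with the base model might look like this (a minimal diffusers sketch; the conditioning image path is a placeholder):

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Placeholder conditioning frame, resized to the model's training
# resolution.
image = load_image("conditioning_frame.png").resize((1024, 576))

# Ask for 14 frames instead of the checkpoint's default 25.
frames = pipe(image, num_frames=14, decode_chunk_size=4).frames[0]
```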
> Another thing is that the default number of frames used in the script is 25. Based on my tests, generating a 14-frame video from the base model `stabilityai/stable-video-diffusion-img2vid-xt` works just fine, so maybe fine-tuning on 14 frames is more reasonable?
@christopher-beckham Can you share more details about the configuration you used to successfully fine-tune on 14 frames? How much GPU VRAM? Did you cast the unet module to fp16 and then recast the trainable params as fp32?
Hi, your code is very useful!
I'm also wondering what GPU setup you used when trying this. Does it require 80GB of GPU memory, or something smaller?
Thanks a lot!