Open · setothegreat opened this issue 2 months ago
The model parameters need to be stored in VRAM in bf16, which takes about 22GB (block swap is implemented to reduce that). Therefore, training only some of the layers will not help much in reducing VRAM usage.
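(For what it's worth, that figure lines up with FLUX.1 being roughly a 12B-parameter model: 12 × 10⁹ parameters × 2 bytes per parameter in bf16 ≈ 24 GB, i.e. about 22 GiB.)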
It would help with storage requirements, which would lower hardware requirements generally. If I can get LoRAs at sub-50MB I'd be ecstatic. I have so many, and I would love to keep training. Please add this functionality!
For LoRA, this has already been implemented: https://github.com/kohya-ss/sd-scripts/tree/sd3?tab=readme-ov-file#specify-blocks-to-train-in-flux1-lora-training
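If I'm reading that README section correctly, the relevant keys are `train_double_block_indices` / `train_single_block_indices` passed through `--network_args`; the option names and the example indices below are from memory, so treat this as a sketch rather than exact syntax:

```
--network_args "train_double_block_indices=0,1,8-12" "train_single_block_indices=3,10,20-25"
```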
Since it appears that Flux LoRA training can still be effective when only specific layers are trained, I'm wondering whether this functionality could be expanded to finetuning, since that is where the biggest roadblocks around speed and hardware currently lie. Rather than being limited to Adafactor and dozens of hours per training run, being able to specify a subset of layers to train seems like it should lower hardware requirements, allowing the use of potentially more efficient optimizers on consumer-grade hardware, and could bring training time down by an order of magnitude.
Is there some architecture-level roadblock I'm not aware of that prevents training specific layers during a full finetune, but doesn't apply when training a LoRA?
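In case it helps frame the request: in plain PyTorch this would amount to freezing most parameters and handing the optimizer only the trainable subset, so the savings would come from the gradients and optimizer state of the frozen blocks rather than from the bf16 weights themselves (which is the point made above about VRAM). Here is a minimal sketch; the `double_blocks` / `single_blocks` attribute names and the index choices are made up for illustration, not taken from sd-scripts:

```python
import torch

def freeze_all_but(model: torch.nn.Module, trainable_double=(), trainable_single=()):
    """Freeze every parameter, then re-enable gradients only for the chosen blocks."""
    # Frozen weights still occupy VRAM (bf16), but they need no gradients
    # and no optimizer state.
    for p in model.parameters():
        p.requires_grad_(False)
    for i in trainable_double:
        for p in model.double_blocks[i].parameters():   # hypothetical attribute name
            p.requires_grad_(True)
    for i in trainable_single:
        for p in model.single_blocks[i].parameters():   # hypothetical attribute name
            p.requires_grad_(True)
    return [p for p in model.parameters() if p.requires_grad]

# Example usage (hypothetical model object):
# trainable = freeze_all_but(flux_model, trainable_double=range(4, 8))
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # only these params get grads + optimizer state
```

With AdamW, every trained parameter carries two extra fp32 state tensors, so freezing most of a ~12B-parameter model removes the bulk of that state (plus the gradients), even though the frozen bf16 weights still have to be resident or block-swapped.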