I anticipate there will be a lot of demand to train (and run inference with) the new open SOTA image model "Flux".
It's the top model on Hugging Face right now. It's a 12B-parameter diffusion transformer, which means it's too big to train on a consumer GPU without quantization, and it's very slow as-is. The image model community hasn't done QLoRA training before, because models haven't been this big.
I appreciate that image models are a little different, but essentially the rest of the training-loop inputs can be cached or adapted easily, so the important part is reducing the memory use and increasing the performance of the 12B transformer model in the following HuggingFace diffusers code:
Model code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_flux.py
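To make the ask concrete, here is a minimal sketch of what QLoRA-style fine-tuning of that transformer could look like, assuming diffusers' BitsAndBytesConfig 4-bit loading and PEFT's LoRA adapters attached via add_adapter; the checkpoint name, rank, and target modules below are illustrative rather than a tested recipe:

```python
# Rough sketch of QLoRA-style training of the Flux transformer.
# Assumes diffusers exposes BitsAndBytesConfig quantization and that the
# model supports PEFT adapters; names and hyperparameters are illustrative.
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
from peft import LoraConfig

model_id = "black-forest-labs/FLUX.1-dev"  # assumed checkpoint

# Load the 12B transformer with NF4 4-bit weights so the frozen base fits
# in consumer-GPU memory; compute happens in bfloat16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)

# Freeze the quantized base weights and train only small LoRA adapters on the
# attention projections (module names follow diffusers' Flux LoRA scripts).
transformer.requires_grad_(False)
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)
transformer.add_adapter(lora_config)
transformer.enable_gradient_checkpointing()  # trade compute for activation memory

# From here a normal denoising training loop applies; only the LoRA
# parameters carry gradients and optimizer state.
trainable_params = [p for p in transformer.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
```

The point of the sketch is that with the base weights in NF4 and gradient checkpointing enabled, only the small LoRA matrices need gradients and optimizer state, which is what would bring training within reach of a single consumer GPU.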
Would it be possible to take a look at this?