kohya-ss / sd-scripts


"mixed_precision fp16" not working for Flux #1707

Open vneznaikin opened 5 days ago

vneznaikin commented 5 days ago

Why doesn't mixed_precision fp16 work for Flux? In SDXL it was possible to train in fp16 and the result seemed quite normal (probably; I didn't compare).

If I specify mixed_precision fp16 for Flux, training runs normally and is about 5 times faster than with bf16, but the resulting LoRA turns out to be "empty": it makes no changes when applied.

By the way, how reasonable is it to train in fp16 at all? Do the gains in speed and resource usage justify the loss in quality?
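(A quick way to confirm the "empty" LoRA observation, assuming the adapter was saved as a .safetensors file, is to inspect its tensors directly. This is just a diagnostic sketch with an example file name, not part of sd-scripts:)

```python
# Diagnostic sketch: check whether a saved LoRA is "empty" (all zeros) or
# corrupted (NaN/Inf). The file name below is an example, not a real output path.
import torch
from safetensors.torch import load_file

state_dict = load_file("my_flux_lora.safetensors")

nan_keys, zero_keys = [], []
for key, tensor in state_dict.items():
    t = tensor.float()
    if not torch.isfinite(t).all():
        nan_keys.append(key)
    elif t.abs().max() == 0:
        zero_keys.append(key)

print(f"{len(state_dict)} tensors total")
print(f"{len(nan_keys)} tensors contain NaN/Inf")
print(f"{len(zero_keys)} tensors are all zeros")
```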

kohya-ss commented 4 days ago

In my environment, bf16 training is almost the same speed as fp16 training, so there may be some environment-dependent issue.

FLUX.1 appears to have been originally trained in bf16, so I suspect that training with fp16 can easily cause parameter values to overflow. Training with fp16 is not recommended.
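(For reference, the two formats trade range for precision in opposite directions: bf16 keeps the same exponent range as fp32, while fp16 tops out at 65504. A small PyTorch check, independent of sd-scripts:)

```python
import torch

# fp16 has a much narrower range than bf16 (which shares fp32's exponent range).
for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny)

# A value that bf16 still represents (with reduced precision) overflows fp16 to inf.
x = torch.tensor(70000.0)
print(x.to(torch.bfloat16))  # finite, roughly 70144
print(x.to(torch.float16))   # inf, since fp16 max is 65504
```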

vneznaikin commented 4 days ago

First, I should clarify that I used the lowest recommended settings (for up to 1GB of VRAM usage) with the adafactor optimizer.

"fp16 may easily cause overflow of the parameters" is a bit of a vague statement. Adafactor itself has adaptive gradient scaling and normalization, which should keep exploding and vanishing gradients to a minimum. Coupled with the avr_loss=nan error, we need to check whether this is a characteristic of Flux rather than a bug. (I think it would be useful to know the reasons for sure, but I have no ideas either.)
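(To narrow down where the NaN first appears, a generic PyTorch forward hook can be attached to the model. This is a hypothetical debugging helper, not something sd-scripts provides:)

```python
# Hypothetical debugging helper: report modules whose outputs contain NaN/Inf
# during the forward pass, to locate where non-finite values first appear.
import torch
import torch.nn as nn

def attach_nonfinite_hooks(model: nn.Module):
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t.float()).all():
                    print(f"non-finite output in module: {name}")
                    break
        return hook
    for name, module in model.named_modules():
        if name:  # skip the root module itself
            module.register_forward_hook(make_hook(name))

# Toy demonstration: squaring 300 overflows fp16 (max 65504) to inf.
class Square(nn.Module):
    def forward(self, x):
        return x * x

toy = nn.Sequential(Square(), Square())
attach_nonfinite_hooks(toy)
_ = toy(torch.tensor([300.0], dtype=torch.float16))
```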

kohya-ss commented 3 days ago

There are still many unknowns regarding the training of FLUX.1. Our understanding is limited because official training scripts and technical papers have not been released. Within this limited scope, we can consider the following:

  1. Regarding the issue with fp16 training, it's possible that overflows or underflows are likely to occur during the loss function calculation process, which may be causing the NaN values.

  2. This problem might be of a nature that cannot be resolved by optimizers such as AdaFactor. While optimizers improve training stability, they cannot overcome fundamental limitations in numerical representation.

  3. The fact that training with bf16 doesn't cause issues also supports this hypothesis.

Therefore, at this point, we believe this phenomenon is likely due to the characteristics of FLUX.1 rather than a bug.
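(A minimal illustration of point 2 above: once the forward pass overflows in fp16, the loss and gradients are already non-finite, so nothing the optimizer does afterwards, Adafactor's clipping included, can recover a meaningful update. Loss scaling as commonly used with fp16 mixed precision only guards against gradient underflow, not forward-pass overflow. This is a toy example, not the FLUX.1 loss computation:)

```python
import torch

# Toy example: an fp16 forward pass that overflows before the optimizer is involved.
param = torch.tensor([300.0], dtype=torch.float16, requires_grad=True)
pred = param * param                 # 300 * 300 = 90000 > 65504 -> inf in fp16
loss = (pred - 1.0) * (pred - 1.0)   # inf propagates into the loss
loss.backward()

print("loss:", loss.item())   # inf
print("grad:", param.grad)    # non-finite; clipping or normalizing it cannot restore information
```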

There are still many uncertainties regarding the differences in training speed. If possible, could you provide information about your training environment, particularly the type of GPU and RAM capacity?