kohya-ss / sd-scripts


Issue when training a Flux.1 LoRA and fine-tuning the pretrained model #1767

Open Hibiki82 opened 5 days ago

Hibiki82 commented 5 days ago

My GPU is an RTX 4090 with 64GB of system RAM, running in a Docker container on a Linux server host.

NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6

  1. I installed with git clone -b sd3-flux.1 https://github.com/bmaltais/kohya_ss.git (I tried multiple branches).
  2. Mixed precision is bf16, and I tried to train with t5-xxl in both fp16 and fp8 (failed).
  3. clip_l is fp16.
  4. For some reason CPU Offload Checkpointing is always on? (I cannot turn it off?)
  5. I was training with 8500 images at a batch size of 1 and it still reports not enough VRAM. I've tried Adafactor and all the other low-VRAM optimizers.
  6. Around 47 buckets were created for my dataset, at a training resolution of 512x512.

################################################################################################# My training error for LoRA is shown below:

/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-11-07 05:54:17.302121: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-07 05:54:17.302153: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-07 05:54:17.302982: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-07 05:54:17.306710: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-07 05:54:17.846351: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-11-07 05:54:19 INFO Loading settings from /workspace/kohya_ss/outputs/config_lora-20241107-055412.toml... train_util.py:4435
INFO /workspace/kohya_ss/outputs/config_lora-20241107-055412 train_util.py:4454
INFO highvram is enabled / highvramが有効です train_util.py:4106
WARNING cache_latents_to_disk is enabled, so cache_latents is also enabled / cache_latents_to_diskが有効なため、cache_latentsを有効にします train_util.py:4123
2024-11-07 05:54:19 INFO Checking the state dict: Diffusers or BFL, dev or schnell flux_utils.py:62
INFO t5xxl_max_token_length: 512 flux_train_network.py:152
/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565

FLUX: Gradient checkpointing enabled. CPU offload: True
prepare optimizer, data loader etc.
INFO Text Encoder 1 (CLIP-L): 72 modules, LR 7.5e-06 lora_flux.py:1018
INFO use 8-bit AdamW optimizer | {} train_util.py:4589
override steps. steps for 20 epochs is / 指定エポックまでのステップ数: 40480
enable fp8 training for U-Net.
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 8015
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 2024
  num epochs / epoch数: 20
  batch size per device / バッチサイズ: 4
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 40480
steps: 0%| | 0/40480 [00:00<?, ?it/s]
2024-11-07 05:59:26 INFO unet dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1084
INFO text_encoder [0] dtype: torch.bfloat16, device: cuda:0 train_network.py:1090
INFO text_encoder [1] dtype: torch.bfloat16, device: cuda:0 train_network.py:1090

epoch 1/20
INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:715
INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:715
Traceback (most recent call last):
  File "/workspace/kohya_ss/sd-scripts/flux_train_network.py", line 564, in <module>
    trainer.train(args)
  File "/workspace/kohya_ss/sd-scripts/train_network.py", line 1159, in train
    encoded_text_encoder_conds = text_encoding_strategy.encode_tokens(
  File "/workspace/kohya_ss/sd-scripts/library/strategy_flux.py", line 68, in encode_tokens
    l_pooled = clip_l(l_tokens.to(clip_l.device))["pooler_output"]
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 986, in forward
    return self.text_model(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 890, in forward
    encoder_outputs = self.encoder(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 805, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
    return fn(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 264, in forward
    outputs = run_function(*args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 548, in forward
    hidden_states, attn_weights = self.self_attn(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 480, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans support the graph.
steps: 0%| | 0/40480 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/workspace/kohya_ss/venv/bin/python3', '/workspace/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', '/workspace/kohya_ss/outputs/config_lora-20241107-055908.toml']' returned non-zero exit status 1.
05:59:29-603512 INFO Training has ended.
################################################################################################# This kept showing up, and I tried reinstalling and clearing cuDNN and TensorRT.

2024-11-07 05:49:53.805929: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-07 05:49:53.805964: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-07 05:49:53.806946: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-07 05:49:53.810978: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-07 05:49:54.332581: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

kohya-ss commented 5 days ago

This seems to be a cuDNN/CUDA/PyTorch issue: https://github.com/huggingface/diffusers/issues/9704
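
If that is what is happening here, one workaround to try is steering scaled_dot_product_attention away from the cuDNN backend. This is only a sketch, assuming a PyTorch build (2.5+) where the cuDNN SDPA backend exists and is on by default; it is not a confirmed fix for this setup:

```python
# Sketch: disable the cuDNN SDPA backend so PyTorch falls back to the
# flash / memory-efficient / math attention kernels (assumes PyTorch 2.5+).
import torch

# Global switch, e.g. near the top of the training script:
torch.backends.cuda.enable_cudnn_sdp(False)

# Or scoped to a single region via the sdpa_kernel context manager:
from torch.nn.attention import SDPBackend, sdpa_kernel

with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]):
    pass  # run the forward/backward pass that previously failed here
```

Downgrading to a PyTorch build from before the cuDNN SDPA backend was enabled by default is the other obvious direction, if the linked issue applies here.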

Hibiki82 commented 5 days ago

Hi, when training LoRA and fully fine-tuning Flux.1 models, which branch should I be using? I also tried to follow the documentation on training with 24GB VRAM, but I can't get it to work. I can share the JSON and TOML for inspection. As for the cuDNN issue, could it be because I'm training in a Docker container built from my own image?
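
As a first check (a minimal sketch, nothing kohya_ss-specific), printing the exact PyTorch / CUDA / cuDNN versions inside the container's venv and comparing them against a setup where Flux training works would show whether the docker image is what differs:

```python
# Minimal environment check, run inside the container's venv.
import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device")

# cudnn_sdp_enabled() only exists on recent PyTorch builds, so guard the call.
cudnn_sdp = getattr(torch.backends.cuda, "cudnn_sdp_enabled", None)
print("cuDNN SDPA enabled:", cudnn_sdp() if cudnn_sdp else "n/a (older PyTorch)")
```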