Open Hibiki82 opened 2 weeks ago
This seems to be a cuDNN/CUDA/PyTorch issue: https://github.com/huggingface/diffusers/issues/9704
Hi, when training LoRA and full fine-tuning FLUX.1 models, which branch should I be using? I tried to follow the documentation on training with 24GB VRAM, but I can't get it to work. I can share my JSON and TOML configs for inspection. As for the cuDNN issue, could it be because I'm training inside a Docker container built from my own image?
batch size per device / バッチサイズ: 4
I can use a batch size of 6-7 with 512px images, and 2 with 1024px.
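For reference, the batch size goes in the dataset TOML, and as far as I know each [[datasets]] block gets its own resolution and batch_size. A minimal sketch with placeholder paths and values, not my actual config:

[general]
enable_bucket = true
caption_extension = ".txt"

[[datasets]]
resolution = 512   # 512px buckets
batch_size = 6     # what fits for me on 24GB at 512px; I drop to 2 at 1024px

  [[datasets.subsets]]
  image_dir = "/path/to/train/images"
  num_repeats = 1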
Are you on the old branch? Use sd3 now, or dev:
The development version is in the dev branch. Please check the dev branch for the latest changes. FLUX.1 and SD3/SD3.5 support is done in the sd3 branch. If you want to train them, please use the sd3 branch.
It looks to me like your PATH is missing the cuDNN and CUDA directories. Did the installation script not create it? I had some problems too and am now running on bare Arch, where everything works as intended. Make sure you have accelerate configured and check that you have PyTorch >= 2.5.1! Use a Docker image that includes cuda-dev or something similar if possible.
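Roughly what I would check inside the container first (the exact commands and the cu124 wheel index are just an example of how I'd do it, adjust to your CUDA version):

# Check what the venv actually has
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"
# Bring PyTorch up to 2.5.1+ (cu124 wheel index is an assumption, match it to your CUDA)
pip install --upgrade "torch>=2.5.1" torchvision --index-url https://download.pytorch.org/whl/cu124
# Re-run the interactive accelerate setup (single GPU, no distributed, bf16 mixed precision)
accelerate config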
Paste your config if you want more help, but for memory you need 64GB of RAM dedicated to the fine-tune! I had to upgrade my system to 128GB; last time, fine-tuning with 24GB VRAM used 68GB of RAM. You should have these flags (there's a rough launch sketch after the list):
--blocks_to_swap 8 \
--blockwise_fused_optimizers \
--mem_eff_attn \
--cpu_offload_checkpointing \
--mem_eff_save \
--min_snr_gamma 5 \
--gradient_checkpointing \
--highvram \
--xformers \
--sdpa \
--persistent_data_loader_workers \
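Putting the flags into a full command, my launch looks roughly like this (a sketch, not my exact config: model paths, optimizer, learning rate and epoch count are placeholders, and as far as I know you pick either --sdpa or --xformers, not both):

accelerate launch sd-scripts/flux_train.py \
  --pretrained_model_name_or_path /path/to/flux1-dev.safetensors \
  --clip_l /path/to/clip_l.safetensors \
  --t5xxl /path/to/t5xxl_fp16.safetensors \
  --ae /path/to/ae.safetensors \
  --dataset_config /path/to/dataset.toml \
  --output_dir /path/to/output --save_model_as safetensors \
  --mixed_precision bf16 --save_precision bf16 \
  --optimizer_type adafactor --learning_rate 1e-5 \
  --max_train_epochs 10 \
  --blocks_to_swap 8 \
  --blockwise_fused_optimizers \
  --cpu_offload_checkpointing \
  --mem_eff_save \
  --gradient_checkpointing \
  --sdpa \
  --persistent_data_loader_workers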
Hi johnr14, I'm using -b sd3 with kohya_ss as the web UI. I was using PyTorch 2.4.1+cu124, I think, with accelerate configured for fp16.
Environment: Docker container
System RAM: 64GB (I think if I break my dataset into several sections of training, this should be enough)
VRAM: 24GB
For some reason it kept saying there was not enough VRAM.
I'll try updating PyTorch and rebuilding everything. Between fp16 and bf16, which precision is preferred for fine-tuning FLUX models?
bf16 is the preferred precision for FLUX.1 models, because the original checkpoint from BFL is in bf16.
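In a training command that usually means something like the lines below (flag names from sd-scripts; --full_bf16 is optional and trades a little precision for memory, so treat it as something to test rather than a requirement):

  --mixed_precision bf16 \
  --save_precision bf16 \
  --full_bf16

and picking bf16 when accelerate config asks for mixed precision.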
My GPU is an RTX 4090 with 64GB of system RAM, running in a Docker container on a Linux server host.
NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6
#################################################################################################
My training error for LoRA is shown below:
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-11-07 05:54:17.302121: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-07 05:54:17.302153: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-07 05:54:17.302982: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-07 05:54:17.306710: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-07 05:54:17.846351: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/workspace/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
2024-11-07 05:54:19 INFO Loading settings from /workspace/kohya_ss/outputs/config_lora-20241107-055412.toml... train_util.py:4435
INFO /workspace/kohya_ss/outputs/config_lora-20241107-055412 train_util.py:4454
INFO highvram is enabled / highvramが有効です train_util.py:4106
WARNING cache_latents_to_disk is enabled, so cache_latents is also enabled / cache_latents_to_diskが有効なため、cache_latentsを有効にします train_util.py:4123
2024-11-07 05:54:19 INFO Checking the state dict: Diffusers or BFL, dev or schnell flux_utils.py:62
INFO t5xxl_max_token_length: 512 flux_train_network.py:152
/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
FLUX: Gradient checkpointing enabled. CPU offload: True
prepare optimizer, data loader etc.
INFO Text Encoder 1 (CLIP-L): 72 modules, LR 7.5e-06 lora_flux.py:1018
INFO use 8-bit AdamW optimizer | {} train_util.py:4589
override steps. steps for 20 epochs is / 指定エポックまでのステップ数: 40480
enable fp8 training for U-Net.
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 8015
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 2024
num epochs / epoch数: 20
batch size per device / バッチサイズ: 4
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 40480
steps: 0%| | 0/40480 [00:00<?, ?it/s]
2024-11-07 05:59:26 INFO unet dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1084
INFO text_encoder [0] dtype: torch.bfloat16, device: cuda:0 train_network.py:1090
INFO text_encoder [1] dtype: torch.bfloat16, device: cuda:0 train_network.py:1090
epoch 1/20
INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:715
INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:715
Traceback (most recent call last):
  File "/workspace/kohya_ss/sd-scripts/flux_train_network.py", line 564, in <module>
    trainer.train(args)
  File "/workspace/kohya_ss/sd-scripts/train_network.py", line 1159, in train
    encoded_text_encoder_conds = text_encoding_strategy.encode_tokens(
  File "/workspace/kohya_ss/sd-scripts/library/strategy_flux.py", line 68, in encode_tokens
    l_pooled = clip_l(l_tokens.to(clip_l.device))["pooler_output"]
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 986, in forward
    return self.text_model(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 890, in forward
    encoder_outputs = self.encoder(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 805, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
    return disable_fn(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
    return fn(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 264, in forward
    outputs = run_function(*args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 548, in forward
    hidden_states, attn_weights = self.self_attn(
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 480, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: cuDNN Frontend error: [cudnn_frontend] Error: No execution plans support the graph.
steps: 0%| | 0/40480 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/workspace/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/workspace/kohya_ss/venv/bin/python3', '/workspace/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', '/workspace/kohya_ss/outputs/config_lora-20241107-055908.toml']' returned non-zero exit status 1.
05:59:29-603512 INFO Training has ended.
#################################################################################################
This kept showing up, and I tried reinstalling and clearing cuDNN and TensorRT.
2024-11-07 05:49:53.805929: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-07 05:49:53.805964: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-07 05:49:53.806946: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-07 05:49:53.810978: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-07 05:49:54.332581: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
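Those "Unable to register cuDNN/cuFFT/cuBLAS factory" lines come from TensorFlow being imported inside the venv; as far as I can tell they are only warnings and not what stops training. The actual failure is the cuDNN SDPA error in the traceback above, which matches the PyTorch/cuDNN issue linked at the top. Besides upgrading to PyTorch 2.5.1+, one workaround worth testing (a sketch, under the assumption that the cuDNN attention backend is the culprit and that your PyTorch build exposes this switch) is to disable the cuDNN SDPA backend before training, e.g. by adding this near the top of flux_train_network.py:

import torch

# Ask PyTorch not to use the cuDNN backend for scaled_dot_product_attention;
# the flash, memory-efficient and math backends remain available as fallbacks.
if torch.cuda.is_available():
    torch.backends.cuda.enable_cudnn_sdp(False)

If the error disappears with that switch, it points at the cuDNN SDPA path rather than at your Docker setup or the TOML config.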