Closed JEFFSEVENTHSENSE closed 1 month ago
The error seems to be related to the Text Encoder. Could you please confirm that the CLIP-L checkpoint (clip_l.safetensors) is not of type fp8?
Yes, the safetensors file should not be of type fp8. It was downloaded from here: https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/clip_l.safetensors
I printed the dtype of some of the parameters, and here is what I get for both text encoders.
Parameter: text_model.encoder.layers.10.layer_norm1.weight, dtype: torch.float32
dtype: torch.float32 for the CLIP-L
Parameter: text_model.encoder.layers.0.self_attn.k_proj.weight, dtype: torch.float32
dtype: torch.float32 for the t5xxl_fp16
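For anyone who wants to confirm the stored dtypes without building the model at all, here is a minimal sketch that reads them straight from the checkpoint header. It relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The helper names read_safetensors_dtypes and summarize_dtypes are made up for this example and are not part of sd-scripts or the safetensors library.

```python
import json
import struct
from collections import Counter


def read_safetensors_dtypes(path):
    """Return {tensor_name: dtype_string} read from a .safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit size of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {
        name: info["dtype"]
        for name, info in header.items()
        if name != "__metadata__"
    }


def summarize_dtypes(path):
    """Count how many tensors use each stored dtype (e.g. F32, F16, BF16)."""
    return Counter(read_safetensors_dtypes(path).values())
```

Calling summarize_dtypes("sd3/clip_l.safetensors") would give a count per dtype string. An fp8 checkpoint would show entries such as F8_E4M3 or F8_E5M2 instead of F32, F16, or BF16.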
Could it be a torch issue, such as an older version causing the weights to be treated as meta tensors?
I fixed the issue. I realised that the CLIP and T5 models are initialised on the meta device, which is caused by this code: with init_empty_weights(): clip = CLIPTextModel._from_config(config)
So apparently, with older torch, the CLIP model is not moved off the meta device to cpu/cuda after this initialisation.
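The meta-device behaviour described above can be reproduced in isolation. The following is a minimal sketch on a toy nn.Linear, not the actual sd-scripts loading code; the helper params_on_meta is a hypothetical name introduced here. It shows that calling .to() on a module whose parameters are still on the meta device raises the same NotImplementedError as in the traceback, and that the parameters need real storage before they can hold weights.

```python
import torch
from torch import nn


def params_on_meta(module):
    """List the names of parameters that still live on the meta device."""
    return [name for name, p in module.named_parameters() if p.is_meta]


# A toy layer created directly on the meta device: it has shapes but no data.
layer = nn.Linear(4, 4, device="meta")
meta_before = params_on_meta(layer)

try:
    layer.to("cpu")  # same failure mode as text_encoders[0].to(...) in the traceback
    moved = True
except NotImplementedError as err:
    moved = False
    print("move failed:", err)

# to_empty() allocates uninitialised storage on the target device;
# the real weights are then copied in from a state dict.
layer = layer.to_empty(device="cpu")
layer.load_state_dict({"weight": torch.zeros(4, 4), "bias": torch.zeros(4)})
meta_after = params_on_meta(layer)
```

A check like params_on_meta(text_encoder) right after loading is one way to see whether the state dict actually replaced the placeholder parameters before the model is moved to the GPU.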
I am now facing a different error: RuntimeError: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
Resolved everything. An older PyTorch release such as 1.13.1 is not suitable for the current training scripts, because they use functions that are only available in newer PyTorch versions.
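A quick way to catch this class of problem early is to compare the installed version against a minimum before launching training. The sketch below is only an illustration: the threshold MIN_TORCH = (2, 0) is an assumption made for this example, not a requirement stated in this thread, and the helper names are hypothetical.

```python
import re

# Assumed threshold for illustration only; check the training scripts'
# own requirements for the version they actually expect.
MIN_TORCH = (2, 0)


def parse_version(version):
    """Turn '1.13.1+cu117' into (1, 13, 1), ignoring any local suffix."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def is_supported(version, minimum=MIN_TORCH):
    """True if the version is at least the minimum (compared numerically)."""
    return parse_version(version)[: len(minimum)] >= minimum
```

In practice the version string would come from torch.__version__, for example is_supported(torch.__version__). With the assumed threshold, 1.13.1 would be rejected.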
#!/bin/bash
CUDA_VISIBLE_DEVICES=7 accelerate launch \
  --mixed_precision bf16 \
  --num_cpu_threads_per_process 1 \
  flux_train_network.py \
  --pretrained_model_name_or_path flux1-schnell.safetensors \
  --clip_l sd3/clip_l.safetensors \
  --t5xxl sd3/t5xxl_fp16.safetensors \
  --ae ae.safetensors \
  --save_model_as safetensors \
  --sdpa \
  --persistent_data_loader_workers \
  --max_data_loader_n_workers 2 \
  --seed 42 \
  --gradient_checkpointing \
  --mixed_precision bf16 \
  --save_precision bf16 \
  --network_module networks.lora_flux \
  --network_dim 4 \
  --optimizer_type adamw8bit \
  --learning_rate 1e-4 \
  --highvram \
  --max_train_epochs 4 \
  --save_every_n_epochs 1 \
  --dataset_config dataset_1024_bs2.toml \
  --output_dir /home/dluser/development/Jeff/LoRA \
  --output_name flux-lora-jeff \
  --timestep_sampling shift \
  --discrete_flow_shift 3.1582 \
  --model_prediction_type raw \
  --guidance_scale 1.0 \
  --network_train_unet_only
Script to run is above
Error is below
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 56959.68it/s]
INFO prepare dataset train_util.py:917
INFO preparing accelerator train_network.py:345
2024-10-08 10:35:57 WARNING Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. other.py:349
flux_utils.py:93
2024-10-08 10:37:12 INFO Building CLIP flux_utils.py:113
INFO Loading state dict from sd3/clip_l.safetensors flux_utils.py:206
INFO Loaded CLIP: flux_utils.py:209
INFO Loading state dict from sd3/t5xxl_fp16.safetensors flux_utils.py:254
INFO Loaded T5xxl: flux_utils.py:257
INFO Building AutoEncoder flux_utils.py:100
INFO Loading state dict from ae.safetensors flux_utils.py:105
INFO Loaded AE: flux_utils.py:108
import network module: networks.lora_flux
Traceback (most recent call last):
File "/home/dluser/development/Jeff/sd-scripts/flux_train_network.py", line 519, in <module>
trainer.train(args)
File "/home/dluser/development/Jeff/sd-scripts/train_network.py", line 402, in train
self.cache_text_encoder_outputs_if_needed(args, accelerator, unet, vae, text_encoders, train_dataset_group, weight_dtype)
File "/home/dluser/development/Jeff/sd-scripts/flux_train_network.py", line 263, in cache_text_encoder_outputs_if_needed
text_encoders[0].to(accelerator.device, dtype=weight_dtype)
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2883, in to
return super().to(*args, **kwargs)
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
return self._apply(convert)
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
module._apply(fn)
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
param_applied = fn(param)
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
Traceback (most recent call last):
File "/home/dluser/.virtualenvs/flux/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
simple_launcher(args)
File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
accelerator device: cuda
INFO Checking the state dict: Diffusers or BFL, dev or schnell flux_utils.py:48
INFO Building Flux model schnell from BFL checkpoint flux_utils.py:74
2024-10-08 10:37:07 INFO Loading state dict from flux1-schnell.safetensors flux_utils.py:81
2024-10-08 10:37:11 INFO Loaded Flux: