kohya-ss / sd-scripts

Having NotImplementedError: Cannot copy out of meta tensor; no data! Issue #1679

Closed. JEFFSEVENTHSENSE closed this issue 1 month ago

JEFFSEVENTHSENSE commented 1 month ago

#!/bin/bash

CUDA_VISIBLE_DEVICES=7 accelerate launch \
  --mixed_precision bf16 \
  --num_cpu_threads_per_process 1 \
  flux_train_network.py \
  --pretrained_model_name_or_path flux1-schnell.safetensors \
  --clip_l sd3/clip_l.safetensors \
  --t5xxl sd3/t5xxl_fp16.safetensors \
  --ae ae.safetensors \
  --save_model_as safetensors \
  --sdpa \
  --persistent_data_loader_workers \
  --max_data_loader_n_workers 2 \
  --seed 42 \
  --gradient_checkpointing \
  --mixed_precision bf16 \
  --save_precision bf16 \
  --network_module networks.lora_flux \
  --network_dim 4 \
  --optimizer_type adamw8bit \
  --learning_rate 1e-4 \
  --highvram \
  --max_train_epochs 4 \
  --save_every_n_epochs 1 \
  --dataset_config dataset_1024_bs2.toml \
  --output_dir /home/dluser/development/Jeff/LoRA \
  --output_name flux-lora-jeff \
  --timestep_sampling shift \
  --discrete_flow_shift 3.1582 \
  --model_prediction_type raw \
  --guidance_scale 1.0 \
  --network_train_unet_only

Script to run is above

Error is below

INFO     [Dataset 0]    config_util.py:576
INFO     loading image sizes.    train_util.py:909
100%|██████████████████████████████| 11/11 [00:00<00:00, 56959.68it/s]
INFO     prepare dataset    train_util.py:917
INFO     preparing accelerator    train_network.py:345
2024-10-08 10:35:57 WARNING  Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.    other.py:349
accelerator device: cuda
INFO     Checking the state dict: Diffusers or BFL, dev or schnell    flux_utils.py:48
INFO     Building Flux model schnell from BFL checkpoint    flux_utils.py:74
2024-10-08 10:37:07 INFO     Loading state dict from flux1-schnell.safetensors    flux_utils.py:81
2024-10-08 10:37:11 INFO     Loaded Flux:    flux_utils.py:93
2024-10-08 10:37:12 INFO     Building CLIP    flux_utils.py:113
INFO     Loading state dict from sd3/clip_l.safetensors    flux_utils.py:206
INFO     Loaded CLIP:    flux_utils.py:209
INFO     Loading state dict from sd3/t5xxl_fp16.safetensors    flux_utils.py:254
INFO     Loaded T5xxl:    flux_utils.py:257
INFO     Building AutoEncoder    flux_utils.py:100
INFO     Loading state dict from ae.safetensors    flux_utils.py:105
INFO     Loaded AE:    flux_utils.py:108
import network module: networks.lora_flux
Traceback (most recent call last):
  File "/home/dluser/development/Jeff/sd-scripts/flux_train_network.py", line 519, in <module>
    trainer.train(args)
  File "/home/dluser/development/Jeff/sd-scripts/train_network.py", line 402, in train
    self.cache_text_encoder_outputs_if_needed(args, accelerator, unet, vae, text_encoders, train_dataset_group, weight_dtype)
  File "/home/dluser/development/Jeff/sd-scripts/flux_train_network.py", line 263, in cache_text_encoder_outputs_if_needed
    text_encoders[0].to(accelerator.device, dtype=weight_dtype)
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2883, in to
    return super().to(*args, **kwargs)
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
Traceback (most recent call last):
  File "/home/dluser/.virtualenvs/flux/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "/home/dluser/.virtualenvs/flux/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

kohya-ss commented 1 month ago

The error seems to be related to the Text Encoder. Could you please confirm that the CLIP-L checkpoint (clip_l.safetensors) is not of type fp8?

JEFFSEVENTHSENSE commented 1 month ago

Yes, the safetensors file should not be fp8. It was downloaded from here: https://huggingface.co/comfyanonymous/flux_text_encoders/blob/main/clip_l.safetensors

I tried printing the dtype of some of the parameters, and here is what I get for both text encoders.

For CLIP-L:
Parameter: text_model.encoder.layers.10.layer_norm1.weight, dtype: torch.float32
dtype: torch.float32

For t5xxl_fp16:
Parameter: text_model.encoder.layers.0.self_attn.k_proj.weight, dtype: torch.float32
dtype: torch.float32
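
For anyone who wants to check a checkpoint for fp8 without loading the model at all, here is a minimal sketch that reads only the safetensors header. It relies on the documented file layout (an 8-byte little-endian header length followed by a JSON header). The helper names `read_safetensors_dtypes`, `summarize_dtypes`, and `has_fp8` are hypothetical and not part of sd-scripts. Note that this reports the dtype stored on disk, which may differ from the dtype a script prints after loading and casting.

```python
import json
import struct
import sys
from collections import Counter


def read_safetensors_dtypes(path):
    """Return {tensor_name: dtype_string} from the header, without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {name: info["dtype"] for name, info in header.items() if name != "__metadata__"}


def summarize_dtypes(path):
    """Count how many tensors use each stored dtype, e.g. Counter({'F16': 196})."""
    return Counter(read_safetensors_dtypes(path).values())


def has_fp8(path):
    """True if any tensor is stored as an fp8 type (F8_E4M3 or F8_E5M2)."""
    return any(dtype.startswith("F8_") for dtype in read_safetensors_dtypes(path).values())


if __name__ == "__main__" and len(sys.argv) > 1:
    checkpoint = sys.argv[1]  # e.g. sd3/clip_l.safetensors
    print(dict(summarize_dtypes(checkpoint)))
    print("contains fp8 tensors:", has_fp8(checkpoint))
```

Run it with the checkpoint path as the only argument. If `has_fp8` reports False, the fp8 explanation can be ruled out and the problem is more likely in how the model is constructed.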

Could it be a torch issue? For example, an older version causing the data to be treated as a meta tensor?

JEFFSEVENTHSENSE commented 1 month ago

I fixed the issue. I realised that when the CLIP and T5 models are initialised, they are created on the meta device. This is caused by: with init_empty_weights(): clip = CLIPTextModel._from_config(config)

So apparently an older torch will not move the CLIP model to cpu/cuda after this kind of initialisation.
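
To illustrate the failure mode, here is a small sketch in plain PyTorch. It is not the sd-scripts code: it uses `nn.Linear(..., device="meta")` as a stand-in for `init_empty_weights()`, and the helpers `meta_tensor_names` and `materialize` are hypothetical names. A module whose parameters are still on the meta device has shapes but no data, so `.to(device)` has nothing to copy. One way around it, assuming the real weights are available as a state dict, is to allocate storage with `to_empty()` and then load the weights.

```python
import torch
from torch import nn


def meta_tensor_names(module):
    """List the parameters and buffers that are still on the meta device."""
    named = list(module.named_parameters()) + list(module.named_buffers())
    return [name for name, tensor in named if tensor.is_meta]


def materialize(module, state_dict, device="cpu"):
    """Allocate real (uninitialised) storage, then copy the real weights in."""
    module.to_empty(device=device)
    module.load_state_dict(state_dict)
    return module


# Stand-in for a model built under init_empty_weights(): shapes only, no data.
layer = nn.Linear(4, 2, device="meta")
print(meta_tensor_names(layer))

# Moving it directly reproduces the error from the traceback.
raised = False
try:
    layer.to("cpu")
except NotImplementedError as err:
    raised = True
    print("NotImplementedError:", err)

# Illustrative weights; in practice these would come from the checkpoint file.
weights = {"weight": torch.ones(2, 4), "bias": torch.zeros(2)}
materialize(layer, weights)
print(meta_tensor_names(layer))
```

Calling `meta_tensor_names` on the text encoder just before the failing `.to(...)` call would show whether any weights were left on the meta device after the state dict was loaded.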

Now I am facing a different issue: RuntimeError: Expected (head_size % 8 == 0) && (head_size <= 128) to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

JEFFSEVENTHSENSE commented 1 month ago

Resolved everything. The cause was simply that an older PyTorch such as 1.13.1 is not suitable for the current training scripts, because they use some newer PyTorch functions.
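
A quick way to catch this before launching a run is to compare the installed torch version against a minimum. The sketch below is only an illustration: the helper names are hypothetical, and the default minimum of 2.0.0 is an assumption, chosen because `torch.nn.functional.scaled_dot_product_attention` first shipped in PyTorch 2.0. The repository's README is the authority on the version the scripts actually require.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '1.13.1+cu117' into (1, 13, 1); local and pre-release suffixes are ignored."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed, required):
    """True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


def check_torch(required="2.0.0"):
    """Describe whether the installed torch satisfies the (assumed) minimum."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        return "torch is not installed in this environment"
    if meets_minimum(installed, required):
        return f"torch {installed} satisfies the assumed minimum {required}"
    return f"torch {installed} is older than the assumed minimum {required}"


if __name__ == "__main__":
    print(check_torch())
```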