(sd3-flux) "NotImplementedError: Cannot copy out of meta tensor; no data!" when trying to train LoRA

envy-ai commented 3 months ago

I've got an RTX 4090 and I'm running the latest commit of the kohya_ss sd3-flux branch on Windows.

Here is my configuration:

ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = false
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/stableDiffusion3SD3_textEncoderClipL.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
gradient_accumulation_steps = 1
gradient_checkpointing = true
highvram = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 160
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 4
network_args = []
network_dim = 8
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "Prodigy"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/checkpoints/FLUX1/flux1-dev-fp8.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
t5xxl = "C:/AI/ComfyUI/models/clip/t5/google_t5-v1_1-xxl_encoderonly-fp8_e4m3fn.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
text_encoder_lr = 0
unet_lr = 1
wandb_run_name = "flux_test1"
xformers = true

Here's the resulting error when I run it:

2024-08-14 15:13:06 INFO     create LoRA network. base dim (rank): 8, alpha: 4                          lora_flux.py:358
                    INFO     neuron dropout: p=None, rank dropout: p=None, module dropout: p=None       lora_flux.py:359
                    INFO     create LoRA for Text Encoder 1:                                            lora_flux.py:430
                    INFO     create LoRA for Text Encoder 2:                                            lora_flux.py:430
                    INFO     create LoRA for Text Encoder: 24 modules.                                  lora_flux.py:435
                    INFO     create LoRA for U-Net: 304 modules.                                        lora_flux.py:439
                    INFO     enable LoRA for U-Net: 304 modules                                         lora_flux.py:482
FLUX: Gradient checkpointing enabled.
prepare optimizer, data loader etc.
                    INFO     use 8-bit AdamW optimizer | {}                                           train_util.py:4346
Traceback (most recent call last):
  File "C:\AI\kohya_flux\sd-scripts\flux_train_network.py", line 397, in <module>
    trainer.train(args)
  File "C:\AI\kohya_flux\sd-scripts\train_network.py", line 563, in train
    unet = accelerator.prepare(unet)
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\accelerator.py", line 1311, in prepare
    result = tuple(
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\accelerator.py", line 1312, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\accelerator.py", line 1188, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\accelerator.py", line 1435, in prepare_model
    model = model.to(self.device)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1160, in to
    return self._apply(convert)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 810, in _apply
    module._apply(fn)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 833, in _apply
    param_applied = fn(param)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
Traceback (most recent call last):
  File "C:\Users\elbar\anaconda3\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\elbar\anaconda3\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\elbar\anaconda3\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\elbar\\anaconda3\\python.exe', 'C:/AI/kohya_flux/sd-scripts/flux_train_network.py', '--config_file', 'D:/ai/models/lora//config_lora-20240814-131221.toml']' returned non-zero exit status 1.
(base) PS C:\AI\kohya_flux>

Any idea how I can get this to work?

kohya-ss commented 3 months ago

Please use the fp16 version of the weights for flux1 dev and t5xxl.

b-7777777 commented 3 months ago

> Please use the fp16 version of the weights for flux1 dev and t5xxl.

Are there any plans to support optimized models in the future?

kohya-ss commented 3 months ago

> Are there any plans to support optimized models in the future?

As I understand it, using a quantized model as the base model for training is not good for quality.

Currently, the script uses float8_e4m3fn when --fp8_base is specified, so models in this format may be usable. However, it may be difficult to determine which fp8 type a given model uses. In addition, the float8 type used for training may change or become selectable in the future.
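
One way to see which dtypes a checkpoint actually stores is to read its safetensors header, which lists a dtype string per tensor. The following is a minimal sketch using only the standard library. The function name `safetensors_dtype_counts` is made up for illustration, and the exact dtype strings to expect (for example `F16`, `BF16`, `F8_E4M3`) are an assumption based on the safetensors format rather than something stated in this thread.

```python
import json
import struct
from collections import Counter


def safetensors_dtype_counts(path):
    """Count tensors per dtype string by reading only the file header."""
    with open(path, "rb") as f:
        # A safetensors file begins with an unsigned 64-bit little-endian
        # header length, followed by that many bytes of UTF-8 JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


# Hypothetical usage (the path is a placeholder):
# print(dict(safetensors_dtype_counts("flux1-dev.safetensors")))
```

Only the header is read, so this runs quickly even on multi-gigabyte files. A file whose counts are dominated by an 8-bit float dtype is a quantized checkpoint, which per the comment above is not what the script expects as a base model.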

envy-ai commented 3 months ago

So, here are my current settings:

ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/stableDiffusion3SD3_textEncoderClipL.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
full_fp16 = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/checkpoints/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
t5xxl = "C:/AI/ComfyUI/models/clip/t5/google_t5-v1_1-xxl_encoderonly-fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"
xformers = true

I'm using gobs of system memory, to the point where training a lora will take days. Is there anything else I can do to fit into 24G of VRAM?

kohya-ss commented 3 months ago

> I'm using gobs of system memory, to the point where training a lora will take days. Is there anything else I can do to fit into 24G of VRAM?

fp8_base is needed for FLUX.1 LoRA training with 24GB VRAM. Also, please use sdpa instead of xformers.

envy-ai commented 3 months ago

Okay, so I've tried enabling the fp8 base option (both with and without fp16 training, and using both the fp16 and fp8 versions of the model). Memory usage is fine, but when it starts to train, I get this error:

                    INFO     use 8-bit AdamW optimizer | {}                                           train_util.py:4346
enable fp8 training.
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 105
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 105
  num epochs / epoch数: 16
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1600
steps:   0%|                                                                                  | 0/1600 [00:00<?, ?it/s]2024-08-15 10:18:27 INFO     unet dtype: torch.float8_e4m3fn, device: cuda:0                       train_network.py:1004
                    INFO     text_encoder dtype: torch.float8_e4m3fn, device: cuda:0               train_network.py:1006
                    INFO     text_encoder dtype: torch.float8_e4m3fn, device: cuda:0               train_network.py:1006

epoch 1/16
                    INFO     epoch is incremented. current_epoch: 0, epoch: 1                          train_util.py:668
Traceback (most recent call last):
  File "C:\AI\kohya_flux\sd-scripts\flux_train_network.py", line 397, in <module>
    trainer.train(args)
  File "C:\AI\kohya_flux\sd-scripts\train_network.py", line 1076, in train
    text_encoder_conds = text_encoding_strategy.encode_tokens(
  File "C:\AI\kohya_flux\sd-scripts\library\strategy_flux.py", line 74, in encode_tokens
    t5_out, _ = t5xxl(t5_tokens.to(t5xxl.device), return_dict=False, output_hidden_states=True)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\lib\site-packages\transformers\models\t5\modeling_t5.py", line 1971, in forward
    encoder_outputs = self.encoder(
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\lib\site-packages\transformers\models\t5\modeling_t5.py", line 1012, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\sparse.py", line 162, in forward
    return F.embedding(
  File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'
steps:   0%|                                                                                  | 0/1600 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\elbar\anaconda3\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\elbar\anaconda3\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\elbar\anaconda3\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\elbar\\anaconda3\\python.exe', 'C:/AI/kohya_flux/sd-scripts/flux_train_network.py', '--config_file', 'D:/ai/models/lora//config_lora-20240815-101724.toml']' returned non-zero exit status 1.
10:18:29-302861 INFO     Training has ended.
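
The log above reports 16 epochs even though the config sets epoch = 12. A plausible reading, assumed here rather than confirmed in this thread, is that max_train_steps takes precedence and the epoch count is derived from it. The sketch below reproduces that arithmetic from the logged values; the function name `epochs_for_steps` is hypothetical.

```python
import math


def epochs_for_steps(batches_per_epoch, grad_accum_steps, max_train_steps):
    """Epochs needed to reach max_train_steps optimizer updates."""
    # One optimizer update happens every `grad_accum_steps` batches.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    # The last epoch may be partial, so round up.
    return math.ceil(max_train_steps / updates_per_epoch)


# Values taken from the log: 105 batches per epoch, 1 accumulation step,
# 1600 total optimization steps.
print(epochs_for_steps(105, 1, 1600))  # prints 16
```

Fifteen full epochs cover 1575 steps, so a sixteenth (partial) epoch is needed to reach 1600.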

My latest settings are here:

ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/stableDiffusion3SD3_textEncoderClipL.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
fp8_base = true
full_fp16 = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/checkpoints/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
t5xxl = "C:/AI/ComfyUI/models/clip/t5/google_t5-v1_1-xxl_encoderonly-fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"

kohya-ss commented 3 months ago

Please add the cache_text_encoder_outputs option (and cache_text_encoder_outputs_to_disk if needed). Sorry, I missed it last time.

envy-ai commented 3 months ago

Same error. Here are my settings, in case I messed something else up while I was fiddling with it:

ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/stableDiffusion3SD3_textEncoderClipL.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
flux1_cache_text_encoder_outputs = true
flux1_cache_text_encoder_outputs_to_disk = true
fp8_base = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/checkpoints/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
t5xxl = "C:/AI/ComfyUI/models/clip/t5/google_t5-v1_1-xxl_encoderonly-fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"
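
The settings above use the keys flux1_cache_text_encoder_outputs and flux1_cache_text_encoder_outputs_to_disk, while the option named earlier in the thread is cache_text_encoder_outputs. This thread does not establish whether those keys are translated into the script option, so it may be worth checking the generated .toml file directly. A minimal sketch of such a check follows; the helper names are made up for illustration, and the parser only handles the flat `key = value` lines seen in these configs.

```python
def read_flat_config(text):
    """Parse simple `key = value` lines into a dict of strings."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


# Option names exactly as the maintainer wrote them in this thread.
RECOMMENDED = ("fp8_base", "sdpa", "cache_text_encoder_outputs")


def missing_recommended(options):
    """Return recommended options that are absent or not set to true."""
    return [name for name in RECOMMENDED if options.get(name) != "true"]


sample = """
fp8_base = true
sdpa = true
flux1_cache_text_encoder_outputs = true
"""
print(missing_recommended(read_flat_config(sample)))
# prints ['cache_text_encoder_outputs']
```

The check is purely by name, so a key with a different spelling is reported as missing even if the intent was the same.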

Apparently somebody had the same error using ComfyUI:

https://github.com/comfyanonymous/ComfyUI/issues/3725

I'm downloading different versions of t5 and clip, because maybe those are the problem. I'll let you know how it goes.

envy-ai commented 3 months ago

Update: No dice. Got the same error... here are my settings:

ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/clip_l.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
flux1_cache_text_encoder_outputs = true
flux1_cache_text_encoder_outputs_to_disk = true
fp8_base = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/checkpoints/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
t5xxl = "C:/AI/ComfyUI/models/t5/t5xxl_fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"

terrificdm commented 3 months ago

> > I'm using gobs of system memory, to the point where training a lora will take days. Is there anything else I can do to fit into 24G of VRAM?
>
> fp8_base is needed for FLUX.1 LoRA training with 24GB VRAM. Also, please use sdpa instead of xformers.

Why use "sdpa" instead of "xformers"? Thanks.

envy-ai commented 3 months ago

Any chance somebody could post some working 3090/4090 settings and I could just go from there? :)

kohya-ss commented 3 months ago

@envy-ai Have you updated PyTorch to 2.4.0 (and torchvision)? If not, please follow the instructions in the README: https://github.com/kohya-ss/sd-scripts/tree/sd3

kohya-ss commented 3 months ago

> Why use "sdpa" instead of "xformers"? Thanks.

Because FLUX.1 models don't support xformers yet. Even if you specify --xformers, it is ignored and the code runs with PyTorch's SDPA.

envy-ai commented 3 months ago

@kohya-ss I tried that just now, and I'm still getting the same error.

20:05:36-459578 INFO     Kohya_ss GUI version: v24.2.0

20:05:36-787748 INFO     Submodule initialized and updated.
20:05:36-787748 INFO     nVidia toolkit detected
20:05:38-256639 INFO     Torch 2.4.0+cu124
20:05:38-286287 INFO     Torch backend: nVidia CUDA 12.4 cuDNN 90100
20:05:38-287789 INFO     Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24564 Arch (8, 9) Cores 128
20:05:38-287789 INFO     Python version is 3.10.9 | packaged by Anaconda, Inc. | (main, Mar  8 2023, 10:42:25) [MSC
                         v.1916 64 bit (AMD64)]
20:05:38-287789 INFO     Verifying modules installation status from requirements_pytorch_windows.txt...
20:05:38-287789 INFO     Verifying modules installation status from requirements_windows.txt...
20:05:38-287789 INFO     Verifying modules installation status from requirements.txt...
20:05:46-365988 INFO     headless: False
20:05:46-412900 INFO     Using shell=True when running external commands...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
20:06:34-740634 INFO     Start training LoRA Flux1 ...
20:06:34-756260 INFO     Validating lr scheduler arguments...
20:06:34-756260 INFO     Validating optimizer arguments...
20:06:34-756260 INFO     Validating D:/ai/models/lora/ existence and writability... SUCCESS
20:06:34-756260 INFO     Validating C:/AI/ComfyUI/models/unet/FLUX1/flux1-dev.safetensors existence... SUCCESS
20:06:34-756260 INFO     Validating C:/AI/training_data/jrpg_character_designs existence... SUCCESS
20:06:34-756260 INFO     Folder 5_images: 5 repeats found
20:06:34-756260 INFO     Folder 5_images: 21 images found
20:06:34-756260 INFO     Folder 5_images: 21 * 5 = 105 steps
20:06:34-756260 INFO     Regulatization factor: 1
20:06:34-756260 INFO     Total steps: 105
20:06:34-756260 INFO     Train batch size: 1
20:06:34-756260 INFO     Gradient accumulation steps: 1
20:06:34-756260 INFO     Epoch: 12
20:06:34-756260 INFO     Max train steps: 1600
20:06:34-756260 INFO     stop_text_encoder_training = 0
20:06:34-756260 INFO     lr_warmup_steps = 0
20:06:34-771887 INFO     Saving training config to D:/ai/models/lora/flux_test1_20240815-200634.json...
20:06:34-771887 INFO     Executing command: C:\Users\elbar\anaconda3\envs\kohya_flux\Scripts\accelerate.EXE launch
                         --dynamo_backend no --dynamo_mode default --mixed_precision fp16 --num_processes 1
                         --num_machines 1 --num_cpu_threads_per_process 2
                         C:/AI/kohya_flux/sd-scripts/flux_train_network.py --config_file
                         D:/ai/models/lora//config_lora-20240815-200634.toml
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
2024-08-15 20:06:44 INFO     Loading settings from                                                    train_util.py:4193
                             D:/ai/models/lora//config_lora-20240815-200634.toml...
                    INFO     D:/ai/models/lora//config_lora-20240815-200634                           train_util.py:4212
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2024-08-15 20:06:44 INFO     Using DreamBooth method.                                               train_network.py:276
                    INFO     prepare images.                                                          train_util.py:1807
                    INFO     get image size from name of cache files                                  train_util.py:1745
100%|████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 1315.75it/s]
                    INFO     set image size from cache files: 21/21                                   train_util.py:1752
                    INFO     found directory C:\AI\training_data\jrpg_character_designs\5_images      train_util.py:1754
                             contains 21 image files
                    INFO     105 train images with repeating.                                         train_util.py:1848
                    INFO     0 reg images.                                                            train_util.py:1851
                    WARNING  no regularization images / 正則化画像が見つかりませんでした              train_util.py:1856
                    INFO     [Dataset 0]                                                              config_util.py:570
                               batch_size: 1
                               resolution: (512, 512)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 2048
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                               [Subset 0 of Dataset 0]
                                 image_dir: "C:\AI\training_data\jrpg_character_designs\5_images"
                                 image_count: 21
                                 num_repeats: 5
                                 shuffle_caption: False
                                 keep_tokens: 0
                                 keep_tokens_separator:
                                 caption_separator: ,
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 alpha_mask: False,
                                 is_reg: False
                                 class_tokens: images
                                 caption_extension: .txt

                    INFO     [Dataset 0]                                                              config_util.py:576
                    INFO     loading image sizes.                                                      train_util.py:876
100%|██████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<?, ?it/s]
                    INFO     make buckets                                                              train_util.py:882
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:899
                             set, because bucket reso is defined by image size automatically /
                             bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
                             算されるため、min_bucket_resoとmax_bucket_resoは無視されます
                    INFO     number of images (including repeats) /                                    train_util.py:928
                             各bucketの画像枚数（繰り返し回数を含む）
                    INFO     bucket 0: resolution (384, 512), count: 35                                train_util.py:933
                    INFO     bucket 1: resolution (384, 576), count: 25                                train_util.py:933
                    INFO     bucket 2: resolution (384, 640), count: 10                                train_util.py:933
                    INFO     bucket 3: resolution (448, 512), count: 20                                train_util.py:933
                    INFO     bucket 4: resolution (448, 576), count: 10                                train_util.py:933
                    INFO     bucket 5: resolution (512, 512), count: 5                                 train_util.py:933
                    INFO     mean ar error (without repeats): 0.02151275958865067                      train_util.py:938
                    INFO     preparing accelerator                                                  train_network.py:329
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\accelerate\accelerator.py:488: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
accelerator device: cuda
                    INFO     Building CLIP                                                              flux_utils.py:48
2024-08-15 20:06:45 INFO     Loading state dict from C:/AI/ComfyUI/models/clip/clip_l.safetensors      flux_utils.py:141
                    INFO     Loaded CLIP: <All keys matched successfully>                              flux_utils.py:144
                    INFO     Loading state dict from C:/AI/ComfyUI/models/t5/t5xxl_fp16.safetensors    flux_utils.py:187
                    INFO     Loaded T5xxl: <All keys matched successfully>                             flux_utils.py:190
                    INFO     Building Flux model dev                                                    flux_utils.py:23
                    INFO     Loading state dict from                                                    flux_utils.py:28
                             C:/AI/ComfyUI/models/unet/FLUX1/flux1-dev.safetensors
                    INFO     Loaded Flux: <All keys matched successfully>                               flux_utils.py:31
                    INFO     Building AutoEncoder                                                       flux_utils.py:36
                    INFO     Loading state dict from C:/AI/ComfyUI/models/flux/ae.sft                   flux_utils.py:40
                    INFO     Loaded AE: <All keys matched successfully>                                 flux_utils.py:43
import network module: networks.lora_flux
                    INFO     [Dataset 0]                                                              train_util.py:2330
                    INFO     caching latents with caching strategy.                                    train_util.py:984
                    INFO     checking cache validity...                                                train_util.py:994
100%|████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 2621.75it/s]
                    INFO     no latents to cache                                                      train_util.py:1034
2024-08-15 20:06:50 INFO     create LoRA network. base dim (rank): 4, alpha: 2                          lora_flux.py:358
                    INFO     neuron dropout: p=None, rank dropout: p=None, module dropout: p=None       lora_flux.py:359
                    INFO     create LoRA for Text Encoder 1:                                            lora_flux.py:430
                    INFO     create LoRA for Text Encoder 2:                                            lora_flux.py:430
                    INFO     create LoRA for Text Encoder: 24 modules.                                  lora_flux.py:435
                    INFO     create LoRA for U-Net: 304 modules.                                        lora_flux.py:439
                    INFO     enable LoRA for U-Net: 304 modules                                         lora_flux.py:482
FLUX: Gradient checkpointing enabled.
prepare optimizer, data loader etc.
                    INFO     use 8-bit AdamW optimizer | {}                                           train_util.py:4346
enable fp8 training.
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 105
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 105
  num epochs / epoch数: 16
  batch size per device / バッチサイズ: 1
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1600
steps:   0%|                                                                                  | 0/1600 [00:00<?, ?it/s]2024-08-15 20:08:00 INFO     unet dtype: torch.float8_e4m3fn, device: cuda:0                       train_network.py:1004
                    INFO     text_encoder dtype: torch.float8_e4m3fn, device: cuda:0               train_network.py:1006
                    INFO     text_encoder dtype: torch.float8_e4m3fn, device: cuda:0               train_network.py:1006

epoch 1/16
                    INFO     epoch is incremented. current_epoch: 0, epoch: 1                          train_util.py:668
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\utils\checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\transformers\models\clip\modeling_clip.py:480: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
  File "C:\AI\kohya_flux\sd-scripts\flux_train_network.py", line 397, in <module>
    trainer.train(args)
  File "C:\AI\kohya_flux\sd-scripts\train_network.py", line 1076, in train
    text_encoder_conds = text_encoding_strategy.encode_tokens(
  File "C:\AI\kohya_flux\sd-scripts\library\strategy_flux.py", line 74, in encode_tokens
    t5_out, _ = t5xxl(t5_tokens.to(t5xxl.device), return_dict=False, output_hidden_states=True)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\transformers\models\t5\modeling_t5.py", line 1971, in forward
    encoder_outputs = self.encoder(
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\transformers\models\t5\modeling_t5.py", line 1012, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\sparse.py", line 164, in forward
    return F.embedding(
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\functional.py", line 2267, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'
steps:   0%|                                                                                  | 0/1600 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\elbar\\anaconda3\\envs\\kohya_flux\\python.exe', 'C:/AI/kohya_flux/sd-scripts/flux_train_network.py', '--config_file', 'D:/ai/models/lora//config_lora-20240815-200634.toml']' returned non-zero exit status 1.
20:08:02-944389 INFO     Training has ended.

Current settings, in case I've changed them since the last post:

ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/clip_l.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
flux1_cache_text_encoder_outputs = true
flux1_cache_text_encoder_outputs_to_disk = true
fp8_base = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/unet/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
t5xxl = "C:/AI/ComfyUI/models/t5/t5xxl_fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"

kohya-ss commented 3 months ago

Please use cache_text_encoder_outputs and cache_text_encoder_outputs_to_disk, without the flux1_ prefix.
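
As a rough sketch, assuming the rest of the config posted above stays unchanged, the two flux1_-prefixed lines would become something like this (the `true` values are simply carried over from the posted config):

```toml
# Replaces flux1_cache_text_encoder_outputs / flux1_cache_text_encoder_outputs_to_disk
# from the config above; only the key names change.
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
```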

envy-ai commented 3 months ago

That was it! Thank you, and sorry I missed that the first time around. Not only is it working, it significantly reduced VRAM usage.

kohya-ss / sd-scripts

(sd3-flux) "NotImplementedError: Cannot copy out of meta tensor; no data!" when trying to train LoRA #1454