Closed: envy-ai closed this issue 3 months ago.
Please use the fp16 version of the weights for flux1 dev and t5xxl.
Are there any plans to support optimized models in the future?
From my understanding, using a quantized model as the base model for training is not good in terms of quality.
Currently, the script uses float8_e4m3fn when --fp8_base is specified, so models in this format may be usable. However, it may be difficult to determine which fp8 type a given model uses. In addition, the float8 type used for training may change or become selectable in the future.
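For reference, one way to check which float8 variant (if any) a safetensors checkpoint actually stores is to inspect its tensor dtypes with the safetensors library. This is only a minimal sketch, and the file path is a placeholder:

# Minimal sketch: count the tensor dtypes stored in a safetensors checkpoint.
from collections import Counter
from safetensors import safe_open

path = "flux1-dev-fp8.safetensors"  # placeholder path
counts = Counter()
with safe_open(path, framework="pt", device="cpu") as f:
    for key in f.keys():
        # Loading each tensor just to read its dtype; slow but simple.
        counts[str(f.get_tensor(key).dtype)] += 1
print(counts)  # e.g. Counter({'torch.float8_e4m3fn': ..., 'torch.bfloat16': ...})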
So, here are my current settings:
ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/stableDiffusion3SD3_textEncoderClipL.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
full_fp16 = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/checkpoints/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
t5xxl = "C:/AI/ComfyUI/models/clip/t5/google_t5-v1_1-xxl_encoderonly-fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"
xformers = true
I'm using gobs of system memory, to the point where training a lora will take days. Is there anything else I can do to fit into 24G of VRAM?
fp8_base is needed for FLUX.1 LoRA training with 24GB VRAM, and please use sdpa instead of xformers.
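In the TOML config format used in this thread, that corresponds to adding:

fp8_base = true
sdpa = true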
Okay, so I've tried enabling the fp8 base option (both with and without fp16 training, and using both the fp16 and fp8 versions of the model). Memory usage is fine, but when it starts to train, I get this error:
INFO use 8-bit AdamW optimizer | {} train_util.py:4346
enable fp8 training.
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 105
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 105
num epochs / epoch数: 16
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1600
steps: 0%| | 0/1600 [00:00<?, ?it/s]2024-08-15 10:18:27 INFO unet dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1004
INFO text_encoder dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1006
INFO text_encoder dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1006
epoch 1/16
INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668
Traceback (most recent call last):
File "C:\AI\kohya_flux\sd-scripts\flux_train_network.py", line 397, in <module>
trainer.train(args)
File "C:\AI\kohya_flux\sd-scripts\train_network.py", line 1076, in train
text_encoder_conds = text_encoding_strategy.encode_tokens(
File "C:\AI\kohya_flux\sd-scripts\library\strategy_flux.py", line 74, in encode_tokens
t5_out, _ = t5xxl(t5_tokens.to(t5xxl.device), return_dict=False, output_hidden_states=True)
File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\elbar\anaconda3\lib\site-packages\transformers\models\t5\modeling_t5.py", line 1971, in forward
encoder_outputs = self.encoder(
File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\elbar\anaconda3\lib\site-packages\transformers\models\t5\modeling_t5.py", line 1012, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\modules\sparse.py", line 162, in forward
return F.embedding(
File "C:\Users\elbar\anaconda3\lib\site-packages\torch\nn\functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'
steps: 0%| | 0/1600 [00:00<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\elbar\anaconda3\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\elbar\anaconda3\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\elbar\anaconda3\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "C:\Users\elbar\anaconda3\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\elbar\\anaconda3\\python.exe', 'C:/AI/kohya_flux/sd-scripts/flux_train_network.py', '--config_file', 'D:/ai/models/lora//config_lora-20240815-101724.toml']' returned non-zero exit status 1.
10:18:29-302861 INFO Training has ended.
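For context, the failure happens in the T5-XXL embedding lookup: with fp8_base the text encoder weights are cast to float8_e4m3fn (see the "text_encoder dtype" lines above), and CUDA has no index_select kernel for that dtype, so the embedding layer cannot run. A minimal sketch that reproduces just the failing operation, assuming PyTorch 2.1+ and a CUDA GPU (not part of sd-scripts):

import torch

# Cast an embedding table to float8_e4m3fn, roughly what --fp8_base does to the text encoders.
emb = torch.nn.Embedding(num_embeddings=32, embedding_dim=8).to("cuda", dtype=torch.float8_e4m3fn)
ids = torch.tensor([1, 2, 3], device="cuda")

# Fails with: RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'
out = emb(ids)

Caching the text encoder outputs, as suggested below, avoided the error in this thread, presumably because the T5-XXL forward pass then runs once during caching rather than inside the fp8 training loop.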
My latest settings are here:
ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/stableDiffusion3SD3_textEncoderClipL.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
fp8_base = true
full_fp16 = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/checkpoints/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
t5xxl = "C:/AI/ComfyUI/models/clip/t5/google_t5-v1_1-xxl_encoderonly-fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"
Please add the cache_text_encoder_outputs option (and cache_text_encoder_outputs_to_disk if needed). Sorry I missed it last time.
Same error. Here are my settings, in case I messed something else up while I was fiddling with it:
ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/stableDiffusion3SD3_textEncoderClipL.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
flux1_cache_text_encoder_outputs = true
flux1_cache_text_encoder_outputs_to_disk = true
fp8_base = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/checkpoints/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
t5xxl = "C:/AI/ComfyUI/models/clip/t5/google_t5-v1_1-xxl_encoderonly-fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"
Apparently somebody had the same error using ComfyUI:
https://github.com/comfyanonymous/ComfyUI/issues/3725
I'm downloading different versions of t5 and clip, because maybe those are the problem. I'll let you know how it goes.
Update: No dice. Got the same error... here are my settings:
ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/clip_l.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
flux1_cache_text_encoder_outputs = true
flux1_cache_text_encoder_outputs_to_disk = true
fp8_base = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/checkpoints/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
t5xxl = "C:/AI/ComfyUI/models/t5/t5xxl_fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"
I'm using gobs of system memory, to the point where training a lora will take days. Is there anything else I can do to fit into 24G of VRAM?

fp8_base is needed for FLUX.1 LoRA training with 24GB VRAM, and please use sdpa instead of xformers.
Why use "sdpa" instead of "xformers"? Thanks.
Any chance somebody could post some working 3090/4090 settings and I could just go from there? :)
@envy-ai Have you updated PyTorch to 2.4.0 (and torchvision)? If not, please follow the instructions in the README: https://github.com/kohya-ss/sd-scripts/tree/sd3
Why use "sdpa" instead of "xformers"? Thanks.
Because FLUX.1 models don't support xformers yet. Even if you specify --xformers, it is ignored, and the code runs with PyTorch's SDPA.
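For reference, SDPA here means PyTorch's built-in scaled dot-product attention, which dispatches to fused (flash or memory-efficient) kernels when they are available. A minimal sketch of the call, not taken from the sd-scripts code:

import torch
import torch.nn.functional as F

# Toy tensors shaped (batch, heads, sequence, head_dim); the values are arbitrary.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)

# PyTorch selects a fused attention kernel internally, similar to what xformers provides.
out = F.scaled_dot_product_attention(q, k, v)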
@kohya-ss I tried that just now, and I'm still getting the same error.
20:05:36-459578 INFO Kohya_ss GUI version: v24.2.0
20:05:36-787748 INFO Submodule initialized and updated.
20:05:36-787748 INFO nVidia toolkit detected
20:05:38-256639 INFO Torch 2.4.0+cu124
20:05:38-286287 INFO Torch backend: nVidia CUDA 12.4 cuDNN 90100
20:05:38-287789 INFO Torch detected GPU: NVIDIA GeForce RTX 4090 VRAM 24564 Arch (8, 9) Cores 128
20:05:38-287789 INFO Python version is 3.10.9 | packaged by Anaconda, Inc. | (main, Mar 8 2023, 10:42:25) [MSC
v.1916 64 bit (AMD64)]
20:05:38-287789 INFO Verifying modules installation status from requirements_pytorch_windows.txt...
20:05:38-287789 INFO Verifying modules installation status from requirements_windows.txt...
20:05:38-287789 INFO Verifying modules installation status from requirements.txt...
20:05:46-365988 INFO headless: False
20:05:46-412900 INFO Using shell=True when running external commands...
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
20:06:34-740634 INFO Start training LoRA Flux1 ...
20:06:34-756260 INFO Validating lr scheduler arguments...
20:06:34-756260 INFO Validating optimizer arguments...
20:06:34-756260 INFO Validating D:/ai/models/lora/ existence and writability... SUCCESS
20:06:34-756260 INFO Validating C:/AI/ComfyUI/models/unet/FLUX1/flux1-dev.safetensors existence... SUCCESS
20:06:34-756260 INFO Validating C:/AI/training_data/jrpg_character_designs existence... SUCCESS
20:06:34-756260 INFO Folder 5_images: 5 repeats found
20:06:34-756260 INFO Folder 5_images: 21 images found
20:06:34-756260 INFO Folder 5_images: 21 * 5 = 105 steps
20:06:34-756260 INFO Regulatization factor: 1
20:06:34-756260 INFO Total steps: 105
20:06:34-756260 INFO Train batch size: 1
20:06:34-756260 INFO Gradient accumulation steps: 1
20:06:34-756260 INFO Epoch: 12
20:06:34-756260 INFO Max train steps: 1600
20:06:34-756260 INFO stop_text_encoder_training = 0
20:06:34-756260 INFO lr_warmup_steps = 0
20:06:34-771887 INFO Saving training config to D:/ai/models/lora/flux_test1_20240815-200634.json...
20:06:34-771887 INFO Executing command: C:\Users\elbar\anaconda3\envs\kohya_flux\Scripts\accelerate.EXE launch
--dynamo_backend no --dynamo_mode default --mixed_precision fp16 --num_processes 1
--num_machines 1 --num_cpu_threads_per_process 2
C:/AI/kohya_flux/sd-scripts/flux_train_network.py --config_file
D:/ai/models/lora//config_lora-20240815-200634.toml
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
2024-08-15 20:06:44 INFO Loading settings from train_util.py:4193
D:/ai/models/lora//config_lora-20240815-200634.toml...
INFO D:/ai/models/lora//config_lora-20240815-200634 train_util.py:4212
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2024-08-15 20:06:44 INFO Using DreamBooth method. train_network.py:276
INFO prepare images. train_util.py:1807
INFO get image size from name of cache files train_util.py:1745
100%|████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 1315.75it/s]
INFO set image size from cache files: 21/21 train_util.py:1752
INFO found directory C:\AI\training_data\jrpg_character_designs\5_images train_util.py:1754
contains 21 image files
INFO 105 train images with repeating. train_util.py:1848
INFO 0 reg images. train_util.py:1851
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1856
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (512, 512)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
[Subset 0 of Dataset 0]
image_dir: "C:\AI\training_data\jrpg_character_designs\5_images"
image_count: 21
num_repeats: 5
shuffle_caption: False
keep_tokens: 0
keep_tokens_separator:
caption_separator: ,
secondary_separator: None
enable_wildcard: False
caption_dropout_rate: 0.0
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
caption_prefix: None
caption_suffix: None
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
alpha_mask: False,
is_reg: False
class_tokens: images
caption_extension: .txt
INFO [Dataset 0] config_util.py:576
INFO loading image sizes. train_util.py:876
100%|██████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<?, ?it/s]
INFO make buckets train_util.py:882
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is train_util.py:899
set, because bucket reso is defined by image size automatically /
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計
算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO number of images (including repeats) / train_util.py:928
各bucketの画像枚数(繰り返し回数を含む)
INFO bucket 0: resolution (384, 512), count: 35 train_util.py:933
INFO bucket 1: resolution (384, 576), count: 25 train_util.py:933
INFO bucket 2: resolution (384, 640), count: 10 train_util.py:933
INFO bucket 3: resolution (448, 512), count: 20 train_util.py:933
INFO bucket 4: resolution (448, 576), count: 10 train_util.py:933
INFO bucket 5: resolution (512, 512), count: 5 train_util.py:933
INFO mean ar error (without repeats): 0.02151275958865067 train_util.py:938
INFO preparing accelerator train_network.py:329
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\accelerate\accelerator.py:488: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
self.scaler = torch.cuda.amp.GradScaler(**kwargs)
accelerator device: cuda
INFO Building CLIP flux_utils.py:48
2024-08-15 20:06:45 INFO Loading state dict from C:/AI/ComfyUI/models/clip/clip_l.safetensors flux_utils.py:141
INFO Loaded CLIP: <All keys matched successfully> flux_utils.py:144
INFO Loading state dict from C:/AI/ComfyUI/models/t5/t5xxl_fp16.safetensors flux_utils.py:187
INFO Loaded T5xxl: <All keys matched successfully> flux_utils.py:190
INFO Building Flux model dev flux_utils.py:23
INFO Loading state dict from flux_utils.py:28
C:/AI/ComfyUI/models/unet/FLUX1/flux1-dev.safetensors
INFO Loaded Flux: <All keys matched successfully> flux_utils.py:31
INFO Building AutoEncoder flux_utils.py:36
INFO Loading state dict from C:/AI/ComfyUI/models/flux/ae.sft flux_utils.py:40
INFO Loaded AE: <All keys matched successfully> flux_utils.py:43
import network module: networks.lora_flux
INFO [Dataset 0] train_util.py:2330
INFO caching latents with caching strategy. train_util.py:984
INFO checking cache validity... train_util.py:994
100%|████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<00:00, 2621.75it/s]
INFO no latents to cache train_util.py:1034
2024-08-15 20:06:50 INFO create LoRA network. base dim (rank): 4, alpha: 2 lora_flux.py:358
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora_flux.py:359
INFO create LoRA for Text Encoder 1: lora_flux.py:430
INFO create LoRA for Text Encoder 2: lora_flux.py:430
INFO create LoRA for Text Encoder: 24 modules. lora_flux.py:435
INFO create LoRA for U-Net: 304 modules. lora_flux.py:439
INFO enable LoRA for U-Net: 304 modules lora_flux.py:482
FLUX: Gradient checkpointing enabled.
prepare optimizer, data loader etc.
INFO use 8-bit AdamW optimizer | {} train_util.py:4346
enable fp8 training.
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 105
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 105
num epochs / epoch数: 16
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1600
steps: 0%| | 0/1600 [00:00<?, ?it/s]2024-08-15 20:08:00 INFO unet dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1004
INFO text_encoder dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1006
INFO text_encoder dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1006
epoch 1/16
INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\utils\checkpoint.py:92: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\transformers\models\clip\modeling_clip.py:480: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
File "C:\AI\kohya_flux\sd-scripts\flux_train_network.py", line 397, in <module>
trainer.train(args)
File "C:\AI\kohya_flux\sd-scripts\train_network.py", line 1076, in train
text_encoder_conds = text_encoding_strategy.encode_tokens(
File "C:\AI\kohya_flux\sd-scripts\library\strategy_flux.py", line 74, in encode_tokens
t5_out, _ = t5xxl(t5_tokens.to(t5xxl.device), return_dict=False, output_hidden_states=True)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\transformers\models\t5\modeling_t5.py", line 1971, in forward
encoder_outputs = self.encoder(
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\transformers\models\t5\modeling_t5.py", line 1012, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\modules\sparse.py", line 164, in forward
return F.embedding(
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\torch\nn\functional.py", line 2267, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: "index_select_cuda" not implemented for 'Float8_e4m3fn'
steps: 0%| | 0/1600 [00:00<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "C:\Users\elbar\anaconda3\envs\kohya_flux\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\\Users\\elbar\\anaconda3\\envs\\kohya_flux\\python.exe', 'C:/AI/kohya_flux/sd-scripts/flux_train_network.py', '--config_file', 'D:/ai/models/lora//config_lora-20240815-200634.toml']' returned non-zero exit status 1.
20:08:02-944389 INFO Training has ended.
Current settings, in case I changed them since the last one:
ae = "C:/AI/ComfyUI/models/flux/ae.sft"
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
caption_extension = ".txt"
clip_l = "C:/AI/ComfyUI/models/clip/clip_l.safetensors"
clip_skip = 1
discrete_flow_shift = 3
dynamo_backend = "no"
enable_bucket = true
epoch = 12
flux1_cache_text_encoder_outputs = true
flux1_cache_text_encoder_outputs_to_disk = true
fp8_base = true
gradient_accumulation_steps = 1
gradient_checkpointing = true
huber_c = 0.1
huber_schedule = "snr"
loss_type = "l2"
lr_scheduler = "cosine_with_restarts"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
max_bucket_reso = 2048
max_data_loader_n_workers = 0
max_grad_norm = 1
max_timestep = 1000
max_token_length = 75
max_train_steps = 1600
mem_eff_attn = true
min_bucket_reso = 256
mixed_precision = "fp16"
model_prediction_type = "sigma_scaled"
network_alpha = 2
network_args = []
network_dim = 4
network_module = "networks.lora_flux"
network_train_unet_only = true
noise_offset_type = "Original"
optimizer_args = []
optimizer_type = "AdamW8bit"
output_dir = "D:/ai/models/lora/"
output_name = "flux_test1"
pretrained_model_name_or_path = "C:/AI/ComfyUI/models/unet/FLUX1/flux1-dev.safetensors"
prior_loss_weight = 1
resolution = "512,512"
sample_prompts = "D:/ai/models/lora/sample/prompt.txt"
sample_sampler = "euler_a"
save_every_n_epochs = 1
save_model_as = "safetensors"
save_precision = "bf16"
sdpa = true
t5xxl = "C:/AI/ComfyUI/models/t5/t5xxl_fp16.safetensors"
timestep_sampling = "sigma"
train_batch_size = 1
train_data_dir = "C:/AI/training_data/jrpg_character_designs"
unet_lr = 0.0001
wandb_run_name = "flux_test1"
Please use cache_text_encoder_outputs and cache_text_encoder_outputs_to_disk, without the flux1_ prefix.
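In other words, the two lines in the TOML should read:

cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true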
That was it! Thank you, and sorry I missed that the first time around. Not only is it working, it significantly reduced VRAM usage.
I've got an RTX 4090 and I'm running the latest commit of the kohya_ss sd3-flux branch on Windows.
Here is my configuration:
Here's the resulting error when I run it:
Any idea how I can get this to work?