If I comment out network_args, create a resume state, and then load it, everything works fine (is this a problem specific to algo=full?).
I run git pull every time because the environment is in the cloud, but this worked fine about two weeks ago.
I'm attaching the log, a pip list, and the two configuration files.
The problem reproduces even with max_train_steps=1.
(I used DeepL so others can follow along, but replies in Japanese are fine.)
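For reference, this is roughly the kind of setting involved; the exact values are in the attached config_file.toml, so treat this excerpt as a sketch reconstructed from the log output (network module lycoris.kohya, algo full), not a copy of the real file:

```toml
# Hypothetical excerpt of config_file.toml (actual file attached below).
# Commenting out network_args is the workaround that makes resume work.
network_module = "lycoris.kohya"
network_args = ["algo=full"]
```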
log
root@nsn5cuyse3:/notebooks/learning/artbook_mod# ./run.sh
2023-12-04 21:39:53.087640: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-04 21:39:53.744951: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading settings from /notebooks/learning/artbook_mod/config_file.toml...
/notebooks/learning/artbook_mod/config_file
prepare tokenizer
update token length: 225
Loading dataset config from /notebooks/learning/artbook_mod/dataset_config.toml
prepare images.
found directory /notebooks/learning/artbook_mod/learningImage contains 646 image files
646 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 4
resolution: (1024, 1024)
enable_bucket: True
min_bucket_reso: 1024
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 646/646 [00:00<00:00, 2465.50it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (832, 1152), count: 279
bucket 1: resolution (832, 1216), count: 253
bucket 2: resolution (1152, 832), count: 92
bucket 3: resolution (1216, 832), count: 22
mean ar error (without repeats): 0.015544934824601562
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: /var/opt/models/anyloraCheckpoint_novaeFp16.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net:
loading vae:
loading text encoder:
Enable xformers for U-Net
import network module: lycoris.kohya
[Dataset 0]
caching latents.
checking cache validity...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 646/646 [00:00<00:00, 821.86it/s]
caching latents...
0it [00:00, ?it/s]
Using rank adaptation algo: full
Use Dropout value: 0.0
Create LyCORIS Module
create LyCORIS for Text Encoder: 72 modules.
Create LyCORIS Module
create LyCORIS for U-Net: 282 modules.
module type table: {'FullModule': 354}
enable LyCORIS for text encoder
enable LyCORIS for U-Net
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
DownBlock2D False -> True
UNetMidBlock2DCrossAttn False -> True
UpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
prepare optimizer, data loader etc.
use 8-bit AdamW optimizer | {}
resume training from local state: /notebooks/output/learning/artbook_modern_x3_128_full_AdamW8bit_cosine_1k_5.5h_rate0.05_tag-state
Traceback (most recent call last):
File "/var/opt/sd-scripts/train_network.py", line 1012, in
trainer.train(args)
File "/var/opt/sd-scripts/train_network.py", line 466, in train
train_util.resume_from_local_or_hf_if_specified(accelerator, args)
File "/var/opt/sd-scripts/library/train_util.py", line 3341, in resume_from_local_or_hf_if_specified
accelerator.load_state(args.resume)
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2938, in load_state
load_accelerator_state(
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/checkpointing.py", line 159, in load_accelerator_state
models[i].load_state_dict(torch.load(input_model_file, map_location=map_location), **load_model_func_kwargs)
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LycorisNetwork:
Missing key(s) in state_dict: "lora_te_text_model_encoder_layers_0_self_attn_k_proj.weight", "lora_te_text_model_encoder_layers_0_self_attn_k_proj.bias", "lora_te_text_model_encoder_layers_0_self_attn_v_proj.weight", "lora_te_text_model_encoder_layers_0_self_attn_v_proj.bias", "lora_te_text_model_encoder_layers_0_self_attn_q_proj.weight", "lora_te_text_model_encoder_layers_0_self_attn_q_proj.bias", "lora_te_text_model_encoder_layers_0_self_attn_out_proj.weight", "lora_te_text_model_encoder_layers_0_self_attn_out_proj.bias", "lora_te_text_model_encoder_layers_0_mlp_fc1.weight", "lora_te_text_model_encoder_layers_0_mlp_fc1.bias", "lora_te_text_model_encoder_layers_0_mlp_fc2.weight", "lora_te_text_model_encoder_layers_0_mlp_fc2.bias", "lora_te_text_model_encoder_layers_1_self_attn_k_proj.weight", "lora_te_text_model_encoder_layers_1_self_attn_k_proj.bias", "lora_te_text_model_encoder_layers_1_self_attn_v_proj.weight", "lora_te_text_model_encoder_layers_1_self_attn_v_proj.bias", "lora_te_text_model_encoder_layers_1_self_attn_q_proj.weight", "lora_te_text_model_encoder_layers_1_self_attn_q_proj.bias", "lora_te_text_model_encoder_layers_1_self_attn_out_proj.weight", "lora_te_text_model_encoder_layers_1_self_attn_out_proj.bias", "lora_te_text_model_encoder_layers_1_mlp_fc1.weight", "lora_te_text_model_encoder_layers_1_mlp_fc1.bias", "lora_te_text_model_encoder_layers_1_mlp_fc2.weight", "lora_te_text_model_encoder_layers_1_mlp_fc2.bias", "lora_te_text_model_encoder_layers_2_self_attn_k_proj.weight", "lora_te_text_model_encoder_layers_2_self_attn_k_proj.bias", "lora_te_text_model_encoder_layers_2_self_attn_v_proj.weight", "lora_te_text_model_encoder_layers_2_self_attn_v_proj.bias", "lora_te_text_model_encoder_layers_2_self_attn_q_proj.weight", "lora_te_text_model_encoder_layers_2_self_attn_q_proj.bias", "lora_te_text_model_encoder_layers_2_self_attn_out_proj.weight", "lora_te_text_model_encoder_layers_2_self_attn_out_proj.bias", 
"lora_te_text_model_encoder_layers_2_mlp_fc1.weight", "lora_te_text_model_encoder_layers_2_mlp_fc1.bias", "lora_te_text_model_encoder_layers_2_mlp_fc2.weight", "lora_te_text_model_encoder_layers_2_mlp_fc2.bias", "lora_te_text_model_encoder_layers_3_self_attn_k_proj.weight", "lora_te_text_model_encoder_layers_3_self_attn_k_proj.bias", "lora_te_text_model_encoder_layers_3_self_attn_v_proj.weight", "lora_te_text_model_encoder_layers_3_self_attn_v_proj.bias", "lora_te_text_model_encoder_layers_3_self_attn_q_proj.weight",
(Too many missing keys to list here; omitted.)
"lora_unet_conv_out.diff", "lora_unet_conv_out.diff_b".
Traceback (most recent call last):
File "/var/opt/sd-scripts/./venv/bin/accelerate", line 8, in
sys.exit(main())
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/var/opt/sd-scripts/venv/bin/python3', 'train_network.py', '--config_file=/notebooks/learning/artbook_mod/config_file.toml', '--dataset_config=/notebooks/learning/artbook_mod/dataset_config.toml']' returned non-zero exit status 1.
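For context on the error above: accelerate's load_state() calls PyTorch's load_state_dict(), which is strict by default, so any parameter present in the live network but missing from the saved state raises exactly this RuntimeError with a "Missing key(s)" list. A minimal sketch of that failure mode (nn.Linear is just a stand-in here, not the actual LycorisNetwork):

```python
import torch.nn as nn

# Stand-in model; the real case is the LycorisNetwork built for algo=full.
model = nn.Linear(4, 4)

# Simulate a checkpoint that was saved with fewer modules than the
# network being resumed (e.g. saved under a different algo/network_args).
saved = model.state_dict()
saved.pop("bias")

try:
    model.load_state_dict(saved)  # strict=True is the default
except RuntimeError as e:
    print("load failed:", "Missing key(s)" in str(e))
```

This suggests the state file and the freshly created algo=full network disagree about which module keys exist, which would match the workaround of commenting out network_args.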
pip list
root@nsn5cuyse3:/notebooks/learning/artbook_mod# /var/opt/sd-scripts/venv/bin/pip list
Package Version Editable project location
(Output omitted; too long to list here.)
config_file.toml
dataset_config.toml