If I comment out network_args, create a resume state, and then load it, everything works fine (is this a problem specific to algo=full?).
I run git pull every time because the environment is in the cloud, but this worked fine about two weeks ago.
I'm attaching the log, a pip list, and the two configuration files.
The problem reproduces even with max_train_steps=1.
(I used DeepL so others can follow along, but replies in Japanese are fine.)
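For reference, this is roughly the kind of setting involved; the exact values are in the attached config_file.toml, so treat this excerpt as a sketch reconstructed from the log output (network module lycoris.kohya, algo full), not a copy of the real file:

```toml
# Hypothetical excerpt of config_file.toml (actual file attached below).
# Commenting out network_args is the workaround that makes resume work.
network_module = "lycoris.kohya"
network_args = ["algo=full"]
```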
log
root@nsn5cuyse3:/notebooks/learning/artbook_mod# ./run.sh
2023-12-04 21:39:53.087640: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-04 21:39:53.744951: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading settings from /notebooks/learning/artbook_mod/config_file.toml...
/notebooks/learning/artbook_mod/config_file
prepare tokenizer
update token length: 225
Loading dataset config from /notebooks/learning/artbook_mod/dataset_config.toml
prepare images.
found directory /notebooks/learning/artbook_mod/learningImage contains 646 image files
646 train images with repeating.
0 reg images.
no regularization images / 正則化画像が見つかりませんでした
[Dataset 0]
batch_size: 4
resolution: (1024, 1024)
enable_bucket: True
min_bucket_reso: 1024
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
[Dataset 0]
loading image sizes.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 646/646 [00:00<00:00, 2465.50it/s]
make buckets
min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (832, 1152), count: 279
bucket 1: resolution (832, 1216), count: 253
bucket 2: resolution (1152, 832), count: 92
bucket 3: resolution (1216, 832), count: 22
mean ar error (without repeats): 0.015544934824601562
preparing accelerator
loading model for process 0/1
load StableDiffusion checkpoint: /var/opt/models/anyloraCheckpoint_novaeFp16.safetensors
UNet2DConditionModel: 64, 8, 768, False, False
loading u-net:
loading vae:
loading text encoder:
Enable xformers for U-Net
import network module: lycoris.kohya
[Dataset 0]
caching latents.
checking cache validity...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 646/646 [00:00<00:00, 821.86it/s]
caching latents...
0it [00:00, ?it/s]
Using rank adaptation algo: full
Use Dropout value: 0.0
Create LyCORIS Module
create LyCORIS for Text Encoder: 72 modules.
Create LyCORIS Module
create LyCORIS for U-Net: 282 modules.
module type table: {'FullModule': 354}
enable LyCORIS for text encoder
enable LyCORIS for U-Net
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
CrossAttnDownBlock2D False -> True
DownBlock2D False -> True
UNetMidBlock2DCrossAttn False -> True
UpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
CrossAttnUpBlock2D False -> True
prepare optimizer, data loader etc.
use 8-bit AdamW optimizer | {}
resume training from local state: /notebooks/output/learning/artbook_modern_x3_128_full_AdamW8bit_cosine_1k_5.5h_rate0.05_tag-state
Traceback (most recent call last):
File "/var/opt/sd-scripts/train_network.py", line 1012, in
trainer.train(args)
File "/var/opt/sd-scripts/train_network.py", line 466, in train
train_util.resume_from_local_or_hf_if_specified(accelerator, args)
File "/var/opt/sd-scripts/library/train_util.py", line 3341, in resume_from_local_or_hf_if_specified
accelerator.load_state(args.resume)
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2938, in load_state
load_accelerator_state(
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/checkpointing.py", line 159, in load_accelerator_state
models[i].load_state_dict(torch.load(input_model_file, map_location=map_location), **load_model_func_kwargs)
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LycorisNetwork:
Missing key(s) in state_dict: "lora_te_text_model_encoder_layers_0_self_attn_k_proj.weight", "lora_te_text_model_encoder_layers_0_self_attn_k_proj.bias", "lora_te_text_model_encoder_layers_0_self_attn_v_proj.weight", "lora_te_text_model_encoder_layers_0_self_attn_v_proj.bias", "lora_te_text_model_encoder_layers_0_self_attn_q_proj.weight", "lora_te_text_model_encoder_layers_0_self_attn_q_proj.bias", "lora_te_text_model_encoder_layers_0_self_attn_out_proj.weight", "lora_te_text_model_encoder_layers_0_self_attn_out_proj.bias", "lora_te_text_model_encoder_layers_0_mlp_fc1.weight", "lora_te_text_model_encoder_layers_0_mlp_fc1.bias", "lora_te_text_model_encoder_layers_0_mlp_fc2.weight", "lora_te_text_model_encoder_layers_0_mlp_fc2.bias", "lora_te_text_model_encoder_layers_1_self_attn_k_proj.weight", "lora_te_text_model_encoder_layers_1_self_attn_k_proj.bias", "lora_te_text_model_encoder_layers_1_self_attn_v_proj.weight", "lora_te_text_model_encoder_layers_1_self_attn_v_proj.bias", "lora_te_text_model_encoder_layers_1_self_attn_q_proj.weight", "lora_te_text_model_encoder_layers_1_self_attn_q_proj.bias", "lora_te_text_model_encoder_layers_1_self_attn_out_proj.weight", "lora_te_text_model_encoder_layers_1_self_attn_out_proj.bias", "lora_te_text_model_encoder_layers_1_mlp_fc1.weight", "lora_te_text_model_encoder_layers_1_mlp_fc1.bias", "lora_te_text_model_encoder_layers_1_mlp_fc2.weight", "lora_te_text_model_encoder_layers_1_mlp_fc2.bias", "lora_te_text_model_encoder_layers_2_self_attn_k_proj.weight", "lora_te_text_model_encoder_layers_2_self_attn_k_proj.bias", "lora_te_text_model_encoder_layers_2_self_attn_v_proj.weight", "lora_te_text_model_encoder_layers_2_self_attn_v_proj.bias", "lora_te_text_model_encoder_layers_2_self_attn_q_proj.weight", "lora_te_text_model_encoder_layers_2_self_attn_q_proj.bias", "lora_te_text_model_encoder_layers_2_self_attn_out_proj.weight", "lora_te_text_model_encoder_layers_2_self_attn_out_proj.bias", 
"lora_te_text_model_encoder_layers_2_mlp_fc1.weight", "lora_te_text_model_encoder_layers_2_mlp_fc1.bias", "lora_te_text_model_encoder_layers_2_mlp_fc2.weight", "lora_te_text_model_encoder_layers_2_mlp_fc2.bias", "lora_te_text_model_encoder_layers_3_self_attn_k_proj.weight", "lora_te_text_model_encoder_layers_3_self_attn_k_proj.bias", "lora_te_text_model_encoder_layers_3_self_attn_v_proj.weight", "lora_te_text_model_encoder_layers_3_self_attn_v_proj.bias", "lora_te_text_model_encoder_layers_3_self_attn_q_proj.weight",
(Too many missing keys to list here; omitted.)
"lora_unet_conv_out.diff", "lora_unet_conv_out.diff_b".
Traceback (most recent call last):
File "/var/opt/sd-scripts/./venv/bin/accelerate", line 8, in
sys.exit(main())
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/var/opt/sd-scripts/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/var/opt/sd-scripts/venv/bin/python3', 'train_network.py', '--config_file=/notebooks/learning/artbook_mod/config_file.toml', '--dataset_config=/notebooks/learning/artbook_mod/dataset_config.toml']' returned non-zero exit status 1.
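For context on the error above: accelerate's load_state() calls PyTorch's load_state_dict(), which is strict by default, so any parameter present in the live network but missing from the saved state raises exactly this RuntimeError with a "Missing key(s)" list. A minimal sketch of that failure mode (nn.Linear is just a stand-in here, not the actual LycorisNetwork):

```python
import torch.nn as nn

# Stand-in model; the real case is the LycorisNetwork built for algo=full.
model = nn.Linear(4, 4)

# Simulate a checkpoint that was saved with fewer modules than the
# network being resumed (e.g. saved under a different algo/network_args).
saved = model.state_dict()
saved.pop("bias")

try:
    model.load_state_dict(saved)  # strict=True is the default
except RuntimeError as e:
    print("load failed:", "Missing key(s)" in str(e))
```

This suggests the state file and the freshly created algo=full network disagree about which module keys exist, which would match the workaround of commenting out network_args.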
pip list
root@nsn5cuyse3:/notebooks/learning/artbook_mod# /var/opt/sd-scripts/venv/bin/pip list
Package Version Editable project location
(Output omitted; too long to list here.)
config_file.toml
dataset_config.toml