Open jacquesfeng123 opened 1 year ago
If you are using a metadata .json file, you may get this error if you change the resolution in the dataset config after creating the metadata. Please delete the *.npz files and re-run prepare_buckets_latents.py.
If you are training without the metadata file, could you please share your training settings?
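For anyone who wants to script the cleanup step, here is a minimal sketch. It assumes the cached latents are stored as .npz files somewhere under the image folder; the function name `remove_cached_latents` and the `dry_run` flag are made up for this example and are not part of the training scripts.

```python
from pathlib import Path


def remove_cached_latents(image_dir: str, dry_run: bool = True) -> list[Path]:
    """Return every *.npz file under image_dir; delete them unless dry_run is True."""
    stale = sorted(Path(image_dir).rglob("*.npz"))
    for npz in stale:
        if not dry_run:
            npz.unlink()  # remove the cached latent so it is regenerated later
    return stale


# Example (hypothetical path): list first, then delete for real.
# print(remove_cached_latents("D:/captioning/xxxx"))
# remove_cached_latents("D:/captioning/xxxx", dry_run=False)
```

After the files are removed, re-running prepare_buckets_latents.py should regenerate them at the current resolution settings.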
Hey, thanks for replying, Kohya. I set this aside for a week, so sorry for the late reply.
I deleted all the .npz files and re-ran the bucket preparation, although that step is integrated into bmaltais's GUI process.
It still returns the same error.
Here are my settings:
{ "pretrained_model_name_or_path": "stabilityai/stable-diffusion-2-1", "v2": true, "v_parameterization": true, "train_dir": "D:/SS_kohya/111restart/config", "image_folder": "D:/captioning/xxxx", "output_dir": "D:/SS_kohya/111restart/model_output", "logging_dir": "D:/SS_kohya/111restart/log", "max_resolution": "768,768", "min_bucket_reso": "448", "max_bucket_reso": "1280", "batch_size": "16", "flip_aug": false, "caption_metadata_filename": "meta_cap_try1.json", "latent_metadata_filename": "meta_lat_try1.json", "full_path": false, "learning_rate": 5e-08, "lr_scheduler": "cosine", "lr_warmup": 0, "dataset_repeats": "1", "train_batch_size": 16, "epoch": 50, "save_every_n_epochs": 1, "mixed_precision": "bf16", "save_precision": "bf16", "seed": "", "num_cpu_threads_per_process": 24, "train_text_encoder": false, "create_caption": true, "create_buckets": true, "save_model_as": "safetensors", "caption_extension": ".txt", "xformers": false, "clip_skip": 2, "save_state": true, "resume": "", "gradient_checkpointing": true, "gradient_accumulation_steps": 6.0, "mem_eff_attn": true, "shuffle_caption": true, "output_name": "xxxx", "max_token_length": "150", "max_train_epochs": "", "max_data_loader_n_workers": "", "full_fp16": false, "color_aug": false, "model_list": "stabilityai/stable-diffusion-2-1", "cache_latents": true, "cache_latents_to_disk": false, "use_latent_files": "Yes", "keep_tokens": 0, "persistent_data_loader_workers": true, "bucket_no_upscale": false, "random_crop": false, "bucket_reso_steps": 64.0, "caption_dropout_every_n_epochs": 0.0, "caption_dropout_rate": 0.05, "optimizer": "AdamW8bit", "optimizer_args": "", "noise_offset_type": "Original", "noise_offset": 0.05, "adaptive_noise_scale": 0, "multires_noise_iterations": 0, "multires_noise_discount": 0, "sample_every_n_steps": 500, "sample_every_n_epochs": 1, "sample_sampler": "euler_a", "sample_prompts": "xxxx", "additional_parameters": "", "vae_batch_size": 16, "min_snr_gamma": 0, "weighted_captions": false, 
"save_every_n_steps": 1500, "save_last_n_steps": 0, "save_last_n_steps_state": 0, "use_wandb": "", "wandb_api_key": "False" }
Just to clarify, I don't get this error if I train on 10 images, but I do get it when I use all the images, so it feels like a data input problem.
I simply click "Start training" in the UI of bmaltais's repo, so the preparation probably happens internally and I don't touch it at all. As far as I know, I have not changed the resolution in the dataset config.
I suspect it's an issue with switching between buckets or with shuffling?
Thank you for sharing the settings. The settings seem to be ok.
I too am considering the possibility of potential problems with bucketing. I will investigate. Could you please share the complete stack trace of the error?
Also, just to confirm. Would this problem still occur if you set the batch size to 1?
Below is the complete log from the start of training. I will try again with batch size = 1 and get back to you later.
found 18660 images.
new metadata will be created / 新しいメタデータファイルが作成されます
merge caption texts to metadata json.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18660/18660 [00:01<00:00, 11361.83it/s]
writing metadata: D:/SS_kohya/111restart/config/meta_cap_try3.json
done!
./venv/Scripts/python.exe finetune/prepare_buckets_latents.py "D:/captioning/Architecture_for_ss_kohya" "D:/SS_kohya/111restart/config/meta_cap_try3.json" "D:/SS_kohya/111restart/config/meta_lat_try3.json" "stabilityai/stable-diffusion-2-1" --batch_size=12 --max_resolution=768,768 --min_bucket_reso=448 --max_bucket_reso=1280 --mixed_precision=fp16
found 18660 images.
loading existing metadata: D:/SS_kohya/111restart/config/meta_cap_try3.json
load VAE: stabilityai/stable-diffusion-2-1
exception occurs in loading vae: stabilityai/stable-diffusion-2-1 does not appear to have a file named diffusion_pytorch_model.bin.
retry with subfolder='vae'
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\safetensors\torch.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18660/18660 [21:02<00:00, 14.78it/s]
bucket 0 (448, 1216): 6
bucket 1 (512, 1088): 17
bucket 2 (512, 1152): 4
bucket 3 (576, 960): 188
bucket 4 (576, 1024): 83
bucket 5 (640, 896): 2348
bucket 6 (704, 832): 854
bucket 7 (768, 768): 880
bucket 8 (832, 704): 962
bucket 9 (896, 640): 7567
bucket 10 (960, 576): 2016
bucket 11 (1024, 576): 2511
bucket 12 (1088, 512): 771
bucket 13 (1152, 512): 302
bucket 14 (1216, 448): 110
bucket 15 (1280, 448): 41
mean ar error: 0.06202298430829582
writing metadata: D:/SS_kohya/111restart/config/meta_lat_try3.json
done!
image_num = 18660
repeats = 18660
max_train_steps = 19438
lr_warmup_steps = 0
accelerate launch --num_cpu_threads_per_process=24 "./fine_tune.py" --v2 --v_parameterization --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" --in_json="D:/SS_kohya/111restart/config/meta_lat_try3.json" --train_data_dir="D:/captioning/Architecture_for_ss_kohya" --output_dir="D:/SS_kohya/111restart/model_output" --logging_dir="D:/SS_kohya/111restart/log" --dataset_repeats=1 --learning_rate=5e-08 --enable_bucket --resolution=768,768 --min_bucket_reso=448 --max_bucket_reso=1280 --save_model_as=safetensors --gradient_accumulation_steps=4 --output_name="AIRI_aube_edu_resi_test1_settings_search_restart" --max_token_length=150 --learning_rate="5e-08" --lr_scheduler="cosine" --train_batch_size="12" --max_train_steps="19438" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="AdamW8bit" --max_token_length=150 --clip_skip=2 --caption_dropout_rate="0.05" --vae_batch_size="16" --bucket_reso_steps=64 --save_every_n_steps="1500" --save_state --mem_eff_attn --shuffle_caption --gradient_checkpointing --persistent_data_loader_workers --noise_offset=0.05 --wandb_api_key="False" --sample_sampler=euler_a --sample_prompts="D:/SS_kohya/111restart/model_output\sample\prompt.txt" --sample_every_n_epochs="1" --sample_every_n_steps="500"
v2 with clip_skip will be unexpected / v2でclip_skipを使用することは想定されていません
prepare tokenizer
update token length: 150
loading existing metadata: D:/SS_kohya/111restart/config/meta_lat_try3.json
using bucket info in metadata / メタデータ内のbucket情報を使います
[Dataset 0]
batch_size: 12
resolution: (768, 768)
enable_bucket: True
min_bucket_reso: None
max_bucket_reso: None
bucket_reso_steps: None
bucket_no_upscale: None
[Subset 0 of Dataset 0]
image_dir: "D:/captioning/Architecture_for_ss_kohya"
image_count: 18650
num_repeats: 1
shuffle_caption: True
keep_tokens: 0
caption_dropout_rate: 0.05
caption_dropout_every_n_epoches: 0
caption_tag_dropout_rate: 0.0
color_aug: False
flip_aug: False
face_crop_aug_range: None
random_crop: False
token_warmup_min: 1,
token_warmup_step: 0,
metadata_file: D:/SS_kohya/111restart/config/meta_lat_try3.json
[Dataset 0]
loading image sizes.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18650/18650 [00:00<00:00, 6237442.76it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (448, 1216), count: 6
bucket 1: resolution (512, 1088), count: 17
bucket 2: resolution (512, 1152), count: 4
bucket 3: resolution (576, 960), count: 188
bucket 4: resolution (576, 1024), count: 83
bucket 5: resolution (640, 896), count: 2348
bucket 6: resolution (704, 832), count: 854
bucket 7: resolution (768, 768), count: 879
bucket 8: resolution (832, 704), count: 960
bucket 9: resolution (896, 640), count: 7562
bucket 10: resolution (960, 576), count: 2014
bucket 11: resolution (1024, 576), count: 2511
bucket 12: resolution (1088, 512), count: 771
bucket 13: resolution (1152, 512), count: 302
bucket 14: resolution (1216, 448), count: 110
bucket 15: resolution (1280, 448), count: 41
mean ar error (without repeats): 0.0
prepare accelerator
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
Using accelerator 0.15.0 or above.
loading model for process 0/1
load Diffusers pretrained models: stabilityai/stable-diffusion-2-1
text_encoder\model.safetensors not found
Fetching 16 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 16047.07it/s]
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\transformers\modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(checkpoint_file, framework="pt") as f:
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.get(instance, owner)()
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
storage = cls(wrap_storage=untyped_storage)
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\safetensors\torch.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
with safe_open(filename, framework="pt", device=device) as f:
Disable Diffusers' xformers
CrossAttention.forward has been replaced to FlashAttention (not xformers)
[Dataset 0]
caching latents.
0it [00:00, ?it/s]
prepare optimizer, data loader etc.
CUDA SETUP: Loading binary C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
running training / 学習開始
num examples / サンプル数: 18650
num batches per epoch / 1epochのバッチ数: 1563
num epochs / epoch数: 50
batch size per device / バッチサイズ: 12
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 48
gradient accumulation steps / 勾配を合計するステップ数 = 4
total optimization steps / 学習ステップ数: 19438
steps: 0%| | 0/19438 [00:00<?, ?it/s]
epoch 1/50
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
steps: 0%|▋ | 82/19438 [27:03<106:26:08, 19.80s/it, loss=nan]Traceback (most recent call last):
File "C:\Users\the beast.AUBE4\kohya_ss\fine_tune.py", line 468, in
steps: 0%|▋ | 82/19438 [27:03<106:28:05, 19.80s/it, loss=nan]
Traceback (most recent call last):
File "C:\Users\the beast.AUBE4\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\the beast.AUBE4\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "C:\Users\the beast.AUBE4\kohya_ss\venv\Scripts\accelerate.exe__main__.py", line 7, in
I can confirm that after the second epoch the error no longer occurs.
Thank you for the complete log. Since the problem does not occur with batch size 1, it seems that there is a potential problem with bucketing. I will investigate around that area.
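As a rough way to read the bucket section of the log above: if every batch is drawn from a single bucket (an assumption about how the trainer batches, not something confirmed in this thread), the number of batches per epoch is the sum of per-bucket ceilings. The sketch below uses the counts copied from the "make buckets" output; the helper name `batches_per_epoch` is made up for the example.

```python
import math

# Image counts per bucket, copied from the "make buckets" section of the log.
BUCKET_COUNTS = [6, 17, 4, 188, 83, 2348, 854, 879, 960, 7562,
                 2014, 2511, 771, 302, 110, 41]


def batches_per_epoch(counts, batch_size):
    """Each bucket is split into its own batches; the last one may be partial."""
    return sum(math.ceil(c / batch_size) for c in counts)


print(sum(BUCKET_COUNTS))                    # 18650
print(batches_per_epoch(BUCKET_COUNTS, 12))  # 1563
```

Both numbers match the log ("num examples: 18650", "num batches per epoch: 1563"), which fits the idea that batches are meant to hold one resolution each. With batch size 1, a batch can never contain two different shapes, which would explain why the error disappears there.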
thanks my man!
I got the same problem. In my case it was because there were two files with the same name but different formats in the dataset, e.g. abc.png and abc.jpg.
It looks like SDXL training currently has a problem with buckets: it doesn't work with a batch size greater than 1, and bucket_reso_steps is ignored when the buckets are loaded.
I got this error and like a previous commenter, I found that I had images with the same name but different extensions (jpg and png) that were trying to share the same NPZ file. After renaming the image files and deleting the NPZ files, I was able to train without the error.
I can confirm this. It also happened to me recently, and doing this fixed it. Thanks.
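A quick way to check a dataset for this situation is to look for image files in the same folder that differ only by extension. The sketch below is an illustration, not part of the training scripts: the function name `find_stem_collisions` and the extension list are assumptions. The earlier log in this thread also shows 18660 images found during preparation but 18650 at training time, which would be consistent with some names collapsing onto one entry, though the thread does not confirm that.

```python
from collections import defaultdict
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the dataset contains.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def find_stem_collisions(image_dir):
    """Group images by folder + file name without extension; return groups of 2+."""
    by_stem = defaultdict(list)
    for path in sorted(Path(image_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            by_stem[str(path.with_suffix(""))].append(path)
    return {stem: paths for stem, paths in by_stem.items() if len(paths) > 1}


# Example (hypothetical path):
# for stem, paths in find_stem_collisions("D:/captioning/xxxx").items():
#     print(stem, [p.name for p in paths])
```

Any group it reports would end up sharing one .npz file, so renaming one of the images and deleting the stale .npz files, as described above, is the fix to try.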
stack expects each tensor to be equal size, but got [4, 72, 120] at entry 0 and [4, 80, 112] at entry 2
I keep getting this regardless of how I set the batch size, gradient accumulation, etc.
It goes away when I have very few images, but with my current 5K+ images it has gone on for a few days, which is frustrating.
Please help.
I am using a more recent commit.
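To interpret the shapes in that message: Stable Diffusion's VAE downsamples each spatial side by a factor of 8, so a latent of shape [channels, height, width] corresponds to an image of (width * 8, height * 8) pixels. The small sketch below does that arithmetic; the helper name `latent_to_resolution` is made up for this example.

```python
VAE_SCALE = 8  # Stable Diffusion's VAE reduces height and width by 8


def latent_to_resolution(shape):
    """Map a latent shape [channels, height, width] to (width, height) in pixels."""
    _, height, width = shape
    return (width * VAE_SCALE, height * VAE_SCALE)


print(latent_to_resolution([4, 72, 120]))  # (960, 576)
print(latent_to_resolution([4, 80, 112]))  # (896, 640)
```

Both results are multiples of 64 and look like ordinary bucket resolutions, which suggests that two images assigned to different buckets (or a cached latent that no longer matches its image's bucket) ended up in the same batch. Checking for stale or shared .npz files, as described earlier in the thread, seems like a reasonable first step.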