kohya-ss / sd-scripts

Apache License 2.0

stack expects each tensor to be equal size #522

Open jacquesfeng123 opened 1 year ago

jacquesfeng123 commented 1 year ago

```
stack expects each tensor to be equal size, but got [4, 72, 120] at entry 0 and [4, 80, 112] at entry 2
```

I keep getting this regardless of how I set up batch size, gradient accumulation, etc.

It goes away when I have very few images, but with my current 5K+ images it has been happening for a few days and is frustrating.

Please help.

I am using a fairly recent commit.
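
For reference, the failure is simply torch.stack being asked to collate cached latents of different spatial sizes into one batch. A minimal sketch reproducing it with the shapes from the error above (the image resolutions in the comments are an assumption based on the VAE's 8x downscale, not taken from this dataset):

```python
import torch

# Cached latents have shape [channels, height/8, width/8], so images from
# different aspect-ratio buckets yield different latent sizes.
lat_a = torch.zeros(4, 72, 120)  # e.g. a 576x960 (HxW) image -> [4, 72, 120]
lat_b = torch.zeros(4, 80, 112)  # e.g. a 640x896 (HxW) image -> [4, 80, 112]

# Collating them into one batch fails exactly as reported:
torch.stack([lat_a, lat_b])
# RuntimeError: stack expects each tensor to be equal size,
# but got [4, 72, 120] at entry 0 and [4, 80, 112] at entry 1
```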

kohya-ss commented 1 year ago

If you are using a metadata .json file, you may get this error if you change the resolution in the dataset config after creating the metadata. Please delete the *.npz files and re-run prepare_buckets_latents.py.
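
One way to clear that cache from Python, assuming the cached *.npz files sit next to the images in the training folder (the path below is a placeholder, not one from this issue):

```python
from pathlib import Path

image_dir = Path("D:/captioning/your_dataset")  # placeholder: your image folder

# Delete stale cached latents so they are regenerated at the new resolution.
for npz in image_dir.rglob("*.npz"):
    npz.unlink()

# Afterwards, re-run finetune/prepare_buckets_latents.py to rebuild the
# metadata and the latent cache.
```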

If you are training without the metadata file, could you please share your training settings?

jacquesfeng123 commented 1 year ago

> If you are using a metadata .json file, you may get this error if you change the resolution in the dataset config after creating the metadata. Please delete the *.npz files and re-run prepare_buckets_latents.py.
>
> If you are training without the metadata file, could you please share your training settings?

Hey, thanks for replying, kohya. I put this aside for a week; sorry for the late reply.

I followed your suggestion, deleting all the .npz files and re-running the bucket preparation, but that step is integrated into bmaltais's process.

It still produces the same error.

Here are my settings:

{ "pretrained_model_name_or_path": "stabilityai/stable-diffusion-2-1", "v2": true, "v_parameterization": true, "train_dir": "D:/SS_kohya/111restart/config", "image_folder": "D:/captioning/xxxx", "output_dir": "D:/SS_kohya/111restart/model_output", "logging_dir": "D:/SS_kohya/111restart/log", "max_resolution": "768,768", "min_bucket_reso": "448", "max_bucket_reso": "1280", "batch_size": "16", "flip_aug": false, "caption_metadata_filename": "meta_cap_try1.json", "latent_metadata_filename": "meta_lat_try1.json", "full_path": false, "learning_rate": 5e-08, "lr_scheduler": "cosine", "lr_warmup": 0, "dataset_repeats": "1", "train_batch_size": 16, "epoch": 50, "save_every_n_epochs": 1, "mixed_precision": "bf16", "save_precision": "bf16", "seed": "", "num_cpu_threads_per_process": 24, "train_text_encoder": false, "create_caption": true, "create_buckets": true, "save_model_as": "safetensors", "caption_extension": ".txt", "xformers": false, "clip_skip": 2, "save_state": true, "resume": "", "gradient_checkpointing": true, "gradient_accumulation_steps": 6.0, "mem_eff_attn": true, "shuffle_caption": true, "output_name": "xxxx", "max_token_length": "150", "max_train_epochs": "", "max_data_loader_n_workers": "", "full_fp16": false, "color_aug": false, "model_list": "stabilityai/stable-diffusion-2-1", "cache_latents": true, "cache_latents_to_disk": false, "use_latent_files": "Yes", "keep_tokens": 0, "persistent_data_loader_workers": true, "bucket_no_upscale": false, "random_crop": false, "bucket_reso_steps": 64.0, "caption_dropout_every_n_epochs": 0.0, "caption_dropout_rate": 0.05, "optimizer": "AdamW8bit", "optimizer_args": "", "noise_offset_type": "Original", "noise_offset": 0.05, "adaptive_noise_scale": 0, "multires_noise_iterations": 0, "multires_noise_discount": 0, "sample_every_n_steps": 500, "sample_every_n_epochs": 1, "sample_sampler": "euler_a", "sample_prompts": "xxxx", "additional_parameters": "", "vae_batch_size": 16, "min_snr_gamma": 0, "weighted_captions": false, "save_every_n_steps": 1500, "save_last_n_steps": 0, "save_last_n_steps_state": 0, "use_wandb": "", "wandb_api_key": "False" }

jacquesfeng123 commented 1 year ago

Just to clarify: I don't get this error if I train with only 10 images.

But I do get it when I use all the images, so it feels like a data input problem.

I simply click "Start training" in the UI of bmaltais's repo, so the process is handled internally and I don't touch it at all. I should not have changed the resolution in the dataset.

I suspect it's an issue with switching between buckets, or with shuffling?

kohya-ss commented 1 year ago

Thank you for sharing the settings. The settings seem to be OK.

I am also considering the possibility of a problem with bucketing. I will investigate. Could you please share the complete stack trace of the error?

Also, just to confirm: would this problem still occur if you set the batch size to 1?

jacquesfeng123 commented 1 year ago

Below is the complete log after starting training. I will try running again with batch size = 1 and get back to you later.

```
found 18660 images.
new metadata will be created / 新しいメタデータファイルが作成されます
merge caption texts to metadata json.
100%|██████████| 18660/18660 [00:01<00:00, 11361.83it/s]
writing metadata: D:/SS_kohya/111restart/config/meta_cap_try3.json
done!
./venv/Scripts/python.exe finetune/prepare_buckets_latents.py "D:/captioning/Architecture_for_ss_kohya" "D:/SS_kohya/111restart/config/meta_cap_try3.json" "D:/SS_kohya/111restart/config/meta_lat_try3.json" "stabilityai/stable-diffusion-2-1" --batch_size=12 --max_resolution=768,768 --min_bucket_reso=448 --max_bucket_reso=1280 --mixed_precision=fp16
found 18660 images.
loading existing metadata: D:/SS_kohya/111restart/config/meta_cap_try3.json
load VAE: stabilityai/stable-diffusion-2-1
exception occurs in loading vae: stabilityai/stable-diffusion-2-1 does not appear to have a file named diffusion_pytorch_model.bin.
retry with subfolder='vae'
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\safetensors\torch.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
100%|██████████| 18660/18660 [21:02<00:00, 14.78it/s]
bucket 0 (448, 1216): 6
bucket 1 (512, 1088): 17
bucket 2 (512, 1152): 4
bucket 3 (576, 960): 188
bucket 4 (576, 1024): 83
bucket 5 (640, 896): 2348
bucket 6 (704, 832): 854
bucket 7 (768, 768): 880
bucket 8 (832, 704): 962
bucket 9 (896, 640): 7567
bucket 10 (960, 576): 2016
bucket 11 (1024, 576): 2511
bucket 12 (1088, 512): 771
bucket 13 (1152, 512): 302
bucket 14 (1216, 448): 110
bucket 15 (1280, 448): 41
mean ar error: 0.06202298430829582
writing metadata: D:/SS_kohya/111restart/config/meta_lat_try3.json
done!
image_num = 18660
repeats = 18660
max_train_steps = 19438
lr_warmup_steps = 0
accelerate launch --num_cpu_threads_per_process=24 "./fine_tune.py" --v2 --v_parameterization --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" --in_json="D:/SS_kohya/111restart/config/meta_lat_try3.json" --train_data_dir="D:/captioning/Architecture_for_ss_kohya" --output_dir="D:/SS_kohya/111restart/model_output" --logging_dir="D:/SS_kohya/111restart/log" --dataset_repeats=1 --learning_rate=5e-08 --enable_bucket --resolution=768,768 --min_bucket_reso=448 --max_bucket_reso=1280 --save_model_as=safetensors --gradient_accumulation_steps=4 --output_name="AIRI_aube_edu_resi_test1_settings_search_restart" --max_token_length=150 --learning_rate="5e-08" --lr_scheduler="cosine" --train_batch_size="12" --max_train_steps="19438" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="AdamW8bit" --max_token_length=150 --clip_skip=2 --caption_dropout_rate="0.05" --vae_batch_size="16" --bucket_reso_steps=64 --save_every_n_steps="1500" --save_state --mem_eff_attn --shuffle_caption --gradient_checkpointing --persistent_data_loader_workers --noise_offset=0.05 --wandb_api_key="False" --sample_sampler=euler_a --sample_prompts="D:/SS_kohya/111restart/model_output\sample\prompt.txt" --sample_every_n_epochs="1" --sample_every_n_steps="500"
v2 with clip_skip will be unexpected / v2でclip_skipを使用することは想定されていません
prepare tokenizer
update token length: 150
loading existing metadata: D:/SS_kohya/111restart/config/meta_lat_try3.json
using bucket info in metadata / メタデータ内のbucket情報を使います
[Dataset 0]
  batch_size: 12
  resolution: (768, 768)
  enable_bucket: True
  min_bucket_reso: None
  max_bucket_reso: None
  bucket_reso_steps: None
  bucket_no_upscale: None

  [Subset 0 of Dataset 0]
    image_dir: "D:/captioning/Architecture_for_ss_kohya"
    image_count: 18650
    num_repeats: 1
    shuffle_caption: True
    keep_tokens: 0
    caption_dropout_rate: 0.05
    caption_dropout_every_n_epoches: 0
    caption_tag_dropout_rate: 0.0
    color_aug: False
    flip_aug: False
    face_crop_aug_range: None
    random_crop: False
    token_warmup_min: 1,
    token_warmup_step: 0,
    metadata_file: D:/SS_kohya/111restart/config/meta_lat_try3.json

[Dataset 0]
loading image sizes.
100%|██████████| 18650/18650 [00:00<00:00, 6237442.76it/s]
make buckets
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (448, 1216), count: 6
bucket 1: resolution (512, 1088), count: 17
bucket 2: resolution (512, 1152), count: 4
bucket 3: resolution (576, 960), count: 188
bucket 4: resolution (576, 1024), count: 83
bucket 5: resolution (640, 896), count: 2348
bucket 6: resolution (704, 832), count: 854
bucket 7: resolution (768, 768), count: 879
bucket 8: resolution (832, 704), count: 960
bucket 9: resolution (896, 640), count: 7562
bucket 10: resolution (960, 576), count: 2014
bucket 11: resolution (1024, 576), count: 2511
bucket 12: resolution (1088, 512), count: 771
bucket 13: resolution (1152, 512), count: 302
bucket 14: resolution (1216, 448), count: 110
bucket 15: resolution (1280, 448), count: 41
mean ar error (without repeats): 0.0
prepare accelerator
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
  warnings.warn(
Using accelerator 0.15.0 or above.
loading model for process 0/1
load Diffusers pretrained models: stabilityai/stable-diffusion-2-1
text_encoder\model.safetensors not found
Fetching 16 files: 100%|██████████| 16/16 [00:00<00:00, 16047.07it/s]
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\transformers\modeling_utils.py:402: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(checkpoint_file, framework="pt") as f:
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\safetensors\torch.py:98: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
Disable Diffusers' xformers
CrossAttention.forward has been replaced to FlashAttention (not xformers)
[Dataset 0]
caching latents.
0it [00:00, ?it/s]
prepare optimizer, data loader etc.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

CUDA SETUP: Loading binary C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
running training / 学習開始
  num examples / サンプル数: 18650
  num batches per epoch / 1epochのバッチ数: 1563
  num epochs / epoch数: 50
  batch size per device / バッチサイズ: 12
  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 48
  gradient accumulation steps / 勾配を合計するステップ数 = 4
  total optimization steps / 学習ステップ数: 19438
steps:   0%|          | 0/19438 [00:00<?, ?it/s]
epoch 1/50
C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
steps:   0%|▋ | 82/19438 [27:03<106:26:08, 19.80s/it, loss=nan]
Traceback (most recent call last):
  File "C:\Users\the beast.AUBE4\kohya_ss\fine_tune.py", line 468, in <module>
    train(args)
  File "C:\Users\the beast.AUBE4\kohya_ss\fine_tune.py", line 276, in train
    for step, batch in enumerate(train_dataloader):
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\accelerate\data_loader.py", line 388, in __iter__
    next_batch = next(dataloader_iter)
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 634, in __next__
    data = self._next_data()
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1326, in _next_data
    return self._process_data(data)
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1372, in _process_data
    data.reraise()
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\utils\data\_utils\worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\torch\utils\data\dataset.py", line 243, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "C:\Users\the beast.AUBE4\kohya_ss\library\train_util.py", line 1000, in __getitem__
    example["latents"] = torch.stack(latents_list) if latents_list[0] is not None else None
RuntimeError: stack expects each tensor to be equal size, but got [4, 80, 112] at entry 0 and [4, 72, 120] at entry 8

steps:   0%|▋ | 82/19438 [27:03<106:28:05, 19.80s/it, loss=nan]
Traceback (most recent call last):
  File "C:\Users\the beast.AUBE4\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\the beast.AUBE4\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 923, in launch_command
    simple_launcher(args)
  File "C:\Users\the beast.AUBE4\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 579, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\the beast.AUBE4\kohya_ss\venv\Scripts\python.exe', './fine_tune.py', '--v2', '--v_parameterization', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1', '--in_json=D:/SS_kohya/111restart/config/meta_lat_try3.json', '--train_data_dir=D:/captioning/Architecture_for_ss_kohya', '--output_dir=D:/SS_kohya/111restart/model_output', '--logging_dir=D:/SS_kohya/111restart/log', '--dataset_repeats=1', '--learning_rate=5e-08', '--enable_bucket', '--resolution=768,768', '--min_bucket_reso=448', '--max_bucket_reso=1280', '--save_model_as=safetensors', '--gradient_accumulation_steps=4', '--output_name=AIRI_aube_edu_resi_test1_settings_search_restart', '--max_token_length=150', '--learning_rate=5e-08', '--lr_scheduler=cosine', '--train_batch_size=12', '--max_train_steps=19438', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=AdamW8bit', '--max_token_length=150', '--clip_skip=2', '--caption_dropout_rate=0.05', '--vae_batch_size=16', '--bucket_reso_steps=64', '--save_every_n_steps=1500', '--save_state', '--mem_eff_attn', '--shuffle_caption', '--gradient_checkpointing', '--persistent_data_loader_workers', '--noise_offset=0.05', '--wandb_api_key=False', '--sample_sampler=euler_a', '--sample_prompts=D:/SS_kohya/111restart/model_output\sample\prompt.txt', '--sample_every_n_epochs=1', '--sample_every_n_steps=500']' returned non-zero exit status 1.
```

jacquesfeng123 commented 1 year ago

I can confirm that after the second epoch, the error no longer appears.

kohya-ss commented 1 year ago

Thank you for the complete log. Since the problem does not occur with batch size 1, it seems that there is a potential problem with bucketing. I will investigate around that area.

jacquesfeng123 commented 1 year ago

thanks my man!

slashedstar commented 1 year ago

I got the same problem. In my case it was because there were two files with the same name but different formats in the dataset, e.g. abc.png and abc.jpg.
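
A quick sketch for checking a dataset for this kind of basename collision before training (the folder path and extension list are assumptions; adjust to your setup):

```python
from collections import defaultdict
from pathlib import Path

image_dir = Path("D:/captioning/your_dataset")  # placeholder: your image folder
exts = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}  # assumed image extensions

# Group image files by basename; two files like abc.png and abc.jpg would
# both map to the same cached abc.npz.
by_stem = defaultdict(list)
for path in image_dir.rglob("*"):
    if path.suffix.lower() in exts:
        by_stem[path.stem].append(path.name)

for stem, names in sorted(by_stem.items()):
    if len(names) > 1:
        print(f"name collision for '{stem}': {names}")
```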

x-name commented 9 months ago

> Thank you for sharing the settings. The settings seem to be OK.
>
> I am also considering the possibility of a problem with bucketing. I will investigate. Could you please share the complete stack trace of the error?
>
> Also, just to confirm: would this problem still occur if you set the batch size to 1?

It looks like there is now a problem with buckets for SDXL training, because it doesn't work with batch size > 1 and it ignores bucket_reso_steps when loading buckets.

tomhuze commented 8 months ago

I got this error and like a previous commenter, I found that I had images with the same name but different extensions (jpg and png) that were trying to share the same NPZ file. After renaming the image files and deleting the NPZ files, I was able to train without the error.

DEX-1101 commented 7 months ago

> I got this error and like a previous commenter, I found that I had images with the same name but different extensions (jpg and png) that were trying to share the same NPZ file. After renaming the image files and deleting the NPZ files, I was able to train without the error.

I can confirm this; it also happened to me recently and was fixed by doing this. Thanks!