kohya-ss / sd-scripts

Apache License 2.0

RuntimeError: stack expects each tensor to be equal size, but got [4, 108, 148] at entry 0 and [4, 96, 168] at entry 1 #958

Closed: jndietz closed this issue 10 months ago

jndietz commented 11 months ago

I've been trying to train a LoRA, and I'm getting the following error:

steps:   3%|████▎                                                                                                                                                | 63/2175 [03:19<1:51:31,  3.17s/it, Average key norm=0.000497, Keys Scaled=0, avr_loss=0.0661]Traceback (most recent call last):
  File "E:\github\kohya_ss\sdxl_train_network.py", line 185, in <module>
    trainer.train(args)
  File "E:\github\kohya_ss\train_network.py", line 755, in train
    for step, batch in enumerate(train_dataloader):
  File "e:\github\kohya_ss\venv\lib\site-packages\accelerate\data_loader.py", line 394, in __iter__
    next_batch = next(dataloader_iter)
  File "e:\github\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__
    data = self._next_data()
  File "e:\github\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 677, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "e:\github\kohya_ss\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "e:\github\kohya_ss\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "e:\github\kohya_ss\venv\lib\site-packages\torch\utils\data\dataset.py", line 243, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "E:\github\kohya_ss\library\train_util.py", line 1239, in __getitem__
    example["latents"] = torch.stack(latents_list) if latents_list[0] is not None else None
RuntimeError: stack expects each tensor to be equal size, but got [4, 108, 148] at entry 0 and [4, 96, 168] at entry 1
steps:   3%|████▎                                                                                                                                                | 63/2175 [03:20<1:51:46,  3.18s/it, Average key norm=0.000497, Keys Scaled=0, avr_loss=0.0661]
Traceback (most recent call last):
  File "C:\Users\Jared\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Jared\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\github\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "e:\github\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 47, in main
    args.func(args)
  File "e:\github\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 986, in launch_command
    simple_launcher(args)
  File "e:\github\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['e:\\github\\kohya_ss\\venv\\Scripts\\python.exe', './sdxl_train_network.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=E:\\github\\stable-diffusion-webui\\models\\Stable-diffusion\\sdxl\\sd_xl_base_1.0_0.9vae.safetensors', '--train_data_dir=C:\\training\\person-xl\\img', '--resolution=1024,1024', '--output_dir=C:\\training\\person-xl\\lora-output', '--logging_dir=C:\\training\\person-xl\\logging', '--network_alpha=128', '--save_model_as=safetensors', '--network_module=networks.lora', '--unet_lr=1.0', '--network_train_unet_only', '--network_dim=128', '--output_name=person-xl-2.0', '--lr_scheduler_num_cycles=100', '--scale_weight_norms=1', '--network_dropout=0.1', '--cache_text_encoder_outputs', '--no_half_vae', '--lr_scheduler=cosine', '--lr_warmup_steps=218', '--train_batch_size=4', '--max_train_steps=2175', '--save_every_n_epochs=10', '--mixed_precision=bf16', '--save_precision=bf16', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=Prodigy', '--optimizer_args', 'weight_decay=0.05', 'betas=0.9,0.98', '--max_data_loader_n_workers=0', '--keep_tokens=1', '--bucket_reso_steps=32', '--min_snr_gamma=5', '--gradient_checkpointing', '--xformers', '--noise_offset=0.0357', '--adaptive_noise_scale=0.00357', '--log_prefix=xl-lora', '--sample_sampler=euler_a', '--sample_prompts=C:\\training\\person-xl\\lora-output\\sample\\prompt.txt', '--sample_every_n_steps=25']' returned non-zero exit status 1.

I have a feeling one of the images is causing the issue. Is there a way to figure out which one?

rockerBOO commented 11 months ago

In https://github.com/kohya-ss/sd-scripts/blob/main/library/train_util.py#L1146

# temporary debug print: report which image produces latents of the odd size from the error
if torch.Size([4, 96, 168]) == latents.size():
    print(image_info.absolute_path)

Something like that.

It looks like it's happening immediately, so maybe it's something related to cache_latents and bucket_no_upscale.
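
Since --cache_latents_to_disk is in your command line, another option is to inspect the cached latent .npz files directly. A rough standalone sketch (the data path is copied from your --train_data_dir; it doesn't assume the array key names inside the .npz, it just prints every array's shape so the odd one can be matched to its base file name):

import glob
import os

import numpy as np

train_data_dir = r"C:\training\person-xl\img"  # from --train_data_dir above

# walk the dataset and print the shape of every array in every cached .npz
for npz_path in glob.glob(os.path.join(train_data_dir, "**", "*.npz"), recursive=True):
    with np.load(npz_path) as npz:
        print(npz_path, {key: npz[key].shape for key in npz.files})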

kohya-ss commented 10 months ago

One possible reason is that there are two or more image files with the same base name but different extensions, for example "aaa.jpg" and "aaa.png". In this case, the latent cache file gets overwritten and ends up with an invalid shape for one of the images.

Unfortunately, the script does not check for this, so please verify your dataset.
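
A quick way to check might be a small standalone script like the sketch below (the dataset path and extension list are assumptions, adjust them to your setup); it lists any image files that share a base name but differ in extension:

import os
from collections import defaultdict

train_data_dir = r"C:\training\person-xl\img"  # from --train_data_dir above
image_exts = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}

# group image files by directory + base name, then report any collisions
by_base = defaultdict(list)
for root, _, files in os.walk(train_data_dir):
    for name in files:
        base, ext = os.path.splitext(name)
        if ext.lower() in image_exts:
            by_base[os.path.join(root, base)].append(name)

for base, names in by_base.items():
    if len(names) > 1:
        print("duplicate base name:", base, "->", names)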

jndietz commented 10 months ago

You guys were right. Just renamed my files and the training completed. Thank you!

x-name commented 10 months ago

Looks like there is currently a problem with buckets for SDXL training: it doesn't work with a batch size > 1, and when loading the buckets it ignores bucket_reso_steps.