sshh12 / terrain-diffusion

MIT License

RuntimeError: stack expects each tensor to be equal size at train_dataloader #4

Open fatemehtd opened 6 months ago

fatemehtd commented 6 months ago

I was trying to run the training script, but as soon as the training loop iterates over the train dataloader it raises a RuntimeError. I am training on 8 GPUs, and the tensor sizes reported on each GPU are different; separate attempts also produce different dimensions. Could you please let me know how to fix this error?

Resolving data files: 100%|████████████████████| 12365/12365 [00:00<00:00, 19870.97it/s]
02/06/2024 23:19:07 - INFO - __main__ - Running training
02/06/2024 23:19:07 - INFO - __main__ - Num examples = 12302
02/06/2024 23:19:07 - INFO - __main__ - Num Epochs = 100
02/06/2024 23:19:07 - INFO - __main__ - Instantaneous batch size per device = 4
02/06/2024 23:19:07 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 128
02/06/2024 23:19:07 - INFO - __main__ - Gradient Accumulation steps = 4
02/06/2024 23:19:07 - INFO - __main__ - Total optimization steps = 9700
Steps: 0%| | 0/9700 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "../train_text_to_image_lora_sd2_inpaint.py", line 1320, in <module>
    main()
  File "../train_text_to_image_lora_sd2_inpaint.py", line 1047, in main
    for step, batch in enumerate(train_dataloader):
  File "../accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)

  File ".../train_text_to_image_lora_sd2_inpaint.py", line 934, in collate_fn
    pixel_values = _collate_imgs([example["pixel_values"] for example in examples])

  File "..train_text_to_image_lora_sd2_inpaint.py", line 930, in _collate_imgs
    vals = torch.stack(vals)

[On each of the 8 GPUs the sizes in the error message are different]
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 770] at entry 0 and [3, 512, 768] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 682] at entry 0 and [3, 512, 768] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 768] at entry 0 and [3, 512, 771] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 768] at entry 0 and [3, 725, 512] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 780] at entry 0 and [3, 512, 663] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 768] at entry 0 and [3, 512, 767] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 771] at entry 0 and [3, 512, 773] at entry 1
RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 910] at entry 0 and [3, 512, 767] at entry 1
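
Editorial note: every shape in the errors above has its shorter side at exactly 512 while the longer side varies. That pattern is consistent with a resize that scales the shorter side to 512 and keeps the aspect ratio, with no crop afterwards. This is an inference from the shapes, not something confirmed from the script. The sketch below only illustrates the arithmetic; the function names and source image sizes are hypothetical, and real libraries may round slightly differently.

```python
def shorter_side_resize(width, height, target=512):
    """Scale (width, height) so the shorter side equals `target`,
    preserving aspect ratio. Illustrative only; rounding rules vary by library."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def chw_shape(width, height, target=512):
    """Shape of the resulting RGB tensor in [channels, height, width] order."""
    new_w, new_h = shorter_side_resize(width, height, target)
    return [3, new_h, new_w]


# Hypothetical source sizes with slightly different aspect ratios.
for size in [(1536, 1024), (1540, 1024), (512, 725)]:
    print(size, "->", chw_shape(*size))
```

If the dataset images do not all share one aspect ratio, tensors like these cannot be stacked into a single batch, which matches the error. Different GPUs and different runs see different shuffled samples, so the reported sizes change each time.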

sshh12 commented 6 months ago

The most likely culprit is that your images are of different sizes. In theory the script should automatically resize them to 512x512, but I would try preprocessing them to 512x512 ahead of time to see if that fixes it.

I will also add that I've never tested this on a multi-GPU setup, so other parts of the script may not work there either.
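
Editorial note: a minimal sketch of the ahead-of-time preprocessing suggested above, assuming Pillow is installed and the training images sit in a single flat folder. The helper names, the folder arguments, and the choice of a center crop followed by a Lanczos resize are all assumptions for illustration; they are not taken from the repository.

```python
from pathlib import Path


def center_crop_box(width, height):
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


def preprocess_folder(src_dir, dst_dir, size=512):
    """Write a size x size RGB PNG copy of every image found in src_dir."""
    # Pillow is imported here so center_crop_box stays dependency-free.
    from PIL import Image

    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.iterdir()):
        if path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue  # skip non-image files such as captions or metadata
        with Image.open(path) as img:
            img = img.convert("RGB")
            img = img.crop(center_crop_box(*img.size))  # img.size is (width, height)
            img = img.resize((size, size), Image.LANCZOS)
            img.save(dst / f"{path.stem}.png")
```

Calling `preprocess_folder("raw_images", "images_512")` (both folder names are placeholders) would leave every output at exactly 512x512, so `torch.stack` in the collate function receives equally sized tensors. Caption or metadata files are not copied by this sketch and would need to be carried over separately, and two inputs that differ only by extension would overwrite each other.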