gmongaras / Diffusion_models_from_scratch

Creating a diffusion model from scratch in PyTorch to learn exactly how they work.
MIT License

RuntimeError: einsum() operand subscript must be in range [a, z] but found C for operand 1 #6

Open LiuTingWed opened 1 year ago

LiuTingWed commented 1 year ago

When I run this code: X = torch.einsum("nclw, ncC -> nclw", X, KQ), an issue occurs as the title describes.

Any tips?

gmongaras commented 1 year ago

Thanks for letting me know! Are you using an older version of PyTorch? I think einsum used to be limited to lowercase characters as this GitHub issue shows: https://github.com/pytorch/pytorch/issues/21412

I just pushed an update to the main branch that only uses lowercase letters which should fix the issue. However, I plan to make several changes today that expand the README and change train.py and infer.py to take in command line arguments, then merge these changes into main which may cause some conflict issues if you are using the code at the moment.

LiuTingWed commented 1 year ago

Thanks for the reply. I simply replaced the "C" with the lowercase letter "c", and then it works. By the way, does DDIM need a lot of GPU memory? I used two 2080Tis to run this project, but it still runs out of memory unless I decrease num_res_blocks to 2.

gmongaras commented 1 year ago

Changing capital C to lowercase c will probably run into errors since the einsum will do the multiplication incorrectly. Try changing it from capital C to lowercase d: X = torch.einsum("nclw, ncd -> nclw", X, KQ). I also pushed code to the main branch which should have fixed the issue.
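
To see why the lowercase c substitution silently changes the math, here is a small standalone sketch (random tensors with made-up shapes, not the actual model code): a repeated letter tells einsum to take a diagonal, while an unused letter like d is summed over.

```python
import torch

# Made-up shapes: n=2 batch, c=4 channels, l=w=8 spatial; KQ mixes the channel dim
X = torch.randn(2, 4, 8, 8)
KQ = torch.randn(2, 4, 4)

# Intended contraction: the unused letter d is summed over
out_d = torch.einsum("nclw, ncd -> nclw", X, KQ)

# Repeating c instead extracts the diagonal of KQ, which is a different operation
out_c = torch.einsum("nclw, ncc -> nclw", X, KQ)

print(torch.allclose(out_d, out_c))  # False in general
```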

As for the GPU issue, training a model takes a lot of GPU memory. For example, training a model with a batch size of 32 with block types of ["res", "conv", "clsAtn", "atn", "chnAtn"] and the default parameters takes about 18.5 GB of VRAM on a single GPU. If you reduce the batch size, there may be too much noise in the training process, leading to poor training performance. To combat the small batch size, you can always increase the numSteps parameter. This parameter allows you to take multiple steps before updating the model and emulates a larger batch size, but it requires numSteps backward passes per update, which increases the time to train by numSteps times. You can also try reducing the feature size (embCh), which should help fit the model on your machine.
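
For reference, this numSteps-style gradient accumulation generally looks like the following minimal sketch (the model, optimizer, and data here are placeholders, not the repo's actual training loop):

```python
import torch
import torch.nn.functional as F

num_steps = 4  # accumulate 4 micro-batches to emulate a 4x larger batch

model = torch.nn.Linear(16, 16)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(100)]  # placeholder data

optimizer.zero_grad()
for i, (x, target) in enumerate(loader):
    loss = F.mse_loss(model(x), target)
    (loss / num_steps).backward()      # gradients accumulate across micro-batches
    if (i + 1) % num_steps == 0:
        optimizer.step()               # one parameter update per num_steps backward passes
        optimizer.zero_grad()
```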

I trained some pre-trained models with 8 GPUs with a batch size of 128 which is about a total batch size of 1024. Even at this scale, the model still took about 8 days to train to 600K steps and it could've probably trained for longer. I'm worried that on 2 2080Tis, the model will not be able to train quickly enough.

I provide some checkpoints for the pre-trained models. You may be able to fine-tune or continue training one of these checkpoints on your system since I provide some of the optimizer parameters allowing for a seamless restart in training.
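
In case it helps, resuming with both model and optimizer state usually looks roughly like this (the file name, dictionary keys, and placeholder model below are illustrative, not the repo's exact checkpoint format):

```python
import torch

model = torch.nn.Linear(16, 16)  # placeholder; use the same architecture the checkpoint expects
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Saving (at training time): keep both model and optimizer states plus the step count
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": 600_000}, "checkpoint.pt")

# Resuming: restoring the optimizer state lets Adam's moment estimates carry over
ckpt = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
start_step = ckpt["step"]
```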

You may also be able to change the image size from 64x64 to 32x32 which should drastically reduce memory and drastically reduce the model size needed to model that data. I haven't tried this out, so it may take a little code fiddling to get it working.
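
The data-side part of that change is usually just a transform, along the lines of this generic torchvision sketch (the model's resolution-dependent parameters would also need adjusting; this is not the repo's actual loader):

```python
import torchvision.transforms as T

# Hypothetical pipeline producing 32x32 tensors scaled to [-1, 1]
transform = T.Compose([
    T.Resize(32),
    T.CenterCrop(32),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```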

If you are looking to just generate images, then you can use infer.py. Inference takes a lot less memory since the model is already trained and it's only producing a single image. On my system, it takes 3.3 GB of VRAM.

Hope this helps!

LiuTingWed commented 1 year ago

Thanks a lot! This reply is so detailed; I have never received one like this before. GOOD JOB!

LiuTingWed commented 1 year ago

Sir, I read your blog about DDPM. I am a fan of yours now and your blog taught me a lot. But I ran into some problems when moving this project to a segmentation task. Still 2 2080Tis, dataset size = 10000+, input size = 256x256, epochs = 30, t = 30. The performance is too bad; the output is completely noise when inference is done. And I found the training loss values are weird, like this:

2023-03-15 16:29:59,169 Epoch: 1
2023-03-15 16:30:01,907 train: [ 1/30] Step 000/2638 SalLoss 0.303
2023-03-15 16:30:20,603 train: [ 1/30] Step 050/2638 SalLoss 0.439
2023-03-15 16:30:39,186 train: [ 1/30] Step 100/2638 SalLoss 0.040
2023-03-15 16:30:57,614 train: [ 1/30] Step 150/2638 SalLoss 0.058
2023-03-15 16:31:16,081 train: [ 1/30] Step 200/2638 SalLoss 0.656
2023-03-15 16:31:35,047 train: [ 1/30] Step 250/2638 SalLoss 0.021
2023-03-15 16:31:53,627 train: [ 1/30] Step 300/2638 SalLoss 0.006
2023-03-15 16:32:13,064 train: [ 1/30] Step 350/2638 SalLoss 0.434
2023-03-15 16:32:33,673 train: [ 1/30] Step 400/2638 SalLoss 0.007
2023-03-15 16:32:52,878 train: [ 1/30] Step 450/2638 SalLoss 0.088
2023-03-15 16:33:12,286 train: [ 1/30] Step 500/2638 SalLoss 0.551

I changed the lr to a smaller value, but the loss is still weird:

2023-03-16 17:13:09,250 train: [ 1/30] Step 650/2638 SalLoss 0.014
2023-03-16 17:13:29,606 train: [ 1/30] Step 700/2638 SalLoss 0.013
2023-03-16 17:13:49,984 train: [ 1/30] Step 750/2638 SalLoss 0.015
2023-03-16 17:14:11,109 train: [ 1/30] Step 800/2638 SalLoss 0.096
2023-03-16 17:14:31,496 train: [ 1/30] Step 850/2638 SalLoss 0.016
2023-03-16 17:14:52,608 train: [ 1/30] Step 900/2638 SalLoss 0.014
2023-03-16 17:15:13,063 train: [ 1/30] Step 950/2638 SalLoss 0.236
2023-03-16 17:15:34,633 train: [ 1/30] Step 1000/2638 SalLoss 0.021
2023-03-16 17:15:56,999 train: [ 1/30] Step 1050/2638 SalLoss 1.709
2023-03-16 17:16:18,445 train: [ 1/30] Step 1100/2638 SalLoss 0.008
2023-03-16 17:16:40,464 train: [ 1/30] Step 1150/2638 SalLoss 2.499

So I am wondering, could you share your running log so I can check the loss values? Or maybe you could give me some tips to handle my problem.

gmongaras commented 1 year ago

I added a training log here: https://github.com/gmongaras/Diffusion_models_from_scratch/blob/main/results/res_res_partial_log.out

What do you mean when you say you moved this project to a segmentation task? Are you using a pre-trained model and finetuning it in another repo to perform image segmentation?

If you are trying to train a model from scratch in this repo, I think the issue may come down to three factors:

LiuTingWed commented 1 year ago

Good suggestion! I have already expanded T to 1000, and luckily I got 8 V100 cards (excited laughter). I'm training now. But I'm still learning your code. One thing I don't quite understand is this:

if is_main_process():
    print(f"Loss at epoch #{epoch}, step #{num_steps}, update #{num_steps/self.numSteps}\n"+\
          f"Combined: {round(self.losses_comb[-10:].mean(), 4)}    "\
          f"Mean: {round(self.losses_mean[-10:].mean(), 4)}    "\
          f"Variance: {round(self.losses_var[-10:].mean(), 6)}\n\n")

Why does the slice start from the tenth-to-last element? It seems a bit odd to output a value based on the last ten steps in the record.

gmongaras commented 1 year ago

That's just for logging. Instead of outputting the latest loss value (-1), I output the mean of the latest 10 losses (-10:) to reduce noise in the output loss value.
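
A tiny standalone illustration of the difference (not the repo's code):

```python
import numpy as np

losses = np.array([0.9, 0.2, 1.4, 0.1, 0.3, 0.05, 0.8, 0.07, 0.6, 0.1])

print(losses[-1])           # latest loss only: noisy
print(losses[-10:].mean())  # mean of the last 10 losses: a smoother logging signal
```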

LiuTingWed commented 1 year ago

Hello, I think you understand diffusion better than I do, so I want to discuss with you and see if you can solve my problem. Here is the situation: when I use diffusion for a segmentation task, I find that loading a checkpoint trained on the server side (2 x 4090, PyTorch 1.9) on the local host (2 x 2080Ti, PyTorch 1.8) cannot achieve the same performance as on the server (Dice 84 vs. 81). This problem puzzles me a lot. In order to find the best checkpoint, I follow the approach of using DDIM accelerated sampling to run inference after every 2 epochs of training, and then testing to get results. The problem is: during training on the server side, the test metric can reach 84, but when loading the checkpoint separately on the server side for testing, it is 82. More strangely, when testing on the local host, it is 81. The difference in these metrics is hard for me to understand. I suspect it may be because diffusion needs to initialize random noise? However, I still encounter this problem even when setting the same random seed. In fact, regarding DDIM, choosing a different iteration batch size each time also leads to slightly different performance, which puzzles me as well. I look forward to your reply.
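
For context, the seeding I mean is roughly the standard full setup below (a generic PyTorch sketch, not my exact code):

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 0):
    # Seed every RNG the data pipeline and the DDIM sampler might touch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic kernels where available (may run slower)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(0)
```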