Shape cannot match the size during training

cqbu commented 9 months ago

During training, I got this error in the backbone:

  File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 694, in forward
    value = value.reshape(B, self.num_heads, self.value_channels//self.num_heads, n_l)
RuntimeError: shape '[24, 1, 96, 40]' is invalid for input of size 368640

This happened in SpatialImageLanguageAttention. I found that num_heads is 1, so this is not really multi-head attention, right? I can't tell whether the target shape or the input size is wrong. What shape or size is expected here?
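The numbers in the error message can be checked by hand. The following is a minimal sketch in plain Python; the helper name `reshape_mismatch` is made up for illustration and is not part of MeViS or PyTorch. It compares the element count implied by the target shape with the reported input size.

```python
from math import prod


def reshape_mismatch(target_shape, input_size):
    """Return (elements implied by target_shape, ratio of actual to expected)."""
    expected = prod(target_shape)
    return expected, input_size / expected


# Values copied from the RuntimeError in this issue.
expected, ratio = reshape_mismatch((24, 1, 96, 40), 368640)
print(expected, ratio)  # 92160 4.0
```

The tensor holds exactly 4 times as many elements as the target shape asks for. That suggests one of the dimensions (most likely the batch dimension B or the language length n_l) is 4 times larger on one side than the reshape assumes, although the traceback alone does not say which one.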

The full error message is below:

Traceback (most recent call last):
  File "train_net_lmpm.py", line 318, in <module>
    launch(
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/launch.py", line 69, in launch
    mp.start_processes(
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/launch.py", line 123, in _distributed_worker
    main_func(*args)
  File "/root/MeViS/train_net_lmpm.py", line 312, in main
    return trainer.train()
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 155, in train
    self.run_step()
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 494, in run_step
    loss_dict = self.model(data)
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/MeViS/lmpm/lmpm_model.py", line 281, in forward
    return self.train_model(batched_inputs)
  File "/root/MeViS/lmpm/lmpm_model.py", line 312, in train_model
    features = self.backbone(images.tensor, lang_feat_sentence, lang_mask)
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 785, in forward
    y = super().forward(x, l, l_mask)
  File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 470, in forward
    x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww, l, l_mask)
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 590, in forward
    x_residual = self.fusion(x, l, l_mask)
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 627, in forward
    lang = self.image_lang_att(x, l, l_mask)  # (B, H*W, dim)
  File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 694, in forward
    value = value.reshape(B, self.num_heads, self.value_channels//self.num_heads, n_l)
RuntimeError: shape '[24, 1, 96, 40]' is invalid for input of size 368640

cqbu commented 9 months ago

By the way, I found that build_batch_data_loader is called without the prefetch_factor parameter. In detectron2 its default value is None, which causes an error in torch's DataLoader at assert prefetch_factor > 0, because None cannot be compared with an int.
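One possible workaround, sketched here under assumptions, is to leave prefetch_factor out of the DataLoader arguments whenever it is None so that the DataLoader falls back to its own default. The helper `loader_kwargs` below is hypothetical and does not exist in detectron2 or MeViS; it only illustrates the idea with a plain dictionary.

```python
def loader_kwargs(num_workers, prefetch_factor=None, **extra):
    """Build DataLoader keyword arguments, omitting prefetch_factor when unset.

    Hypothetical helper for illustration only.
    """
    kwargs = dict(num_workers=num_workers, **extra)
    if prefetch_factor is not None:
        # Forward the value only when the caller actually provided one.
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


# The comparison that fails when None is forwarded unchanged:
try:
    assert None > 0
except TypeError as exc:
    print(type(exc).__name__)  # TypeError

print(loader_kwargs(4))                     # {'num_workers': 4}
print(loader_kwargs(4, prefetch_factor=2))  # {'num_workers': 4, 'prefetch_factor': 2}
```

The resulting dictionary could then be unpacked into the DataLoader call. Whether this is needed at all depends on the installed torch and detectron2 versions.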

heshuting555 commented 9 months ago

Try running on multiple GPUs and the error will go away!

cilinyan commented 8 months ago

One simple approach is to ensure that only one video is trained on each GPU.

If you want to train multiple videos on each GPU, you may need to modify several parts of the code, such as this.
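The one-video-per-GPU condition can be checked before launching. detectron2's build_batch_data_loader splits the total batch size evenly across workers, so the per-GPU count is the total divided by the number of GPUs. The sketch below assumes the total comes from the usual SOLVER.IMS_PER_BATCH setting and the GPU count from --num-gpus; the function name `videos_per_gpu` is made up for illustration.

```python
def videos_per_gpu(total_batch_size, num_gpus):
    """Per-GPU batch size when the total batch is split evenly across GPUs."""
    if num_gpus <= 0 or total_batch_size <= 0:
        raise ValueError("batch size and GPU count must be positive")
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return total_batch_size // num_gpus


print(videos_per_gpu(8, 8))  # 1, one video per GPU
print(videos_per_gpu(8, 2))  # 4, several videos share each GPU
```

If the result is greater than 1, either lowering the total batch size to match the number of GPUs or adding GPUs brings it back to one video per GPU.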

wwyy1234 commented 2 months ago

Excuse me, have you solved this problem? I encountered the same issue. I'm using two GPUs. Could you please let me know how you resolved it?

henghuiding / MeViS

Shape cannot match the size during training #3