suprosanna / relationformer

Apache License 2.0

Input type and weight type error in scene graph code #16

Open Yassin-fan opened 1 year ago

Yassin-fan commented 1 year ago

Hi, I installed the code with Python 3.8, PyTorch 1.8.0, and CUDA 11. The Debug Dataloader and Debug Model parts of debug_relationformer.ipynb run fine. However, when I run train.py with "nohup python3 train.py --config configs/scene_2d.yaml --cuda_visible_device 0 1 2 --exp_name VGtest1 --nproc_per_node 3 --b 16 &> log/Muti.out&", I get the following error:

* Config file configs/scene_2d.yaml
Experiment Name : VGtest1
Batch size : 16
Running Distributed: True ; GPU: 0 ; RANK: 0
Number of parameters : 92944451
ERROR:ignite.engine.engine.RelationformerTrainer:Current run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Current run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Current run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Engine run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Engine run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
ERROR:ignite.engine.engine.RelationformerTrainer:Engine run is terminating due to exception: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
Traceback (most recent call last):
  File "train.py", line 292, in <module>
    parallel.run(main, args)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/distributed/launcher.py", line 275, in run
    idist.spawn(self.backend, func, args=args, kwargs_dict=kwargs, **self._spawn_params)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/distributed/utils.py", line 323, in spawn
    comp_model_cls.spawn(
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/distributed/comp_models/native.py", line 304, in spawn
    start_processes(
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/distributed/comp_models/native.py", line 272, in _dist_worker_task_fn
    fn(local_rank, *args, **kw_dict)
  File "/home/ymf/dockerFile/relationformer/train.py", line 282, in main
    trainer.run()
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/monai/engines/trainer.py", line 56, in run
    super().run()
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/monai/engines/workflow.py", line 250, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 702, in run
    return self._internal_run()
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 775, in _internal_run
    self._handle_exception(e)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
    raise e
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 745, in _internal_run
    time_taken = self._run_once_on_dataset()
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 850, in _run_once_on_dataset
    self._handle_exception(e)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 469, in _handle_exception
    raise e
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/ignite/engine/engine.py", line 833, in _run_once_on_dataset
    self.state.output = self._process_function(self, self.state.batch)
  File "/home/ymf/dockerFile/relationformer/trainer.py", line 40, in _iteration
    h, out = self.network(images)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 711, in forward
    output = self.module(*inputs, **kwargs)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ymf/dockerFile/relationformer/models/relationformer_2D.py", line 108, in forward
    features, pos = self.backbone(samples)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ymf/dockerFile/relationformer/models/deformable_detr_backbone.py", line 117, in forward
    xs = self[0](tensor_list)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ymf/dockerFile/relationformer/models/deformable_detr_backbone.py", line 84, in forward
    xs = self.body(tensor_list.tensors)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torchvision/models/_utils.py", line 63, in forward
    x = module(x)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/data/anaconda3/envs/ymf_rel38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Do you have any clue about this error and how to fix it? Thanks!

Bartopt commented 1 year ago

I have the same problem. How did you fix it?

MeklitMa commented 11 months ago

@Bartopt @Yassin-fan Hi, did anyone fix this problem? If you did, please share how you fixed it. Thank you.

tyxtyxtyxtyx commented 10 months ago

@Bartopt @Yassin-fan @MeklitMa Hi everyone, please let me know if you fixed it. Thanks very much!

incredibledays commented 6 months ago

@Bartopt @tyxtyxtyxtyx @Yassin-fan @MeklitMa @suprosanna Hi, can anyone give me the source code? Thank you very much! My email is zaoyifan@buaa.edu.cn

JoseLuisNeves commented 4 months ago

In trainer.py, I moved self.network to the CUDA device just before line 41 (h, out = self.network(images)), which is where the error was coming from. @Bartopt @tyxtyxtyxtyx @Yassin-fan @incredibledays @MeklitMa
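
For anyone who wants to see the idea in isolation, here is a minimal sketch of that kind of fix. It is not the repository's code: the `nn.Conv2d` layer is only a stand-in for the real network, and `move_to_input_device` is a hypothetical helper name. The point is that the module's weights have to live on the same device as the input before the forward call.

```python
import torch
import torch.nn as nn


def move_to_input_device(network: nn.Module, images: torch.Tensor) -> nn.Module:
    """Hypothetical helper: put the module's weights on the device of `images`."""
    return network.to(images.device)


# Stand-in for the backbone's first convolution (assumption, not the real model).
# Freshly built modules keep their weights on the CPU.
network = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# Use the GPU when one is available; fall back to CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
images = torch.randn(2, 3, 32, 32, device=device)

# If this step is skipped while `images` is on CUDA, the forward call fails with
# "Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same".
network = move_to_input_device(network, images)

out = network(images)
weight_device = next(network.parameters()).device
```

In the actual trainer the equivalent step would be something like calling `.to(...)` on `self.network` with the device of `images` before the forward pass. Moving the model once at setup time, rather than on every iteration, is probably the cleaner place for it.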