Hello,
first of all, thank you for your extensive code base!
I implemented InterFuser and loaded the dataset from LMDrive. I modified the data-loading process for the RGB images and the folder structure. When I start training, I get the following error:
```
FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 2.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 2.
Model resnet101 created, param count:44549160
Data processing configuration for current model + dataset:
        input_size: (3, 224, 224)
        interpolation: bicubic
        mean: (0.485, 0.456, 0.406)
        std: (0.229, 0.224, 0.225)
        crop_pct: 0.875
AMP not enabled. Training in float32.
Using native Torch DistributedDataParallel.
Scheduled epochs: 210
Sub route dir nums: 3011006
Sub route dir nums: 3011006
Sub route dir nums: 43120
Sub route dir nums: 43120
Traceback (most recent call last):
  File "train.py", line 1844, in <module>
    main()
  File "train.py", line 1241, in main
    mixup_fn=mixup_fn,
  File "train.py", line 1365, in train_one_epoch
    output = model(input)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/beegfs/work_fast/stelzer/InterFuser/interfuser/timm/models/resnet.py", line 980, in forward
    x = self.forward_features(x)
  File "/beegfs/work_fast/stelzer/InterFuser/interfuser/timm/models/resnet.py", line 968, in forward_features
    x = self.conv1(x)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 460, in _conv_forward
    self.padding, self.dilation, self.groups)
TypeError: conv2d() received an invalid combination of arguments - got (dict, Parameter, NoneType, tuple, tuple, tuple, int), but expected one of:
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (dict, Parameter, NoneType, tuple, tuple, tuple, int)
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (dict, Parameter, NoneType, tuple, tuple, tuple, int)
```
(Rank 1 prints the identical traceback.)
```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2777253) of binary: /home/stelzer/work/anaconda3/envs/interfuser/bin/python3
Traceback (most recent call last):
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
```
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-17_16:12:26
  host      : gpu07
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2777254)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-17_16:12:26
  host      : gpu07
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2777253)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
scripts/train.sh: line 6: --model: command not found
```
I get the same error message when I simply call `output = model(3)`, for example. Do you have any idea where this error comes from?
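For reference, I can reproduce the same `TypeError` outside the training loop with a minimal sketch (my own snippet, not from the repo): `nn.Conv2d` only accepts a `Tensor`, so passing anything else, such as a dict batch that was never unpacked into a tensor, fails exactly like the traceback above.

```python
import torch
import torch.nn as nn

# Same shape as the ResNet stem conv (conv1) from the traceback
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

# A proper image batch works fine:
x = torch.randn(2, 3, 224, 224)
print(conv(x).shape)  # torch.Size([2, 64, 112, 112])

# Passing a dict instead of a Tensor reproduces the error,
# "conv2d() received an invalid combination of arguments - got (dict, ...)":
try:
    conv({"rgb": x})
except TypeError as e:
    print("TypeError:", e)
```

So it looks like whatever reaches `model(input)` in `train_one_epoch` is a dict rather than a stacked image tensor.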
Thank you!