Hello,
first of all, thank you for your extensive code base!
I implemented InterFuser and loaded the dataset from LMDrive. I modified the data-loading process for the RGB images and the folder structure. When I start training, I get the following error:
```
FutureWarning,
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Added key: store_based_barrier_key:1 to store for rank: 1
Added key: store_based_barrier_key:1 to store for rank: 0
Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 0, total 2.
Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Training in distributed mode with multiple processes, 1 GPU per process. Process 1, total 2.
Model resnet101 created, param count:44549160
Data processing configuration for current model + dataset:
        input_size: (3, 224, 224)
        interpolation: bicubic
        mean: (0.485, 0.456, 0.406)
        std: (0.229, 0.224, 0.225)
        crop_pct: 0.875
AMP not enabled. Training in float32.
Using native Torch DistributedDataParallel.
Scheduled epochs: 210
Sub route dir nums: 3011006
Sub route dir nums: 3011006
Sub route dir nums: 43120
Sub route dir nums: 43120
Traceback (most recent call last):
  File "train.py", line 1844, in <module>
    main()
  File "train.py", line 1241, in main
    mixup_fn=mixup_fn,
  File "train.py", line 1365, in train_one_epoch
    output = model(input)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/beegfs/work_fast/stelzer/InterFuser/interfuser/timm/models/resnet.py", line 980, in forward
    x = self.forward_features(x)
  File "/beegfs/work_fast/stelzer/InterFuser/interfuser/timm/models/resnet.py", line 968, in forward_features
    x = self.conv1(x)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 460, in _conv_forward
    self.padding, self.dilation, self.groups)
TypeError: conv2d() received an invalid combination of arguments - got (dict, Parameter, NoneType, tuple, tuple, tuple, int), but expected one of:
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (dict, Parameter, NoneType, tuple, tuple, tuple, int)
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (dict, Parameter, NoneType, tuple, tuple, tuple, int)
```
(Rank 1 prints the identical traceback.)
```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2777253) of binary: /home/stelzer/work/anaconda3/envs/interfuser/bin/python3
Traceback (most recent call last):
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/stelzer/work/anaconda3/envs/interfuser/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
```
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-06-17_16:12:26
  host      : gpu07
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2777254)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-17_16:12:26
  host      : gpu07
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2777253)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
scripts/train.sh: line 6: --model: command not found
```
I get the same error message when I simply call `output = model(3)`, for example. Do you have any idea where this error comes from?
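For reference, I can reproduce the same `TypeError` outside the training loop with a minimal sketch (my own snippet, not from the repo): `nn.Conv2d` only accepts a `Tensor`, so passing anything else, such as a dict batch that was never unpacked into a tensor, fails exactly like the traceback above.

```python
import torch
import torch.nn as nn

# Same shape as the ResNet stem conv (conv1) from the traceback
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

# A proper image batch works fine:
x = torch.randn(2, 3, 224, 224)
print(conv(x).shape)  # torch.Size([2, 64, 112, 112])

# Passing a dict instead of a Tensor reproduces the error,
# "conv2d() received an invalid combination of arguments - got (dict, ...)":
try:
    conv({"rgb": x})
except TypeError as e:
    print("TypeError:", e)
```

So it looks like whatever reaches `model(input)` in `train_one_epoch` is a dict rather than a stacked image tensor.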
Thank you!