tianrun-chen / SAM-Adapter-PyTorch

Adapting Meta AI's Segment Anything to Downstream Tasks with Adapters and Prompts
MIT License

train not succeed #15

Open skycat88 opened 1 year ago

skycat88 commented 1 year ago

size mismatch for image_encoder.blocks.23.mlp.lin1.weight: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
size mismatch for image_encoder.blocks.23.mlp.lin1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([5120]).
size mismatch for image_encoder.blocks.23.mlp.lin2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
size mismatch for image_encoder.blocks.23.mlp.lin2.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([1280]).
size mismatch for image_encoder.neck.0.weight: copying a param with shape torch.Size([256, 1024, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 1280, 1, 1]).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3207998) of binary: /home/syy/anaconda3/envs/SAM_Adapter/bin/python
Traceback (most recent call last):
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/syy/anaconda3/envs/SAM_Adapter/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
train.py FAILED

Failures:
  [1]: time: 2023-04-24_19:02:47, host: vip, rank: 1 (local_rank: 1), exitcode: 1 (pid: 3208003), error_file: <N/A>, traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [2]: time: 2023-04-24_19:02:47, host: vip, rank: 2 (local_rank: 2), exitcode: 1 (pid: 3208005), error_file: <N/A>, traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [3]: time: 2023-04-24_19:02:47, host: vip, rank: 3 (local_rank: 3), exitcode: 1 (pid: 3208011), error_file: <N/A>, traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
  [0]: time: 2023-04-24_19:02:47, host: vip, rank: 0 (local_rank: 0), exitcode: 1 (pid: 3207998), error_file: <N/A>, traceback: To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

(SAM_Adapter) syy@vip:~/code/data_auto/SAM-Adapter-PyTorch$ python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 train.py --config configs/demo.yaml

1. The environment and package versions are set up as required. The loadddptrain.py mentioned in the README does not exist, so I used train.py instead.
2. The downloaded data is the camouflaged object detection (CAMO) set. Are there any other requirements for data processing? The training data consists only of the camouflage detection data below, 1500 images from CAMO-COCO-V.1.0-CVIU2019\Camouflage\Images plus the GT images.
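The size mismatch above says the checkpoint holds a 1024-wide image encoder (ViT-L) while the configured model expects a 1280-wide one (ViT-H). A minimal sketch for checking which variant a SAM checkpoint actually contains, assuming one of the official .pth files (which are plain state dicts; the filename below is only an example):

```python
import torch

# The image-encoder width identifies the SAM variant:
# 768 -> ViT-B, 1024 -> ViT-L, 1280 -> ViT-H.
ckpt = torch.load("sam_vit_l_0b3195.pth", map_location="cpu")  # example path
embed_dim = ckpt["image_encoder.neck.0.weight"].shape[1]       # weight shape is [256, embed_dim, 1, 1]
print("checkpoint image-encoder embed_dim:", embed_dim)
```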

feijifei commented 1 year ago

I have the same issue.

laiyingxin2 commented 1 year ago

Same question

Chukuanren commented 1 year ago

I have the same issue.

Has this been solved?

huizhang0110 commented 1 year ago

same question!

85zhanghao commented 1 year ago

same question!

Darren759 commented 1 year ago

you can try: CUDA_VISIBLE_DEVICES=8,9,10,11 python -m torch.distributed.launch train.py

yPanStupidog commented 1 year ago

@skycat88 You should change the config .yaml file in ./configs/; specifically, the pretrained weight should be the ViT-H ("h") checkpoint rather than the ViT-L ("l") one.
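As a sketch of the same point using the official segment_anything API (the SAM-Adapter repo selects the backbone through its .yaml config instead, so this only illustrates the underlying requirement): the model builder and the checkpoint must name the same variant, otherwise load_state_dict raises exactly the size mismatches shown above.

```python
from segment_anything import sam_model_registry

# The builder key and the checkpoint file must describe the same backbone.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")    # ViT-H model + ViT-H weights
# sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")  # or ViT-L model + ViT-L weights
```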

Chenjuanwen commented 1 year ago

I think the train command should be: CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nnodes 1 --nproc_per_node 4 train.py --config [CONFIG_PATH]

lzn12345 commented 1 year ago

How did you solve it? Does it require four GPUs? If I only have one, how should I change the command? Thank you all.
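(If only one GPU is available, a likely adaptation of the command above, not verified here, is to expose a single device and set --nproc_per_node to 1: CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nnodes 1 --nproc_per_node 1 train.py --config configs/demo.yaml)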

1289600760 commented 1 year ago

same question!

tianrun-chen commented 1 year ago

Please check whether you have changed the config .yaml file in ./configs/ to point to the correct SAM checkpoint file.

lydmom commented 1 year ago

I resolved the same exception by installing the packages strictly according to the requirements and modifying the config.

theneao commented 1 year ago

@skycat88 You should change the config .yaml file in ./configs/; specifically, the pretrained weight should be h rather than l.

That's right; after I changed it, training ran.

almighty79251 commented 11 months ago

I have the same issue.

Has this been solved?

Same question here. Has this been solved?

guokeqianhg commented 4 months ago

How did you solve it? Does it require four GPUs? If I only have one, how should I change the command? Thank you all.

May I ask whether you have resolved the problem?

1289600760 commented 4 months ago

I have the same issue.

Has this been solved?

Same question here. Has this been solved?

Maybe? I can use DDP now; I resolved the problem by resetting my Ubuntu system, but I am no longer working on this program.