Open durenajafamjad opened 4 days ago
Sorry for my late response. Could you please check your environment, especially the versions of PyTorch, transformers, tokenizers, and accelerate? This error seems to be caused by a compatibility issue between loading the pre-trained CLIP model and FSDP in accelerate. Since it has been a while since this repo's initial release, we need some time to check the compatibility.
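For reference, a quick way to print those versions from inside the environment (a minimal sketch; importlib.metadata is in the standard library from Python 3.8 onward):

```python
# Print the package versions relevant to this issue.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("torch", "transformers", "tokenizers", "accelerate"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```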
Thank you for your reply. I was able to fix it by adjusting the versions.
Distributed environment: FSDP Backend: nccl Num processes: 4 Process index: 3 Local process index: 3 Device: cuda:0
Mixed precision type: bf16
Distributed environment: FSDP Backend: nccl Num processes: 4 Process index: 0 Local process index: 0 Device: cuda:0
Mixed precision type: bf16
Distributed environment: FSDP Backend: nccl Num processes: 4 Process index: 1 Local process index: 1 Device: cuda:0
Mixed precision type: bf16
Distributed environment: FSDP Backend: nccl Num processes: 4 Process index: 2 Local process index: 2 Device: cuda:0
Mixed precision type: bf16
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, ~8it/s] (on all four ranks)
Loading checkpoint shards: 33%|███▎ | 1/3 [00:01<00:02, 1.09s/it]
Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens. (printed by three ranks)
rank1: Traceback (most recent call last):
rank1: File "/gpfs/home3/scur2844/Agent-Smith/attack/optimize.py", line 574, in <module>
rank1: File "/gpfs/home3/scur2844/Agent-Smith/attack/optimize.py", line 415, in main
rank1: clip_model = CLIPModel.from_pretrained(args.rag, torch_dtype=dtype)
rank1: File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
rank1: ) = cls._load_pretrained_model(
rank1: File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4725, in _load_pretrained_model
rank1: model_to_load, key, "cpu", torch.empty(*param.size(), dtype=dtype)
rank1: TypeError: empty() missing 1 required positional arguments: "size"
rank2: TypeError: empty() missing 1 required positional arguments: "size" (identical traceback to rank1)
rank3: TypeError: empty() missing 1 required positional arguments: "size" (identical traceback to rank1)
rank1:[W1120 23:01:41.506008594 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
rank2:[W1120 23:01:41.570999044 ProcessGroupNCCL.cpp:1250] (same ProcessGroupNCCL warning as rank1)
rank3:[W1120 23:01:42.640619630 ProcessGroupNCCL.cpp:1250] (same ProcessGroupNCCL warning as rank1)
W1120 23:01:42.439000 979934 /gpfs/home3/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 979970 closing signal SIGTERM
E1120 23:01:42.754000 979934 /gpfs/home3/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 979971) of binary: /home/scur2844/.conda/envs/agentsmith/bin/python
Traceback (most recent call last):
File "/home/scur2844/.conda/envs/agentsmith/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1155, in launch_command
multi_gpu_launcher(args)
File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
attack/optimize.py FAILED
Failures:
[1]:
time : 2024-11-20_23:01:42
host : gcn15.local.snellius.surf.nl
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 979972)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-11-20_23:01:42
host : gcn15.local.snellius.surf.nl
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 979973)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-11-20_23:01:42
host : gcn15.local.snellius.surf.nl
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 979971)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Hi, thank you for writing this super interesting paper. As I try to reproduce the results, I keep encountering this error, particularly when --num_processes=4. With --num_processes=1 it works, but I run out of memory quite early in training. Please let me know if there is anything I can do to fix it. Thank you for your time and cooperation!
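For what it's worth, the TypeError in the log is consistent with the version-compatibility diagnosis above: the failing call is torch.empty(*param.size(), dtype=dtype), so if param is 0-dimensional (its size is torch.Size([]); CLIP's scalar logit_scale parameter is a plausible candidate), unpacking the empty shape leaves torch.empty() with no size argument at all. A minimal sketch of that failure mode, independent of FSDP and of this repo:

```python
import torch

# A scalar (0-dim) parameter has an empty shape, torch.Size([]).
scalar = torch.nn.Parameter(torch.tensor(0.5))
print(scalar.size())  # torch.Size([])

# Unpacking the empty shape calls torch.empty() with no size at all,
# which reproduces the TypeError from the log above.
try:
    torch.empty(*scalar.size(), dtype=torch.bfloat16)
except TypeError as err:
    print(err)  # empty() missing 1 required positional arguments: "size"
```

Whether this path is actually hit depends on how transformers materializes parameters under accelerate's FSDP launch, which would explain why adjusting the versions of torch, transformers, tokenizers, and accelerate (as reported above) resolves it.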