sail-sg / Agent-Smith

[ICML 2024] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
https://sail-sg.github.io/Agent-Smith/
MIT License

TypeError: empty() missing 1 required positional arguments: "size" #5

Open durenajafamjad opened 4 days ago

durenajafamjad commented 4 days ago

```
Distributed environment: FSDP  Backend: nccl
Num processes: 4  Process index: 3  Local process index: 3  Device: cuda:0
Mixed precision type: bf16
(identical startup output repeated for process indices 0, 1, and 2)

Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 8.34it/s]
(interleaved checkpoint-loading progress bars from the four processes omitted)
Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens.
(warning repeated once per process)

rank1: Traceback (most recent call last):
rank1:   File "/gpfs/home3/scur2844/Agent-Smith/attack/optimize.py", line 574, in <module>
rank1:   File "/gpfs/home3/scur2844/Agent-Smith/attack/optimize.py", line 415, in main
rank1:     clip_model = CLIPModel.from_pretrained(args.rag, torch_dtype=dtype)
rank1:   File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
rank1:     ) = cls._load_pretrained_model(
rank1:   File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4725, in _load_pretrained_model
rank1:     model_to_load, key, "cpu", torch.empty(*param.size(), dtype=dtype)
rank1: TypeError: empty() missing 1 required positional arguments: "size"
(identical tracebacks from rank2 and rank3 omitted)

rank1:[W1120 23:01:41.506008594 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
(same warning repeated for rank2 and rank3)

W1120 23:01:42.439000 979934 /gpfs/home3/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 979970 closing signal SIGTERM
E1120 23:01:42.754000 979934 /gpfs/home3/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 979971) of binary: /home/scur2844/.conda/envs/agentsmith/bin/python
Traceback (most recent call last):
  File "/home/scur2844/.conda/envs/agentsmith/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1155, in launch_command
    multi_gpu_launcher(args)
  File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/scur2844/.conda/envs/agentsmith/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

attack/optimize.py FAILED

Failures:
[1]:
  time       : 2024-11-20_23:01:42
  host       : gcn15.local.snellius.surf.nl
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 979972)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-11-20_23:01:42
  host       : gcn15.local.snellius.surf.nl
  rank       : 3 (local_rank: 3)
  exitcode   : 1 (pid: 979973)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-11-20_23:01:42
  host       : gcn15.local.snellius.surf.nl
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 979971)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

Hi, thank you for writing this super interesting paper. While trying to reproduce the results, I keep encountering this error, specifically when `--num_processes=4`. With `--num_processes=1` it works, but I run out of memory quite early in training. Please let me know if there is anything I can do to fix it. Thank you for your time and cooperation!
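For reference, the failing call from the traceback can be reproduced in isolation. A minimal sketch of what I believe is happening (assuming a 0-dim parameter such as CLIP's `logit_scale` is the trigger; that specific name is my guess, not something the traceback confirms):

```python
import torch

# CLIP checkpoints contain 0-dim ("scalar") parameters; logit_scale is the
# usual example (assumed here -- the traceback does not name the parameter).
param = torch.nn.Parameter(torch.tensor(2.6592))
print(param.size())  # torch.Size([]) -- an empty shape

# The line in transformers' _load_pretrained_model unpacks the shape into
# positional args, so for a 0-dim param torch.empty(*param.size(), ...)
# becomes torch.empty(dtype=...), which lacks a size and raises TypeError.
try:
    torch.empty(*param.size(), dtype=torch.bfloat16)
except TypeError as e:
    print(f"TypeError: {e}")

# Passing the Size object without unpacking accepts any shape, including
# 0-dim, and returns a scalar tensor instead of crashing.
t = torch.empty(param.size(), dtype=torch.bfloat16)
print(t.dim())  # 0
```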

guxm2021 commented 9 hours ago

Sorry for my late response. Could you please check your environment, especially the versions of PyTorch, transformers, tokenizers, and accelerate? This error seems to come from a compatibility issue between loading the pre-trained CLIP model and FSDP in accelerate. Since it has been a while since this repo's initial release, we need some time to check version compatibility.
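A quick, generic way to print those versions (nothing here is specific to this repo):

```python
# Print installed versions of the packages most likely involved; packages
# that are missing are reported rather than raising an exception.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("torch", "transformers", "tokenizers", "accelerate"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```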

durenajafamjad commented 3 hours ago

Thank you for your reply. I was able to fix it by adjusting the package versions.