Closed: thyywr759 closed this issue 1 year ago
cc @pacman100
Hello, can you share which version of the Hugging Face Accelerate library you are using? Alternatively, please install the latest version of Accelerate via `pip install accelerate` and check if the issue still remains.
Also, please share a minimal example that we can run to debug if the issue persists.
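For reference, a quick way to print the installed version to include in the report (a minimal snippet, nothing project-specific assumed):

```python
# Print the installed Accelerate version
import accelerate

print(accelerate.__version__)
```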
When I remove the environment variable `export WANDB_LOG_MODEL=true`, the problem is solved.
This suggests it has nothing to do with a specific project, but rather with wandb.
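For anyone else hitting this, the same workaround can be applied from inside the script rather than the shell; a minimal sketch, assuming the variable would otherwise be inherited from the environment:

```python
import os

# Remove WANDB_LOG_MODEL before training starts, so wandb does not
# attempt to upload model checkpoints as artifacts.
os.environ.pop("WANDB_LOG_MODEL", None)
```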
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I have the same issue. Attaching dependencies, `nvidia-smi` output, and accelerate config. Using an AWS g5.48xlarge instance with the Deep Learning Base GPU AMI (Ubuntu 20.04) 20230926.
```bash
(ft-llm) ubuntu@ip-172-31-89-151:~/llm-ft/falcon$ accelerate launch main2.py
/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:641: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
[2023-10-18 12:59:10,836] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-18 12:59:12,294] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-18 12:59:12,294] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
fsdp_plugin
FullyShardedDataParallelPlugin(sharding_strategy=<ShardingStrategy.FULL_SHARD: 1>, backward_prefetch=None, mixed_precision_policy=None, auto_wrap_policy=None, cpu_offload=CPUOffload(offload_params=False), ignored_modules=None, state_dict_type=<StateDictType.FULL_STATE_DICT: 1>, state_dict_config=FullStateDictConfig(offload_to_cpu=False, rank0_only=False), optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=False, rank0_only=False), limit_all_gathers=False, use_orig_params=False, param_init_fn=<function FullyShardedDataParallelPlugin.__post_init__.<locals>.<lambda> at 0x7f184df4a0d0>, sync_module_states=True, forward_prefetch=False, activation_checkpointing=False)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00, 2.94s/it]
Traceback (most recent call last):
File "/home/ubuntu/llm-ft/falcon/main2.py", line 41, in <module>
state_dict=accelerator.get_state_dict(model)
File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/accelerator.py", line 3060, in get_state_dict
if self.deepspeed_config["zero_optimization"]["stage"] == 3:
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 36540) of binary: /home/ubuntu/anaconda3/envs/ft-llm/bin/python
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/ft-llm/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 971, in launch_command
deepspeed_launcher(args)
File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
distrib_run.run(args)
File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main2.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-10-18_13:00:03
host : ip-172-31-89-151.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 36540)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

Simple code to reproduce:
```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from transformers import AutoModelForCausalLM
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=False, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=False, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")
model.gradient_checkpointing_enable()

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "./lora_test2",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)
```
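Note that the traceback above goes through `deepspeed_launcher`, so the saved `accelerate launch` config appears to select DeepSpeed while the script only passes an FSDP plugin, and `get_state_dict` then takes the DeepSpeed branch and fails. Until the launch config is sorted out, a hedged user-side workaround (a sketch, not a fix for the underlying issue) could fall back to the unwrapped model's own state dict:

```python
# Hypothetical fallback: use Accelerate's helper when it works, otherwise
# take the state dict from the unwrapped model directly.
try:
    state_dict = accelerator.get_state_dict(model)
except AttributeError:
    # Raised here when the Accelerator has no deepspeed_config attribute.
    state_dict = accelerator.unwrap_model(model).state_dict()

unwrapped_model.save_pretrained(
    "./lora_test2",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=state_dict,
)
```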
System Info

Information

Tasks
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
Above.

Expected behavior
Above.