huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

AttributeError: 'Accelerator' object has no attribute 'deepspeed_config' #1845

Closed thyywr759 closed 1 year ago

thyywr759 commented 1 year ago

System Info

Describe the bug
At on_train_end, an AttributeError is raised: 'Accelerator' object has no attribute 'deepspeed_config'

To Reproduce
None

Expected behavior

ds_report output
[2023-08-14 18:02:42,266] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

Screenshots
Traceback (most recent call last):
  File "main.py", line 430, in <module>
    main()
  File "/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "main.py", line 374, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/transformers/trainer.py", line 1971, in _inner_training_loop
    self.control = self.callback_handler.on_train_end(args, self.state, self.control)
  File "/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/transformers/trainer_callback.py", line 356, in on_train_end
    return self.call_event("on_train_end", args, state, control)
  File "/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/transformers/trainer_callback.py", line 397, in call_event
    result = getattr(callback, event)(
  File "/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/transformers/integrations.py", line 770, in on_train_end
    fake_trainer.save_model(temp_dir)
  File "/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/transformers/trainer.py", line 2758, in save_model
    state_dict = self.accelerator.get_state_dict(self.deepspeed)
  File "/home/maojianguo/anaconda3/envs/mjg_torch2.0.1/lib/python3.8/site-packages/accelerate/accelerator.py", line 2829, in get_state_dict
    if self.deepspeed_config["zero_optimization"]["stage"] == 3:
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'
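
For context, the attribute access that fails here can be probed outside the Trainer. Below is a minimal standalone sketch, assuming the same environment as above; the exact condition under which `deepspeed_config` gets attached to the `Accelerator` is assumed here, not stated in the traceback.

```python
# Minimal probe sketch: check whether the Accelerator created in this
# environment carries the attribute that get_state_dict() dereferences
# in the traceback above.
from accelerate import Accelerator

accelerator = Accelerator()
print(hasattr(accelerator, "deepspeed_config"))   # False matches the AttributeError above
print(accelerator.state.deepspeed_plugin)         # None unless a DeepSpeed plugin is configured
```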

System info (please complete the following information):

OS: Ubuntu 18.04
GPU count and types: one machine with 8x A800 GPUs
Python version: 3.8
transformers: 4.31.0
deepspeed: 0.10.0
accelerate: 2023.7.18.dev1
Launcher context
{
  "train_micro_batch_size_per_gpu": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true,
      "buffer_count": 4,
      "fast_init": false
    }
  },
  "gradient_accumulation_steps": "auto",
  "steps_per_print": "auto",
  "bf16": {
    "enabled": "auto"
  }
}
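
For reference (an assumption about how this config is consumed, not something stated above): a DeepSpeed JSON file like this is typically handed to the HF `Trainer` through the `deepspeed` field of `TrainingArguments`, with the `"auto"` entries resolved from the training arguments. A minimal sketch with hypothetical file and directory names:

```python
# Hedged sketch: wiring a DeepSpeed JSON config into the HF Trainer.
# "ds_config.json" is a hypothetical file holding the JSON above;
# "output" is likewise a placeholder directory.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    deepspeed="ds_config.json",  # path to the DeepSpeed config shown above
    fp16=True,                   # fills the "auto" value in the fp16 section
)
```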

Reproduction

above

Expected behavior

above

sgugger commented 1 year ago

cc @pacman100

pacman100 commented 1 year ago

Hello, could you share which version of the Hugging Face Accelerate library you are using? Alternatively, please install the latest version of Accelerate via `pip install accelerate` and check whether the issue still occurs.

pacman100 commented 1 year ago

Also, if the issue persists, please share a minimal example that we can run to debug it.

thyywr759 commented 1 year ago

When I remove the environment variable `export WANDB_LOG_MODEL=true`, the problem is solved.

This suggests that the error is not tied to a specific project, but rather to the wandb integration.
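
A sketch of that workaround in code, assuming the variable would otherwise be set in the shell environment; `WANDB_LOG_MODEL` is what makes the wandb callback re-save the model at the end of training, which is the call chain shown in the traceback above:

```python
import os

# Hedged workaround sketch: make sure the wandb callback does not try to
# upload the model at on_train_end, the path that reaches
# Accelerator.get_state_dict() in the traceback. Set this before the
# Trainer is created.
os.environ["WANDB_LOG_MODEL"] = "false"   # or: os.environ.pop("WANDB_LOG_MODEL", None)
```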

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ronyadgar commented 1 year ago

I have the same issue. Attached are my dependency list, nvidia-smi output, and accelerate config. I am using an AWS g5x48 instance with the Deep Learning Base GPU AMI (Ubuntu 20.04) 20230926.

```bash

(ft-llm) ubuntu@ip-172-31-89-151:~/llm-ft/falcon$ accelerate launch main2.py
/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/utils/dataclasses.py:641: UserWarning: DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.
  warnings.warn("DeepSpeed Zero3 Init flag is only applicable for ZeRO Stage 3. Setting it to False.")
[2023-10-18 12:59:10,836] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-18 12:59:12,294] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-10-18 12:59:12,294] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
fsdp_plugin
FullyShardedDataParallelPlugin(sharding_strategy=<ShardingStrategy.FULL_SHARD: 1>, backward_prefetch=None, mixed_precision_policy=None, auto_wrap_policy=None, cpu_offload=CPUOffload(offload_params=False), ignored_modules=None, state_dict_type=<StateDictType.FULL_STATE_DICT: 1>, state_dict_config=FullStateDictConfig(offload_to_cpu=False, rank0_only=False), optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=False, rank0_only=False), limit_all_gathers=False, use_orig_params=False, param_init_fn=<function FullyShardedDataParallelPlugin.__post_init__.<locals>.<lambda> at 0x7f184df4a0d0>, sync_module_states=True, forward_prefetch=False, activation_checkpointing=False)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:05<00:00,  2.94s/it]
Traceback (most recent call last):
  File "/home/ubuntu/llm-ft/falcon/main2.py", line 41, in <module>
    state_dict=accelerator.get_state_dict(model)
  File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/accelerator.py", line 3060, in get_state_dict
    if self.deepspeed_config["zero_optimization"]["stage"] == 3:
AttributeError: 'Accelerator' object has no attribute 'deepspeed_config'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 36540) of binary: /home/ubuntu/anaconda3/envs/ft-llm/bin/python
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/ft-llm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 971, in launch_command
    deepspeed_launcher(args)
  File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/anaconda3/envs/ft-llm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main2.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-18_13:00:03
  host      : ip-172-31-89-151.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 36540)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

Simple code to reproduce

```python

from accelerate import Accelerator
from transformers import AutoModelForCausalLM
from accelerate import FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=False, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=False, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct"
)
model.gradient_checkpointing_enable()
accelerator.wait_for_everyone()

unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "./lora_test2",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model)
)
```

Attachments: default_config.yaml.txt, nvidia_smi.txt, conda_list.txt
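
A debugging note on the repro above, offered as a guess rather than a confirmed diagnosis: the launcher output goes through `deepspeed_launcher(args)` and prints a DeepSpeed ZeRO-3 warning, while the script configures an FSDP plugin, so the saved `accelerate` config and the script may be selecting different backends. A small sketch to check which backend the launched `Accelerator` actually picked:

```python
# Hedged debugging sketch: print the backend the launched Accelerator selected.
# If it reports DEEPSPEED while the script only sets up an FSDP plugin, the
# accelerate config and the script disagree, which could send get_state_dict()
# down the DeepSpeed branch seen in the traceback.
from accelerate import Accelerator, FullyShardedDataParallelPlugin

accelerator = Accelerator(fsdp_plugin=FullyShardedDataParallelPlugin())
print(accelerator.distributed_type)        # e.g. DistributedType.DEEPSPEED vs. DistributedType.FSDP
print(accelerator.state.deepspeed_plugin)  # not None when the accelerate config enables DeepSpeed
```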