Information

The problem arises in chapter:

To Reproduce

Steps to reproduce the behavior:
```
$ accelerate launch codeparrot_training.py
2023-05-10 11:45:58.950271: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:44970.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:44986.
2023-05-10 11:46:03.976884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2023-05-10 11:46:03.993856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

Traceback (most recent call last):
  File "/mnt/ssd2/tf_gpu_docker/ground0/git_repo/codeparrot/codeparrot_training.py", line 115, in <module>
    accelerator = Accelerator(dispatch_batches=True)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/accelerator.py", line 358, in __init__
    self.state = AcceleratorState(
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py", line 535, in __init__
    PartialState(cpu, **kwargs)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py", line 130, in __init__
    self.backend = kwargs.pop("backend")
KeyError: 'backend'
```

Both worker processes print the identical traceback and `KeyError: 'backend'`, after which the launcher itself fails:

```
[11:46:07] ERROR failed (exitcode: 1) local_rank: 0 (pid: 103878) of binary: /home/anaconda3/envs/lab/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/lab/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
ChildFailedError:
============================================================
codeparrot_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time       : 2023-05-10_11:46:07
  host       : YODA
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 103879)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-05-10_11:46:07
  host       : YODA
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 103878)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
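From the traceback, the immediate failure is at `accelerate/state.py:130`, where `PartialState.__init__` calls `kwargs.pop("backend")` with no default: the kwargs forwarded from `Accelerator(dispatch_batches=True)` evidently never contain a `backend` key, so every rank dies with `KeyError: 'backend'`. The snippet below is a minimal illustration of that pattern, not the library's actual code; it only shows why the error fires and how a `pop()` default would avoid it. If this is a bug in the installed `accelerate` version, upgrading (`pip install -U accelerate`) may be the real fix.

```python
# Minimal illustration of the failing pattern (NOT accelerate's real code).
# PartialState.__init__ effectively does `kwargs.pop("backend")`, which
# raises KeyError when the caller never supplied that key.

def init_state(**kwargs):
    backend = kwargs.pop("backend")  # no default -> KeyError if key absent
    return backend

def init_state_defensive(**kwargs):
    # Defensive variant: fall back to NCCL, the usual multi-GPU backend.
    backend = kwargs.pop("backend", "nccl")
    return backend

try:
    init_state(cpu=False)  # mirrors the report: no 'backend' in kwargs
except KeyError as exc:
    print(f"KeyError: {exc}")  # -> KeyError: 'backend'

print(init_state_defensive(cpu=False))  # -> nccl
```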
My accelerate config file:
```
$ cat /home/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: '[0,1]'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
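Possibly unrelated to the `KeyError`, but the `gpu_ids` value looks unusual: `accelerate config` normally writes a plain comma-separated string such as `0,1` (or `all`), not a bracketed list. A hypothetical cleaned-up config for this two-GPU machine would look like the sketch below; every value other than `gpu_ids` is unchanged from the file above.

```yaml
# Hypothetical cleaned-up default_config.yaml (assumption: two local GPUs;
# only gpu_ids differs from the original file).
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1            # was '[0,1]'; accelerate expects e.g. 0,1 or all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```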
My GPU details:
ā¦ š 12:14:05 āÆ nvidia-smi Wed May 10 12:15:24 2023 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 NVIDIA GeForce ... Off | 00000000:09:00.0 Off | N/A | | 36% 32C P8 1W / 250W | 10MiB / 11264MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 1 NVIDIA GeForce ... Off | 00000000:42:00.0 On | N/A | | 36% 36C P8 17W / 250W | 418MiB / 11264MiB | 9% Default | | | | N/A | +-------------------------------+----------------------+----------------------+
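As a quick sanity check before involving `accelerate` at all, it may help to confirm that PyTorch itself sees both GPUs and has NCCL available (a minimal sketch, run in the training environment's Python):

```python
# Sanity check: PyTorch should see 2 CUDA devices and the NCCL backend,
# which the MULTI_GPU setup relies on.
import torch

print(torch.__version__)                      # installed torch version
print(torch.cuda.is_available())              # expect: True
print(torch.cuda.device_count())              # expect: 2 on this machine
print(torch.distributed.is_nccl_available())  # expect: True
```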
Expected behavior

I want to run the code on multiple GPUs on a single node/machine. @lewtun, @sgugger, can you help me run this?
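One thing that may be worth trying while debugging is passing the launch options explicitly on the command line instead of relying on the cached `default_config.yaml` (these are standard `accelerate launch` flags; a sketch, not a confirmed fix):

```sh
# Explicit two-process multi-GPU launch on one machine, bypassing the
# cached accelerate config.
accelerate launch --multi_gpu --num_processes 2 --mixed_precision fp16 \
    codeparrot_training.py
```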