nlp-with-transformers / notebooks

Jupyter notebooks for the Natural Language Processing with Transformers book
https://transformersbook.com/
Apache License 2.0

ChildFailedError codeparrot_training.py FAILED KeyError: 'backend' #99

Open ishaansharma opened 1 year ago

ishaansharma commented 1 year ago

Information

The problem arises in chapter:

Describe the bug

Launching `codeparrot_training.py` with `accelerate launch` on a single machine with two GPUs crashes while constructing `Accelerator(dispatch_batches=True)`: `PartialState.__init__` calls `kwargs.pop("backend")` and raises `KeyError: 'backend'`, after which the launcher reports `ChildFailedError`. Full logs are below.

To Reproduce

Steps to reproduce the behavior:

  1. git clone https://huggingface.co/transformersbook/codeparrot
  2. cd codeparrot
  3. pip install -r requirements.txt
  4. wandb login
  5. accelerate config
  6. accelerate launch codeparrot_training.py
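Before launching, it may help to confirm the environment actually provides the packages the steps above rely on. A small stdlib-only check (the package names are just the ones used in the steps; nothing accelerate-specific is assumed):

```python
# Stdlib-only sanity check: report which of the packages used in the
# reproduction steps cannot be imported in the current environment.
import importlib.util

def missing(*packages):
    """Return the subset of `packages` that are not importable."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# An empty list means accelerate, torch and wandb are all installed.
print(missing("accelerate", "torch", "wandb"))
```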
$ accelerate launch codeparrot_training.py
2023-05-10 11:45:58.950271: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on[::ffff:127.0.0.1]:44970.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on[::ffff:127.0.0.1]:44986.
2023-05-10 11:46:03.976884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2023-05-10 11:46:03.993856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
ā•­ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ Traceback (most recent call last) ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā•®
ā”‚ /mnt/ssd2/tf_gpu_docker/ground0/git_repo/codeparrot/codeparrot_training.py:115 in     ā”‚
ā”‚ <module>                                                                              ā”‚
ā”‚                                                                                       ā”‚
ā”‚   112 ā”‚   return loss.item(), perplexity.item()                                       ā”‚
ā”‚   113                                                                                 ā”‚
ā”‚   114 # Accelerator                                                                   ā”‚
ā”‚ ā± 115 accelerator = Accelerator(dispatch_batches=True)                                ā”‚
ā”‚   116 acc_state = {str(k): str(v) for k, v in accelerator.state.__dict__.items()}     ā”‚
ā”‚   117 # Hyperparameters                                                               ā”‚
ā”‚   118 project_name = 'transformersbook/codeparrot'                                    ā”‚
ā”‚                                                                                       ā”‚
ā”‚ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/accelerator. ā”‚
ā”‚ py:358 in __init__                                                                    ā”‚
ā”‚                                                                                       ā”‚
ā”‚    355 ā”‚   ā”‚   ā”‚   ā”‚   ā”‚   ā”‚   self.fp8_recipe_handler = handler                      ā”‚
ā”‚    356 ā”‚   ā”‚                                                                          ā”‚
ā”‚    357 ā”‚   ā”‚   kwargs = self.init_handler.to_kwargs() if self.init_handler is not Non ā”‚
ā”‚ ā±  358 ā”‚   ā”‚   self.state = AcceleratorState(                                         ā”‚
ā”‚    359 ā”‚   ā”‚   ā”‚   mixed_precision=mixed_precision,                                   ā”‚
ā”‚    360 ā”‚   ā”‚   ā”‚   cpu=cpu,                                                           ā”‚
ā”‚    361 ā”‚   ā”‚   ā”‚   dynamo_plugin=dynamo_plugin,                                       ā”‚
ā”‚                                                                                       ā”‚
ā”‚ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py:535 ā”‚
ā”‚ in __init__                                                                           ā”‚
ā”‚                                                                                       ā”‚
ā”‚   532 ā”‚   ā”‚   if parse_flag_from_env("ACCELERATE_USE_CPU"):                           ā”‚
ā”‚   533 ā”‚   ā”‚   ā”‚   cpu = True                                                          ā”‚
ā”‚   534 ā”‚   ā”‚   if PartialState._shared_state == {}:                                    ā”‚
ā”‚ ā± 535 ā”‚   ā”‚   ā”‚   PartialState(cpu, **kwargs)                                         ā”‚
ā”‚   536 ā”‚   ā”‚   self.__dict__.update(PartialState._shared_state)                        ā”‚
ā”‚   537 ā”‚   ā”‚   self._check_initialized(mixed_precision, cpu)                           ā”‚
ā”‚   538 ā”‚   ā”‚   if not self.initialized:                                                ā”‚
ā”‚                                                                                       ā”‚
ā”‚ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py:130 ā”‚
ā”‚ in __init__                                                                           ā”‚
ā”‚                                                                                       ā”‚
ā”‚   127 ā”‚   ā”‚   ā”‚   elif int(os.environ.get("LOCAL_RANK", -1)) != -1 and not cpu:       ā”‚
ā”‚   128 ā”‚   ā”‚   ā”‚   ā”‚   self.distributed_type = DistributedType.MULTI_GPU               ā”‚
ā”‚   129 ā”‚   ā”‚   ā”‚   ā”‚   if not torch.distributed.is_initialized():                      ā”‚
ā”‚ ā± 130 ā”‚   ā”‚   ā”‚   ā”‚   ā”‚   self.backend = kwargs.pop("backend")                        ā”‚
ā”‚   131 ā”‚   ā”‚   ā”‚   ā”‚   ā”‚   torch.distributed.init_process_group(backend=self.backend,  ā”‚
ā”‚   132 ā”‚   ā”‚   ā”‚   ā”‚   self.num_processes = torch.distributed.get_world_size()         ā”‚
ā”‚   133 ā”‚   ā”‚   ā”‚   ā”‚   self.process_index = torch.distributed.get_rank()               ā”‚
ā•°ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā•Æ
KeyError: 'backend'
[The second worker process printed an identical traceback, also ending in KeyError: 'backend'.]
[11:46:07] ERROR    failed (exitcode: 1) local_rank: 0 (pid: 103878) of        api.py:672
                    binary: /home/anaconda3/envs/lab/bin/python
ā•­ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ Traceback (most recent call last) ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā•®
ā”‚ /home/anaconda3/envs/lab/bin/accelerate:8 in <module>                        ā”‚
ā”‚                                                                                       ā”‚
ā”‚   5 from accelerate.commands.accelerate_cli import main                               ā”‚
ā”‚   6 if __name__ == '__main__':                                                        ā”‚
ā”‚   7 ā”‚   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])              ā”‚
ā”‚ ā± 8 ā”‚   sys.exit(main())                                                              ā”‚
ā”‚   9                                                                                   ā”‚
ā”‚                                                                                       ā”‚
ā”‚ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/acc ā”‚
ā”‚ elerate_cli.py:45 in main                                                             ā”‚
ā”‚                                                                                       ā”‚
ā”‚   42 ā”‚   ā”‚   exit(1)                                                                  ā”‚
ā”‚   43 ā”‚                                                                                ā”‚
ā”‚   44 ā”‚   # Run                                                                        ā”‚
ā”‚ ā± 45 ā”‚   args.func(args)                                                              ā”‚
ā”‚   46                                                                                  ā”‚
ā”‚   47                                                                                  ā”‚
ā”‚   48 if __name__ == "__main__":                                                       ā”‚
ā”‚                                                                                       ā”‚
ā”‚ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/lau ā”‚
ā”‚ nch.py:909 in launch_command                                                          ā”‚
ā”‚                                                                                       ā”‚
ā”‚   906 ā”‚   elif args.use_megatron_lm and not args.cpu:                                 ā”‚
ā”‚   907 ā”‚   ā”‚   multi_gpu_launcher(args)                                                ā”‚
ā”‚   908 ā”‚   elif args.multi_gpu and not args.cpu:                                       ā”‚
ā”‚ ā± 909 ā”‚   ā”‚   multi_gpu_launcher(args)                                                ā”‚
ā”‚   910 ā”‚   elif args.tpu and not args.cpu:                                             ā”‚
ā”‚   911 ā”‚   ā”‚   if args.tpu_use_cluster:                                                ā”‚
ā”‚   912 ā”‚   ā”‚   ā”‚   tpu_pod_launcher(args)                                              ā”‚
ā”‚                                                                                       ā”‚
ā”‚ /home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/lau ā”‚
ā”‚ nch.py:604 in multi_gpu_launcher                                                      ā”‚
ā”‚                                                                                       ā”‚
ā”‚   601 ā”‚   )                                                                           ā”‚
ā”‚   602 ā”‚   with patch_environment(**current_env):                                      ā”‚
ā”‚   603 ā”‚   ā”‚   try:                                                                    ā”‚
ā”‚ ā± 604 ā”‚   ā”‚   ā”‚   distrib_run.run(args)                                               ā”‚
ā”‚   605 ā”‚   ā”‚   except Exception:                                                       ā”‚
ā”‚   606 ā”‚   ā”‚   ā”‚   if is_rich_available() and debug:                                   ā”‚
ā”‚   607 ā”‚   ā”‚   ā”‚   ā”‚   console = get_console()                                         ā”‚
ā”‚                                                                                       ā”‚
ā”‚ /home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/run.p ā”‚
ā”‚ y:785 in run                                                                          ā”‚
ā”‚                                                                                       ā”‚
ā”‚   782 ā”‚   ā”‚   )                                                                       ā”‚
ā”‚   783 ā”‚                                                                               ā”‚
ā”‚   784 ā”‚   config, cmd, cmd_args = config_from_args(args)                              ā”‚
ā”‚ ā± 785 ā”‚   elastic_launch(                                                             ā”‚
ā”‚   786 ā”‚   ā”‚   config=config,                                                          ā”‚
ā”‚   787 ā”‚   ā”‚   entrypoint=cmd,                                                         ā”‚
ā”‚   788 ā”‚   )(*cmd_args)                                                                ā”‚
ā”‚                                                                                       ā”‚
ā”‚ /home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/launc ā”‚
ā”‚ her/api.py:134 in __call__                                                            ā”‚
ā”‚                                                                                       ā”‚
ā”‚   131 ā”‚   ā”‚   self._entrypoint = entrypoint                                           ā”‚
ā”‚   132 ā”‚                                                                               ā”‚
ā”‚   133 ā”‚   def __call__(self, *args):                                                  ā”‚
ā”‚ ā± 134 ā”‚   ā”‚   return launch_agent(self._config, self._entrypoint, list(args))         ā”‚
ā”‚   135                                                                                 ā”‚
ā”‚   136                                                                                 ā”‚
ā”‚   137 def _get_entrypoint_name(                                                       ā”‚
ā”‚                                                                                       ā”‚
ā”‚ /home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/launc ā”‚
ā”‚ her/api.py:250 in launch_agent                                                        ā”‚
ā”‚                                                                                       ā”‚
ā”‚   247 ā”‚   ā”‚   ā”‚   # if the error files for the failed children exist                  ā”‚
ā”‚   248 ā”‚   ā”‚   ā”‚   # @record will copy the first error (root cause)                    ā”‚
ā”‚   249 ā”‚   ā”‚   ā”‚   # to the error file of the launcher process.                        ā”‚
ā”‚ ā± 250 ā”‚   ā”‚   ā”‚   raise ChildFailedError(                                             ā”‚
ā”‚   251 ā”‚   ā”‚   ā”‚   ā”‚   name=entrypoint_name,                                           ā”‚
ā”‚   252 ā”‚   ā”‚   ā”‚   ā”‚   failures=result.failures,                                       ā”‚
ā”‚   253 ā”‚   ā”‚   ā”‚   )                                                                   ā”‚
ā•°ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā•Æ
ChildFailedError:
============================================================
codeparrot_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-05-10_11:46:07
  host      : YODA
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 103879)
  error_file: <N/A>
  traceback : To enable traceback see:
https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-10_11:46:07
  host      : YODA
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 103878)
  error_file: <N/A>
  traceback : To enable traceback see:
https://pytorch.org/docs/stable/elastic/errors.html
============================================================
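The immediate failure mode is visible in the traceback: `state.py` calls `kwargs.pop("backend")` with no default, and the kwargs built from `init_handler.to_kwargs()` evidently contain no `"backend"` key (likely because no kwargs handler was passed to `Accelerator`). A minimal illustration of that behavior, outside accelerate:

```python
# Minimal illustration of the failure mode (not the accelerate code itself):
# dict.pop raises KeyError when the key is absent and no default is given.
kwargs = {}  # the handler kwargs apparently arrive without a "backend" key

try:
    kwargs.pop("backend")                 # what state.py does -> KeyError
except KeyError as err:
    print("raised", err)                  # raised 'backend'

backend = kwargs.pop("backend", "nccl")   # a defensive variant with a default
print(backend)                            # nccl
```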

My accelerate config file :

$ cat /home/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: '[0,1]'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

My GPU details:

$ nvidia-smi
Wed May 10 12:15:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:09:00.0 Off |                  N/A |
| 36%   32C    P8     1W / 250W |     10MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:42:00.0  On |                  N/A |
| 36%   36C    P8    17W / 250W |    418MiB / 11264MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Expected behavior

I want to run the code on multiple GPUs on a single node/machine. @lewtun, @sgugger, could you help me get this running?
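One workaround I am considering but have not verified: the failing `kwargs.pop("backend")` is guarded by `if not torch.distributed.is_initialized():` (state.py line 129 in the traceback), so initializing the process group manually before constructing the `Accelerator` should skip that branch entirely. A sketch, under the assumption that the script runs under `accelerate launch` so `LOCAL_RANK` is set:

```python
# Untested workaround sketch: the failing kwargs.pop("backend") only runs
# when torch.distributed is not yet initialized (state.py:129-130 above),
# so initializing the process group first should bypass that code path.
import os

def make_accelerator():
    # Deferred imports: torch/accelerate are only needed when this is called.
    import torch.distributed as dist
    from accelerate import Accelerator

    # Mirror the launcher's own check: only initialize under a distributed run.
    if int(os.environ.get("LOCAL_RANK", -1)) != -1 and not dist.is_initialized():
        dist.init_process_group(backend="nccl")  # NCCL for single-node multi-GPU
    return Accelerator(dispatch_batches=True)
```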