Information

The problem arises in chapter:

To Reproduce

Steps to reproduce the behavior:
```
$ accelerate launch codeparrot_training.py
2023-05-10 11:45:58.950271: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
[I socket.cpp:566] [c10d] The server socket has started to listen on [::]:29500.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:44970.
[I socket.cpp:787] [c10d] The client socket has connected to [::ffff:127.0.0.1]:29500 on [::ffff:127.0.0.1]:44986.
2023-05-10 11:46:03.976884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
2023-05-10 11:46:03.993856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1

Traceback (most recent call last):
  File "/mnt/ssd2/tf_gpu_docker/ground0/git_repo/codeparrot/codeparrot_training.py", line 115, in <module>
    accelerator = Accelerator(dispatch_batches=True)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/accelerator.py", line 358, in __init__
    self.state = AcceleratorState(
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py", line 535, in __init__
    PartialState(cpu, **kwargs)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/state.py", line 130, in __init__
    self.backend = kwargs.pop("backend")
KeyError: 'backend'
```

Both worker processes print the identical traceback and `KeyError: 'backend'`, after which the launcher itself fails:

```
[11:46:07] ERROR failed (exitcode: 1) local_rank: 0 (pid: 103878) of binary: /home/anaconda3/envs/lab/bin/python
Traceback (most recent call last):
  File "/home/anaconda3/envs/lab/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/launch.py", line 909, in launch_command
    multi_gpu_launcher(args)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/accelerate/commands/launch.py", line 604, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/anaconda3/envs/lab/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
ChildFailedError:
============================================================
codeparrot_training.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time       : 2023-05-10_11:46:07
  host       : YODA
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 103879)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-05-10_11:46:07
  host       : YODA
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 103878)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
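From the traceback, the immediate failure is at `accelerate/state.py:130`, where `PartialState.__init__` calls `kwargs.pop("backend")` with no default: the kwargs forwarded from `Accelerator(dispatch_batches=True)` evidently never contain a `backend` key, so every rank dies with `KeyError: 'backend'`. The snippet below is a minimal illustration of that pattern, not the library's actual code; it only shows why the error fires and how a `pop()` default would avoid it. If this is a bug in the installed `accelerate` version, upgrading (`pip install -U accelerate`) may be the real fix.

```python
# Minimal illustration of the failing pattern (NOT accelerate's real code).
# PartialState.__init__ effectively does `kwargs.pop("backend")`, which
# raises KeyError when the caller never supplied that key.

def init_state(**kwargs):
    backend = kwargs.pop("backend")  # no default -> KeyError if key absent
    return backend

def init_state_defensive(**kwargs):
    # Defensive variant: fall back to NCCL, the usual multi-GPU backend.
    backend = kwargs.pop("backend", "nccl")
    return backend

try:
    init_state(cpu=False)  # mirrors the report: no 'backend' in kwargs
except KeyError as exc:
    print(f"KeyError: {exc}")  # -> KeyError: 'backend'

print(init_state_defensive(cpu=False))  # -> nccl
```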
My accelerate config file:
```
$ cat /home/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: '[0,1]'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
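Possibly unrelated to the `KeyError`, but the `gpu_ids` value looks unusual: `accelerate config` normally writes a plain comma-separated string such as `0,1` (or `all`), not a bracketed list. A hypothetical cleaned-up config for this two-GPU machine would look like the sketch below; every value other than `gpu_ids` is unchanged from the file above.

```yaml
# Hypothetical cleaned-up default_config.yaml (assumption: two local GPUs;
# only gpu_ids differs from the original file).
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1            # was '[0,1]'; accelerate expects e.g. 0,1 or all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```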
My GPU details:
ā¦ š 12:14:05 āÆ nvidia-smi Wed May 10 12:15:24 2023 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 NVIDIA GeForce ... Off | 00000000:09:00.0 Off | N/A | | 36% 32C P8 1W / 250W | 10MiB / 11264MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 1 NVIDIA GeForce ... Off | 00000000:42:00.0 On | N/A | | 36% 36C P8 17W / 250W | 418MiB / 11264MiB | 9% Default | | | | N/A | +-------------------------------+----------------------+----------------------+
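As a quick sanity check before involving `accelerate` at all, it may help to confirm that PyTorch itself sees both GPUs and has NCCL available (a minimal sketch, run in the training environment's Python):

```python
# Sanity check: PyTorch should see 2 CUDA devices and the NCCL backend,
# which the MULTI_GPU setup relies on.
import torch

print(torch.__version__)                      # installed torch version
print(torch.cuda.is_available())              # expect: True
print(torch.cuda.device_count())              # expect: 2 on this machine
print(torch.distributed.is_nccl_available())  # expect: True
```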
Expected behavior

I want to run the code on multiple GPUs on a single node/machine. @lewtun, @sgugger, can you help me run this?
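One thing that may be worth trying while debugging is passing the launch options explicitly on the command line instead of relying on the cached `default_config.yaml` (these are standard `accelerate launch` flags; a sketch, not a confirmed fix):

```sh
# Explicit two-process multi-GPU launch on one machine, bypassing the
# cached accelerate config.
accelerate launch --multi_gpu --num_processes 2 --mixed_precision fp16 \
    codeparrot_training.py
```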