uclaml / SPIN

The official implementation of Self-Play Fine-Tuning (SPIN)
https://uclaml.github.io/SPIN/
Apache License 2.0

GPU Memory question #21

Open fangyuan-ksgk opened 8 months ago

fangyuan-ksgk commented 8 months ago

Hello! Thanks for the open-sourced code release. I have been trying to run the fine-tuning with a phi-2 (~3B-parameter) model on a 40GB A100 GPU. When running accelerate launch spin/run_spin.py configs/config.yaml I get GPU out-of-memory errors, which really confuses me, since I have set the batch size to 1 and the number of processes to 1. I cannot imagine what is consuming so much memory:

[INFO|trainer.py:571] 2024-02-29 14:04:29,359 >> Using auto half precision backend
[INFO|trainer.py:1721] 2024-02-29 14:04:32,728 >> ***** Running training *****
[INFO|trainer.py:1722] 2024-02-29 14:04:32,728 >>   Num examples = 20
[INFO|trainer.py:1723] 2024-02-29 14:04:32,728 >>   Num Epochs = 3
[INFO|trainer.py:1724] 2024-02-29 14:04:32,728 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2024-02-29 14:04:32,728 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1728] 2024-02-29 14:04:32,728 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1729] 2024-02-29 14:04:32,728 >>   Total optimization steps = 60
[INFO|trainer.py:1730] 2024-02-29 14:04:32,729 >>   Number of trainable parameters = 2,779,683,840
  0% 0/60 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
[WARNING|modeling_utils.py:1126] 2024-02-29 14:04:34,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
Traceback (most recent call last):
  File "/content/SPIN/spin/run_spin.py", line 206, in <module>
    main()
  File "/content/SPIN/spin/run_spin.py", line 169, in main
    train_result = spin_trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1917, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 145, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 373, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/rmsprop.py", line 115, in step
    self._init_group(group, params_with_grad, grads, square_avgs, momentum_buffer_list, grad_avgs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/rmsprop.py", line 72, in _init_group
    state["square_avg"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacty of 39.56 GiB of which 64.81 MiB is free. Process 468145 has 39.49 GiB memory in use. Of the allocated memory 37.54 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0% 0/60 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'spin/run_spin.py', 'configs/config.yaml']' returned non-zero exit status 1.
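For rough context (a back-of-the-envelope estimate, not something reported in the logs): full fine-tuning keeps the model weights, their gradients, and RMSprop's square_avg buffer for all ~2.78B trainable parameters resident on the GPU. If those tensors are held in fp32, that alone is about 3 × 2.78e9 × 4 bytes ≈ 31 GiB before activations and CUDA workspace, which is consistent with the 37.54 GiB PyTorch reports as allocated. A per-device batch size of 1 therefore does not by itself avoid the OOM.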

angelahzyuan commented 7 months ago


You might need to specify a DeepSpeed configuration. Check scripts/finetune.sh for the full launch command; a sketch is below. Let us know if you are still having problems.
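For reference, a minimal sketch of a DeepSpeed-enabled launch, assuming the repo's configs/deepspeed_zero3.yaml accelerate config; the exact paths, flags, and values should be taken from scripts/finetune.sh rather than from here:

    # Launch with the ZeRO-3 accelerate config so optimizer states and
    # gradients are partitioned or offloaded instead of kept fully on one GPU.
    # --num_processes and the config paths here are illustrative; match them
    # to your hardware and to what scripts/finetune.sh actually uses.
    accelerate launch --config_file configs/deepspeed_zero3.yaml \
        --num_processes=1 \
        spin/run_spin.py configs/config.yaml

On a single 40GB card with a ~2.8B-parameter full fine-tune, enabling CPU offload of optimizer states in the DeepSpeed config will likely also be needed, since partitioning alone does not help with only one process.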