microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

OOM error libcurand.so.10 #3279

Closed menkeyi closed 1 year ago

menkeyi commented 1 year ago

Machine configuration information

(deepspeed) [menkeyi@gpu1 ~]$ df -Th
Filesystem                 Type      Size  Used Avail Use% Mounted on
none                       overlay    79G   29G   46G  39% /
192.168.100.44@o2ib:/data  lustre     98T  4.3T   89T   5% /home
dev                        devtmpfs  991M     0  991M   0% /dev
tmpfs                      tmpfs     504G  646M  504G   1% /dev/shm
tmpfs                      tmpfs     504G   19M  504G   1% /run
tmpfs                      tmpfs     504G     0  504G   0% /sys/fs/cgroup

(deepspeed) [menkeyi@gpu1 ~]$ free -m
         total     used     free   shared  buff/cache  available
Mem:   1031741   115892   846810     1328       69039     821477
Swap:        0        0        0

(deepspeed) [menkeyi@gpu1 ~]$ cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

=============================train
(deepspeed) [menkeyi@gpu1 DeepSpeed-Chat]$ python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type single_node
---=== Running Step 1 ===---
Running: bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/single_node/run_1.3b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
---=== Finished Step 1 in 2:45:20 ===---
---=== Running Step 2 ===---
Running: bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/training_scripts/single_node/run_350m.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m
---=== Finished Step 2 in 3:56:06 ===---
---=== Running Step 3 ===---
Running: bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_1.3b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m '' '' /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b
Traceback (most recent call last):
  File "/home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 210, in <module>
    main(args)
  File "/home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 195, in main
    launch_cmd(args, step_num, cmd)
  File "/home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 175, in launch_cmd
    raise RuntimeError('\n\n'.join((
RuntimeError: Step 3 exited with non-zero status 1

Launch command: bash /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_1.3b.sh /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m '' '' /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b

Log output: /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b/training.log

Please see our tutorial at https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning

Please check that you have installed our requirements: pip install -r requirements.txt

If you are seeing an OOM error, try modifying /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_1.3b.sh:

=========================================training.log
(deepspeed) [menkeyi@gpu1 ~]$ tail -f DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log
[2023-04-17 22:53:45,410] [INFO] [logging.py:96:log_dist] [Rank 0] step=4130, skipped=74, lr=[7.642152964180552e-09, 7.642152964180552e-09], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-04-17 22:53:46,284] [INFO] [timer.py:199:stop] epoch=1/micro_step=2065/global_step=4130, RunningAvgSamplesPerSec=30.58440684669661, CurrSamplesPerSec=30.61588841412397, MemAllocated=4.94GB, MaxMemAllocated=23.6GB
Evaluating perplexity, Epoch 2/2
ppl: 2.7952592372894287
saving the final model ...
[2023-04-17 22:54:07,899] [INFO] [launch.py:460:main] Process 24672 exits successfully.
[2023-04-17 22:54:07,899] [INFO] [launch.py:460:main] Process 24668 exits successfully.
[2023-04-17 22:54:08,901] [INFO] [launch.py:460:main] Process 24669 exits successfully.
[2023-04-17 22:54:08,901] [INFO] [launch.py:460:main] Process 24671 exits successfully.
[2023-04-17 22:54:08,902] [INFO] [launch.py:460:main] Process 24673 exits successfully.
[2023-04-17 22:54:09,903] [INFO] [launch.py:460:main] Process 24670 exits successfully.
[2023-04-17 22:54:09,904] [INFO] [launch.py:460:main] Process 24667 exits successfully.
[2023-04-17 22:54:10,905] [INFO] [launch.py:460:main] Process 24666 exits successfully.

====================================step3-models/1.3b/training.log
(deepspeed) [menkeyi@gpu1 output]$ tail -f /home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b/training.log
    self.module = DeepSpeedOPTInference(_config, mp_group=self.mp_group)
  File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/deepspeed/model_implementations/transformers/ds_opt.py", line 20, in __init__
ImportError: libcurand.so.10: cannot open shared object file: No such file or directory
    super().__init__(config, mp_group, quantize_scales, quantize_groups, merge_count, mlp_extra_grouping)
  File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 54, in __init__
    inference_cuda_module = builder.load()
  File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 449, in load
    return self.jit_load(verbose)
  File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1534, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1936, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 565, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1173, in create_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
ImportError: libcurand.so.10: cannot open shared object file: No such file or directory
[2023-04-18 00:11:04,557] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49766
[2023-04-18 00:11:04,734] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49767
[2023-04-18 00:11:05,230] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49768
[2023-04-18 00:11:05,233] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49769
[2023-04-18 00:11:05,448] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49770
[2023-04-18 00:11:05,448] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49771
[2023-04-18 00:11:05,451] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49772
[2023-04-18 00:11:05,453] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 49773
[2023-04-18 00:11:06,069] [ERROR] [launch.py:434:sigkill_handler] ['/home/menkeyi/.conda/envs/deepspeed/bin/python', '-u', 'main.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--actor_model_name_or_path', '/home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b', '--critic_model_name_or_path', '/home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m', '--num_padding_at_beginning', '1', '--per_device_train_batch_size', '4', '--per_device_mini_train_batch_size', '4', '--generation_batch_numbers', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '256', '--max_prompt_seq_len', '256', '--actor_learning_rate', '9.65e-6', '--critic_learning_rate', '5e-6', '--actor_weight_decay', '0.1', '--critic_weight_decay', '0.1', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--gradient_accumulation_steps', '1', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--enable_hybrid_engine', '--actor_zero_stage', '2', '--critic_zero_stage', '2', '--output_dir', '/home/menkeyi/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/1.3b'] exits with return code = 1
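Because several ranks write to the same training.log, their tracebacks interleave and the underlying exception is easy to miss (here it is an ImportError, not an OOM). A minimal Python sketch for counting the distinct exception lines in such a log follows. The helper name `summarize_errors` and the regular expression are illustrative assumptions, not part of DeepSpeed.

```python
import re
import sys
from collections import Counter

# Matches the last line of a Python traceback, e.g. "ImportError: message".
# Assumption: exception class names end in "Error" or "Exception".
ERROR_LINE = re.compile(r"^\s*(\w+(?:Error|Exception)): (.+)$")


def summarize_errors(log_text):
    """Count each distinct (exception type, message) pair found in a log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2).strip())] += 1
    return counts


if __name__ == "__main__":
    # Usage: python summarize_errors.py /path/to/training.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            for (exc_type, message), n in summarize_errors(fh.read()).most_common():
                print(f"{n}x {exc_type}: {message}")
```

Run against the step 3 log, this should surface the repeated libcurand.so.10 ImportError without the launcher's INFO lines around it.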

PyTorch was installed in the deepspeed conda environment with:
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116

Search results for libcurand.so.10:

(deepspeed) [menkeyi@gpu1 output]$ find / -name libcurand.so.10
...
/home/menkeyi/.conda/envs/deepspeed/lib/python3.9/site-packages/nvidia/curand/lib/libcurand.so.10
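The find output shows the library exists, but only inside the environment's site-packages, which the dynamic loader does not normally search. One possible workaround, which nobody in this thread has confirmed, is to add that directory to LD_LIBRARY_PATH before launching training. The Python sketch below locates the directory and prints an export line. The helper names `find_library_dirs` and `ld_library_path_export` are hypothetical.

```python
import sysconfig
from pathlib import Path


def find_library_dirs(root, lib_name):
    """Return the sorted directories under root that contain a file named lib_name."""
    return sorted({str(path.parent) for path in Path(root).rglob(lib_name)})


def ld_library_path_export(dirs):
    """Build a shell line that prepends dirs to LD_LIBRARY_PATH."""
    joined = ":".join(dirs)
    return f'export LD_LIBRARY_PATH="{joined}:$LD_LIBRARY_PATH"'


if __name__ == "__main__":
    # site-packages of the interpreter running this script
    site_packages = sysconfig.get_paths()["purelib"]
    dirs = find_library_dirs(site_packages, "libcurand.so.10")
    if dirs:
        print(ld_library_path_export(dirs))
    else:
        print(f"libcurand.so.10 not found under {site_packages}")
```

Run it with the environment's own Python so that site-packages resolves to the conda environment, then paste the printed line into the shell that starts train.py.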

loadams commented 1 year ago

Hi @menkeyi - are you still hitting this issue? Can you try again with the latest DeepSpeed? This can also usually be resolved by reinstalling CUDA and cudart/cuRAND.

If you are still hitting this issue, can you open a new issue and link this one? Thanks!
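Before and after a reinstall, it may help to check whether the dynamic loader can open the library at all, without running the whole step 3 pipeline. A minimal Python sketch using the standard `ctypes` module follows. The helper name `can_load` is an illustrative assumption.

```python
import ctypes


def can_load(lib_name):
    """Try to open a shared library by name.

    Returns (True, None) on success, or (False, error message) if the
    dynamic loader cannot find or open it.
    """
    try:
        ctypes.CDLL(lib_name)
    except OSError as exc:
        return False, str(exc)
    return True, None


if __name__ == "__main__":
    ok, err = can_load("libcurand.so.10")
    print("libcurand.so.10 is loadable" if ok else f"not loadable: {err}")
```

If this reports the same "cannot open shared object file" message as the training log, the failure comes from the library search path and not from DeepSpeed's JIT build.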