ifromeast opened this issue 1 year ago
@JThh Could you help take a look at this problem?
Looks like there are some racing processes. Can you check `ps aux | grep python` and kill any other unused processes before running again?
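If it helps, here is a rough Python equivalent of that check (a sketch only; the match string and the decision to kill are up to you):

```python
import os
import signal
import subprocess

# Rough equivalent of `ps aux | grep python`: list candidate stray trainer
# processes so they can be inspected and, if truly stale, SIGKILLed.
ps = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
for line in ps.splitlines():
    fields = line.split(maxsplit=10)
    if len(fields) == 11 and fields[1].isdigit() and "python" in fields[10]:
        pid = int(fields[1])
        if pid != os.getpid():
            print(pid, fields[10])
            # os.kill(pid, signal.SIGKILL)  # uncomment only for processes you know are stale
```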
@JThh
> `ps aux | grep python`

I killed all the listed processes, but the error persists:
GPU Memory Usage:
0 0 MiB
1 0 MiB
2 0 MiB
3 0 MiB
4 0 MiB
5 0 MiB
6 0 MiB
7 0 MiB
Now CUDA_VISIBLE_DEVICES is set to:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388065 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388067 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388068 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388069 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388070 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388071 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388072 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 1388066) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
train_prompts.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-19_19:36:19
host : BJ-G104-79-70.local
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 1388066)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1388066
========================================================
Alright, then I'd think it is due to main-memory (CPU RAM) OOM. Can you check whether `dmesg -T | egrep -i 'killed process'` returns any message?
> `dmesg -T | egrep -i 'killed process'`

If I get a message like the one below, does it mean it comes from main-memory OOM?

Killed process 57636 (python3.10) total-vm:102681144kB, anon-rss:62998680kB, file-rss:115876kB, shmem-rss:49160kB
Yes, likely.
Now you might want to try our other strategies to lower your main memory usage!
@JThh I see that
colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
which means the global batch size is 8. So how can I set DP=1 and TP=8?
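(For reference, the arithmetic behind this reading of the log:)

```python
# With DP only (TP = PP = 1), every GPU holds a full model replica and
# consumes its own micro-batch, so the effective global batch size is:
per_gpu_batch = 1
dp_size = 8                                # data parallel size from the log above
global_batch = per_gpu_batch * dp_size     # = 8
```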
This example does not support TP yet. Have you tried the `colossalai_gemini` strategy with placement set to `cuda`?
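For context, the example selects its strategy roughly like this (a sketch based on the Coati examples of that era; the class names and keyword arguments are assumptions and may differ in your version):

```python
from coati.trainer.strategies import ColossalAIStrategy, DDPStrategy

def build_strategy(name: str):
    if name == "colossalai_gemini":
        # Gemini (ZeRO-3-style) sharding; placement controls where the
        # sharded parameters and optimizer states live.
        return ColossalAIStrategy(stage=3, placement_policy="cuda")
    if name == "colossalai_zero2":
        return ColossalAIStrategy(stage=2)
    if name == "ddp":
        return DDPStrategy()
    raise ValueError(f"unknown strategy: {name}")
```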
@JThh Yes, but it still OOMs. I am curious why OOM occurs even on 8×A6000 (40GB) with BS=1. Can you give more advice?
May I know when the OOM happened? Was it after model init or at the start of the first epoch of training?
With the same strategy, how about setting placement to 'cpu'? Some users reported it worked.
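That is, keeping the Gemini strategy but offloading to host memory, roughly (keyword name assumed from the sketch above):

```python
# Same strategy, but parameters/optimizer states are kept in CPU memory and
# streamed to the GPU on demand; note this trades GPU memory for host RAM.
strategy = ColossalAIStrategy(stage=3, placement_policy="cpu")
```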
@JThh It still OOMs; the following is the log:
GPU Memory Usage:
0 0 MiB
1 0 MiB
2 0 MiB
3 0 MiB
4 0 MiB
5 0 MiB
6 0 MiB
7 0 MiB
Now CUDA_VISIBLE_DEVICES is set to:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[04/28/23 13:56:11] INFO colossalai - colossalai - INFO:
/usr/local/lib/python3.8/dist-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[The same set_device message for ranks 1 through 7 omitted.]
[04/28/23 13:56:13] INFO colossalai - colossalai - INFO:
/usr/local/lib/python3.8/dist-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42, the default parallel seed is ParallelMode.DATA.
[The same set_seed message for ranks 1 through 7 omitted.]
INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:115 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
Loading checkpoint shards: 100%|██████████| 33/33 [00:24<00:00, 1.36it/s]
[The same 33-shard loading progress line is printed by each of the 8 ranks; duplicates omitted.]
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/root/anaconda3/envs/rlhf/lib')}
warn(msg)
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /root/anaconda3/envs/rlhf did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda117.so...
[The same bitsandbytes BUG REPORT / CUDA SETUP block above is printed once per rank (8 times in total); duplicates omitted.]
Loading checkpoint shards: 100%|██████████| 33/33 [00:24<00:00, 1.34it/s]
Some weights of the model checkpoint at decapoda-research/llama-7b-hf were not used when initializing LlamaModel: ['lm_head.weight']
- This IS expected if you are initializing LlamaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[The same progress line and lm_head.weight warning repeat once per rank; duplicates omitted.]
[04/28/23 13:59:16] INFO colossalai - ProcessGroup - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/tensor/process_group.py:22
log_pg_init
INFO colossalai - ProcessGroup - INFO: Pytorch ProcessGroup Init:
backend: nccl
ranks: [0]
[Similar ProcessGroup init messages for ranks [1] through [7] omitted.]
INFO colossalai - ProcessGroup - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/tensor/process_group.py:22
log_pg_init
INFO colossalai - ProcessGroup - INFO: Pytorch ProcessGroup Init:
backend: nccl
ranks: [0, 1, 2, 3, 4, 5, 6, 7]
[Further per-rank 'Loading checkpoint shards' progress lines and repeats of the same lm_head.weight warning omitted.]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
[04/28/23 14:03:31] INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/prompt_dataset.py:25 __init__
INFO colossalai - colossalai - INFO: Loading data...
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/prompt_dataset.py:27 __init__
INFO colossalai - colossalai - INFO: Loaded 429 examples.
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/prompt_dataset.py:30 __init__
INFO colossalai - colossalai - INFO: Limiting dataset to 16384 examples.
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:125 __init__
INFO colossalai - colossalai - INFO: Loading data...
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:127 __init__
INFO colossalai - colossalai - INFO: Loaded 52191 examples.
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:130 __init__
INFO colossalai - colossalai - INFO: Limiting dataset to 16384 examples.
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:133 __init__
INFO colossalai - colossalai - INFO: Formatting inputs...
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:141 __init__
INFO colossalai - colossalai - INFO: Tokenizing inputs... This may take some time...
[extension] OP colossalai._C.cpu_adam has been compileed ahead of time, skip building.
[extension] OP colossalai._C.fused_optim has been compileed ahead of time, skip building.
[The tokenizer-class warning, the prompt/SFT dataset loading sequence, and the [extension] lines above repeat for the remaining ranks; duplicates omitted.]
searching chunk configuration is completed in 0.54 s.
used number: 6426.25 MB, wasted number: 705.62 MB
total wasted percentage is 9.89%
searching chunk configuration is completed in 0.54 s.
used number: 6301.26 MB, wasted number: 706.40 MB
total wasted percentage is 10.08%
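(The wasted percentage is just wasted / (used + wasted), e.g. for the first chunk search above:)

```python
used, wasted = 6426.25, 705.62          # MB, from the Gemini chunk search log
pct = wasted / (used + wasted) * 100    # ≈ 9.89%
```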
[04/28/23 14:04:44] INFO colossalai - coati.trainer.strategies.colossalai - INFO:
/root/alpaca_test/TeachBot/rlhf/coati/trainer/strategies/colossalai.py:164 _unwrap_model
INFO colossalai - coati.trainer.strategies.colossalai - INFO: model type: <class
'colossalai.zero.gemini.gemini_ddp.GeminiDDP'>, get static torch model
[The same _unwrap_model / GeminiDDP message repeats for the remaining ranks; duplicates omitted.]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194177 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194178 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194179 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194180 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194181 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194182 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194183 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 2194184) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
train_prompts.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-28_14:04:46
host : BJ-G104-79-70.local
rank : 7 (local_rank: 7)
exitcode : -9 (pid: 2194184)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 2194184
========================================================
@JThh I think it may be because DP=8 causes OOM. How can I use PP to reduce the global batch size to 1?
PP is not quite applicable and has not been tested in this scenario yet. Have you tried the `ddp` strategy, without colossalai?
Plus, can you use the torch profiler to track your memory usage, so we can see which step caused the OOM?
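Something along these lines should work for memory tracking (a minimal sketch using the stock torch.profiler API; `run_one_step` is a placeholder for whichever phase you suspect):

```python
from torch.profiler import ProfilerActivity, profile

# Wrap the suspected phase (model init, or the first training step) and sort
# the report by CPU memory, since the SIGKILL points at host-RAM exhaustion.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    run_one_step()  # placeholder: the code you want to measure

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
```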
@JThh Hello, I got the same error when trying to train a 34B model on 3 nodes (8 × 40G GPUs, 500G main memory). I have seen the CPU memory usage reach 100%. Is there any way to solve this problem?
Plus, I set tp = 8 on each node, so I guess it may be initializing the model 8 times.
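That guess is consistent with a back-of-envelope estimate: if each of the 8 ranks on a node materializes a full fp16 copy of the model in host memory before sharding, the node's 500G of RAM is already exceeded:

```python
# Rough host-RAM estimate (assumes a full fp16 replica per rank during init).
params = 34e9                                  # 34B parameters
bytes_per_param = 2                            # fp16
per_rank_gb = params * bytes_per_param / 1e9   # 68 GB per copy
node_total_gb = per_rank_gb * 8                # 544 GB > 500 GB of main memory
```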
https://github.com/vllm-project/vllm/issues/8998#issuecomment-2413388800
just delete the `assert`
🐛 Describe the bug
GPU: 8×A6000
CUDA Version: 11.7
Python Version: 3.8.10
colossalai Version: 0.2.8
When I train PPO, the ERROR shown above occurs. What should I do to overcome it?
Environment
GPU: 8×A6000
CUDA Version: 11.7
Python Version: 3.8.10
colossalai Version: 0.2.8