ifromeast opened this issue 1 year ago
@JThh Could you help take a look at this problem?
Looks like there are some racing processes. Can you check `ps aux | grep python` and kill any other unused processes before running again?
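If it helps, here is a rough Python equivalent of that check (a sketch only; the match string and the decision to kill are up to you):

```python
import os
import signal
import subprocess

# Rough equivalent of `ps aux | grep python`: list candidate stray trainer
# processes so they can be inspected and, if truly stale, SIGKILLed.
ps = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
for line in ps.splitlines():
    fields = line.split(maxsplit=10)
    if len(fields) == 11 and fields[1].isdigit() and "python" in fields[10]:
        pid = int(fields[1])
        if pid != os.getpid():
            print(pid, fields[10])
            # os.kill(pid, signal.SIGKILL)  # uncomment only for processes you know are stale
```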
@JThh
> `ps aux | grep python`

I killed all the listed processes, but the error persists:
GPU Memory Usage:
0 0 MiB
1 0 MiB
2 0 MiB
3 0 MiB
4 0 MiB
5 0 MiB
6 0 MiB
7 0 MiB
Now CUDA_VISIBLE_DEVICES is set to:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388065 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388067 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388068 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388069 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388070 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388071 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1388072 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 1388066) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
train_prompts.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-19_19:36:19
host : BJ-G104-79-70.local
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 1388066)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 1388066
========================================================
Alright, then I'd think it is due to main-memory (CPU RAM) OOM. Can you check whether `dmesg -T | egrep -i 'killed process'` returns any message?
> `dmesg -T | egrep -i 'killed process'`

If I get a message like the one below, does it mean it comes from main-memory OOM?

Killed process 57636 (python3.10) total-vm:102681144kB, anon-rss:62998680kB, file-rss:115876kB, shmem-rss:49160kB
Yes, likely.
Now you might want to try our other strategies to lower your main memory usage!
@JThh I see that
colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
which means the global batch size is 8. So how can I set DP=1 and TP=8?
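(For reference, the arithmetic behind this reading of the log:)

```python
# With DP only (TP = PP = 1), every GPU holds a full model replica and
# consumes its own micro-batch, so the effective global batch size is:
per_gpu_batch = 1
dp_size = 8                                # data parallel size from the log above
global_batch = per_gpu_batch * dp_size     # = 8
```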
This example does not support TP yet. Have you tried the `colossalai_gemini` strategy with placement set to `cuda`?
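For context, the example selects its strategy roughly like this (a sketch based on the Coati examples of that era; the class names and keyword arguments are assumptions and may differ in your version):

```python
from coati.trainer.strategies import ColossalAIStrategy, DDPStrategy

def build_strategy(name: str):
    if name == "colossalai_gemini":
        # Gemini (ZeRO-3-style) sharding; placement controls where the
        # sharded parameters and optimizer states live.
        return ColossalAIStrategy(stage=3, placement_policy="cuda")
    if name == "colossalai_zero2":
        return ColossalAIStrategy(stage=2)
    if name == "ddp":
        return DDPStrategy()
    raise ValueError(f"unknown strategy: {name}")
```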
@JThh Yes, but it still OOMs. I am curious why OOM occurs even on 8×A6000 (40GB) with BS=1. Can you give more advice?
May I know when the OOM happened? Was it after model init or at the start of the first epoch of training?
With the same strategy, how about setting placement to 'cpu'? Some users reported it worked.
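That is, keeping the Gemini strategy but offloading to host memory, roughly (keyword name assumed from the sketch above):

```python
# Same strategy, but parameters/optimizer states are kept in CPU memory and
# streamed to the GPU on demand; note this trades GPU memory for host RAM.
strategy = ColossalAIStrategy(stage=3, placement_policy="cpu")
```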
@JThh It still OOMs; the following is the log:
GPU Memory Usage:
0 0 MiB
1 0 MiB
2 0 MiB
3 0 MiB
4 0 MiB
5 0 MiB
6 0 MiB
7 0 MiB
Now CUDA_VISIBLE_DEVICES is set to:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[04/28/23 13:56:11] INFO colossalai - colossalai - INFO:
/usr/local/lib/python3.8/dist-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[The same set_device message for ranks 1 through 7 omitted.]
[04/28/23 13:56:13] INFO colossalai - colossalai - INFO:
/usr/local/lib/python3.8/dist-packages/colossalai/context/parallel_context.py:558 set_seed
INFO colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 42, python random: 42, ParallelMode.DATA: 42, ParallelMode.TENSOR: 42, the default parallel seed is ParallelMode.DATA.
[The same set_seed message for ranks 1 through 7 omitted.]
INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:115 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 8, pipeline parallel size: 1, tensor parallel size: 1
Loading checkpoint shards: 100%|██████████| 33/33 [00:24<00:00, 1.36it/s]
[The same 33-shard loading progress line is printed by each of the 8 ranks; duplicates omitted.]
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/root/anaconda3/envs/rlhf/lib')}
warn(msg)
/usr/local/lib/python3.8/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /root/anaconda3/envs/rlhf did not contain libcudart.so as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.7/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /usr/local/lib/python3.8/dist-packages/bitsandbytes/libbitsandbytes_cuda117.so...
[The same bitsandbytes BUG REPORT / CUDA SETUP block above is printed once per rank (8 times in total); duplicates omitted.]
Loading checkpoint shards: 100%|██████████| 33/33 [00:24<00:00, 1.34it/s]
Some weights of the model checkpoint at decapoda-research/llama-7b-hf were not used when initializing LlamaModel: ['lm_head.weight']
- This IS expected if you are initializing LlamaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
[The same progress line and lm_head.weight warning repeat once per rank; duplicates omitted.]
[04/28/23 13:59:16] INFO colossalai - ProcessGroup - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/tensor/process_group.py:22
log_pg_init
INFO colossalai - ProcessGroup - INFO: Pytorch ProcessGroup Init:
backend: nccl
ranks: [0]
[Similar ProcessGroup init messages for ranks [1] through [7] omitted.]
INFO colossalai - ProcessGroup - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/tensor/process_group.py:22
log_pg_init
INFO colossalai - ProcessGroup - INFO: Pytorch ProcessGroup Init:
backend: nccl
ranks: [0, 1, 2, 3, 4, 5, 6, 7]
[Further per-rank 'Loading checkpoint shards' progress lines and repeats of the same lm_head.weight warning omitted.]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'LLaMATokenizer'.
The class this function is called from is 'LlamaTokenizer'.
[04/28/23 14:03:31] INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/prompt_dataset.py:25 __init__
INFO colossalai - colossalai - INFO: Loading data...
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/prompt_dataset.py:27 __init__
INFO colossalai - colossalai - INFO: Loaded 429 examples.
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/prompt_dataset.py:30 __init__
INFO colossalai - colossalai - INFO: Limiting dataset to 16384 examples.
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:125 __init__
INFO colossalai - colossalai - INFO: Loading data...
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:127 __init__
INFO colossalai - colossalai - INFO: Loaded 52191 examples.
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:130 __init__
INFO colossalai - colossalai - INFO: Limiting dataset to 16384 examples.
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:133 __init__
INFO colossalai - colossalai - INFO: Formatting inputs...
INFO colossalai - colossalai - INFO: /root/alpaca_test/TeachBot/rlhf/coati/dataset/sft_dataset.py:141 __init__
INFO colossalai - colossalai - INFO: Tokenizing inputs... This may take some time...
[extension] OP colossalai._C.cpu_adam has been compileed ahead of time, skip building.
[extension] OP colossalai._C.fused_optim has been compileed ahead of time, skip building.
[The tokenizer-class warning, the prompt/SFT dataset loading sequence, and the [extension] lines above repeat for the remaining ranks; duplicates omitted.]
searching chunk configuration is completed in 0.54 s.
used number: 6426.25 MB, wasted number: 705.62 MB
total wasted percentage is 9.89%
searching chunk configuration is completed in 0.54 s.
used number: 6301.26 MB, wasted number: 706.40 MB
total wasted percentage is 10.08%
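(The wasted percentage is just wasted / (used + wasted), e.g. for the first chunk search above:)

```python
used, wasted = 6426.25, 705.62          # MB, from the Gemini chunk search log
pct = wasted / (used + wasted) * 100    # ≈ 9.89%
```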
[04/28/23 14:04:44] INFO colossalai - coati.trainer.strategies.colossalai - INFO:
/root/alpaca_test/TeachBot/rlhf/coati/trainer/strategies/colossalai.py:164 _unwrap_model
INFO colossalai - coati.trainer.strategies.colossalai - INFO: model type: <class
'colossalai.zero.gemini.gemini_ddp.GeminiDDP'>, get static torch model
[The same _unwrap_model / GeminiDDP message repeats for the remaining ranks; duplicates omitted.]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194177 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194178 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194179 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194180 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194181 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194182 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2194183 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 7 (pid: 2194184) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
train_prompts.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-28_14:04:46
host : BJ-G104-79-70.local
rank : 7 (local_rank: 7)
exitcode : -9 (pid: 2194184)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 2194184
========================================================
@JThh I think it may be because DP=8 causes OOM. How can I use PP to reduce the global batch size to 1?
PP is not quite applicable and has not been tested in this scenario yet. Have you tried the `ddp` strategy, without colossalai?
Plus, can you use the torch profiler to track your memory usage, so we can see which step caused the OOM?
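Something along these lines should work for memory tracking (a minimal sketch using the stock torch.profiler API; `run_one_step` is a placeholder for whichever phase you suspect):

```python
from torch.profiler import ProfilerActivity, profile

# Wrap the suspected phase (model init, or the first training step) and sort
# the report by CPU memory, since the SIGKILL points at host-RAM exhaustion.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
) as prof:
    run_one_step()  # placeholder: the code you want to measure

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
```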
@JThh Hello, I got the same error when trying to train a 34B model on 3 nodes (8 × 40G GPUs, 500G main memory). I have seen the CPU memory usage reach 100%. Is there any way to solve this problem?
Plus, I set tp = 8 on each node, so I guess it may be initializing the model 8 times.
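That guess is consistent with a back-of-envelope estimate: if each of the 8 ranks on a node materializes a full fp16 copy of the model in host memory before sharding, the node's 500G of RAM is already exceeded:

```python
# Rough host-RAM estimate (assumes a full fp16 replica per rank during init).
params = 34e9                                  # 34B parameters
bytes_per_param = 2                            # fp16
per_rank_gb = params * bytes_per_param / 1e9   # 68 GB per copy
node_total_gb = per_rank_gb * 8                # 544 GB > 500 GB of main memory
```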
https://github.com/vllm-project/vllm/issues/8998#issuecomment-2413388800
just delete the `assert`
🐛 Describe the bug
GPU: 8×A6000
CUDA Version: 11.7
Python Version: 3.8.10
colossalai Version: 0.2.8
When I train PPO, the ERROR shown above occurs. What should I do to overcome it?
Environment
GPU: 8×A6000
CUDA Version: 11.7
Python Version: 3.8.10
colossalai Version: 0.2.8