Closed Sakura-gh closed 1 year ago
@Sakura-gh hello, I reproduced your run script and it runs OK; I used my own dataset.json. Can you check whether your dataset is set up correctly? A simple way is to test the code with another, small dataset.
hi~ I've tried a small dataset (just copied 10 json items from my original hg_openwebtext.json), and the script runs successfully. But when I use the full hg_openwebtext.json dataset (about 39 GB), the same error occurs again!
So it seems the script fails when using a big dataset? Do some configs need to be adjusted?
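A quick way to build such a small debugging subset is sketched below. This is an illustration, not part of the example repo, and it assumes the dataset is in JSON-lines format (one JSON object per line); adjust if your file is a single JSON array.

```python
import itertools
import json
import os
import tempfile

def make_debug_subset(src_path, dst_path, n=10):
    """Copy the first n records of a JSON-lines file into a small
    debug file; returns the number of lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in itertools.islice(src, n):
            dst.write(line)
            written += 1
    return written

if __name__ == "__main__":
    # Self-contained demo on a throwaway file; in practice, point
    # src_path at the real dataset (e.g. hg_openwebtext.json).
    src = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
    for i in range(1000):
        src.write(json.dumps({"text": f"doc {i}"}) + "\n")
    src.close()
    dst = src.name + ".small"
    print(make_debug_subset(src.name, dst, n=10))  # -> 10
    os.remove(src.name)
    os.remove(dst)
```

Training succeeding on the subset but failing on the full 39 GB file points toward a scale-dependent problem (memory, shutdown timing) rather than malformed records.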
update: After repeating the experiment 30+ times, I found the issue is not reliably reproducible; it occurs roughly 40% of the time, and always after training has finished, so it seems to have little impact on the results. As for the original problem, I found that after copying the original json file (with cp, cat >, etc.) to a file named train_data.json, the problem mostly stopped appearing. Feels like black magic...
old: Hi, I gave it a try: when the json file reaches 1,000,000 items, the problem above appears. The log is below (here I set the number of epochs to 10):
(mlsys) root@28c67ac89ed8:/home/gehao/ColossalAI/ColossalAI-Examples/language/gpt# bash train.sh
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[08/03/22 05:51:43] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/context/parallel_context.py:52
1 set_device
[08/03/22 05:51:43] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/context/parallel_context.py:52
1 set_device
INFO colossalai - colossalai - INFO: process rank 1 is
bound to device 1
INFO colossalai - colossalai - INFO: process rank 3 is
bound to device 3
[08/03/22 05:51:43] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/context/parallel_context.py:52
1 set_device
[08/03/22 05:51:43] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/context/parallel_context.py:52
1 set_device
INFO colossalai - colossalai - INFO: process rank 0 is
bound to device 0
INFO colossalai - colossalai - INFO: process rank 2 is
bound to device 2
[08/03/22 05:51:46] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/context/parallel_context.py:55
7 set_seed
[08/03/22 05:51:46] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/context/parallel_context.py:55
7 set_seed
[08/03/22 05:51:46] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/context/parallel_context.py:55
7 set_seed
INFO colossalai - colossalai - INFO: initialized seed on
rank 2, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR:
1026,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO: initialized seed on
rank 0, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR:
1024,the default parallel seed is
ParallelMode.DATA.
[08/03/22 05:51:46] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/context/parallel_context.py:55
7 set_seed
INFO colossalai - colossalai - INFO: initialized seed on
rank 1, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR:
1025,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/initialize.py:117 launch
INFO colossalai - colossalai - INFO: initialized seed on
rank 3, numpy: 1024, python random: 1024,
ParallelMode.DATA: 1024, ParallelMode.TENSOR:
1027,the default parallel seed is
ParallelMode.DATA.
INFO colossalai - colossalai - INFO: Distributed
environment is initialized, data parallel size: 1,
pipeline parallel size: 1, tensor parallel size: 4
INFO colossalai - colossalai - INFO:
/home/gehao/ColossalAI/ColossalAI-Examples/language
/gpt/train_gpt.py:43 main
INFO colossalai - colossalai - INFO: Build data loader
INFO colossalai - colossalai - INFO:
/home/gehao/ColossalAI/ColossalAI-Examples/language
/gpt/train_gpt.py:52 main
INFO colossalai - colossalai - INFO: Build model
WARNING colossalai - colossalai - WARNING:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/initialize.py:304 initialize
WARNING colossalai - colossalai - WARNING:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/initialize.py:304 initialize
WARNING colossalai - colossalai - WARNING: Initializing an
non ZeRO model with optimizer class
WARNING colossalai - colossalai - WARNING: Initializing an
non ZeRO model with optimizer class
WARNING colossalai - colossalai - WARNING:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/initialize.py:304 initialize
WARNING colossalai - colossalai - WARNING: Initializing an
non ZeRO model with optimizer class
INFO colossalai - colossalai - INFO:
/home/gehao/ColossalAI/ColossalAI-Examples/language
/gpt/train_gpt.py:109 main
INFO colossalai - colossalai - INFO: Build optimizer
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/initialize.py:266 initialize
INFO colossalai - colossalai - INFO:
========== Your Config ========
{'BATCH_SIZE': 4,
'NUM_EPOCHS': 10,
'SEQ_LEN': 1024,
'TENSOR_PARALLEL': 4,
'fp16': {'mode': <AMP_TYPE.NAIVE: 'naive'>},
'gpt2_small': <function gpt2_small at
0x7fbdf6294af0>,
'loss': {'type': <class
'titans.loss.lm_loss.gpt_lmloss.GPTLMLoss'>},
'model': {'checkpoint': True},
'optimizer': {'lr': 0.00015, 'weight_decay':
0.01},
'parallel': {'pipeline': 1, 'tensor': {'mode':
'2d', 'size': 4}}}
================================
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/initialize.py:278 initialize
INFO colossalai - colossalai - INFO: cuDNN benchmark =
False, deterministic = False
WARNING colossalai - colossalai - WARNING:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/initialize.py:304 initialize
WARNING colossalai - colossalai - WARNING: Initializing an
non ZeRO model with optimizer class
WARNING colossalai - colossalai - WARNING:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/initialize.py:443 initialize
WARNING colossalai - colossalai - WARNING: No PyTorch DDP
or gradient handler is set up, please make sure you
do not need to all-reduce the gradients after a
training step.
INFO colossalai - colossalai - INFO:
/home/gehao/ColossalAI/ColossalAI-Examples/language
/gpt/train_gpt.py:121 main
INFO colossalai - colossalai - INFO: Init done, global
batch size = 4
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/_trainer.py:304 fit
INFO colossalai - colossalai - INFO: Using LossHook for
training, priority = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/_trainer.py:304 fit
INFO colossalai - colossalai - INFO: Using
LRSchedulerHook for training, priority = 1
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/_trainer.py:304 fit
INFO colossalai - colossalai - INFO: Using
LogMetricByEpochHook for training, priority = 10
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/_trainer.py:304 fit
INFO colossalai - colossalai - INFO: Using
ThroughputHook for training, priority = 10
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/_trainer.py:304 fit
INFO colossalai - colossalai - INFO: Using
LogMetricByStepHook for training, priority = 10
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/_trainer.py:304 fit
INFO colossalai - colossalai - INFO: Using
LogMemoryByEpochHook for training, priority = 10
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/_trainer.py:308 fit
INFO colossalai - colossalai - INFO: Lower value means
higher priority for calling hook function
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: Before-train: GPU:
allocated 182.06 MB, max allocated 183.18 MB,
cached: 200.0 MB, max cached: 200.0 MB
[Epoch 0 / Train]: 100%|██████████| 1/1 [00:00<00:00, 1.07it/s, loss=343, lr=2.5e-5, throughput=4.3052 sample_per_sec, 3.9928 Tflops]
[08/03/22 05:51:47] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 0 / Train]:
Loss = 343.08 | LR = 5e-05 | Throughput = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 0 / Train]:
GPU: allocated 542.69 MB, max allocated 990.49 MB,
cached: 1412.0 MB, max cached: 1412.0 MB
[Epoch 1 / Train]: 100%|██████████| 1/1 [00:00<00:00, 2.05it/s, loss=332, lr=5e-5, throughput=8.1947 sample_per_sec, 7.5999 Tflops]
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 1 / Train]:
Loss = 331.72 | LR = 7.5e-05 | Throughput = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 1 / Train]:
GPU: allocated 548.14 MB, max allocated 1228.39 MB,
cached: 1620.0 MB, max cached: 1620.0 MB
[Epoch 2 / Train]: 100%|██████████| 1/1 [00:00<00:00, 2.04it/s, loss=301, lr=7.5e-5, throughput=8.1812 sample_per_sec, 7.5874 Tflops]
[08/03/22 05:51:48] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 2 / Train]:
Loss = 300.66 | LR = 0.0001 | Throughput = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 2 / Train]:
GPU: allocated 545.19 MB, max allocated 1228.39 MB,
cached: 1620.0 MB, max cached: 1620.0 MB
[Epoch 3 / Train]: 100%|██████████| 1/1 [00:00<00:00, 2.03it/s, loss=238, lr=0.0001, throughput=8.1482 sample_per_sec, 7.5568 Tflops]
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 3 / Train]:
Loss = 238.45 | LR = 0.000125 | Throughput = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 3 / Train]:
GPU: allocated 547.97 MB, max allocated 1228.39 MB,
cached: 1620.0 MB, max cached: 1620.0 MB
[Epoch 4 / Train]: 100%|██████████| 1/1 [00:00<00:00, 2.02it/s, loss=164, lr=0.000125, throughput=8.0992 sample_per_sec, 7.5114 Tflops]
[08/03/22 05:51:49] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 4 / Train]:
Loss = 163.55 | LR = 0.00015 | Throughput = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 4 / Train]:
GPU: allocated 544.23 MB, max allocated 1228.89 MB,
cached: 1620.0 MB, max cached: 1620.0 MB
[Epoch 5 / Train]: 100%|██████████| 1/1 [00:00<00:00, 2.05it/s, loss=107, lr=0.00015, throughput=8.2047 sample_per_sec, 7.6092 Tflops]
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 5 / Train]:
Loss = 107.16 | LR = 0.00012 | Throughput = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 5 / Train]:
GPU: allocated 544.3 MB, max allocated 1228.89 MB,
cached: 1620.0 MB, max cached: 1620.0 MB
[Epoch 6 / Train]: 100%|██████████| 1/1 [00:00<00:00, 2.10it/s, loss=84.7, lr=0.00012, throughput=8.4114 sample_per_sec, 7.8009 Tflops]
[08/03/22 05:51:50] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 6 / Train]:
Loss = 84.669 | LR = 9e-05 | Throughput = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 6 / Train]:
GPU: allocated 420.46 MB, max allocated 1228.39 MB,
cached: 1620.0 MB, max cached: 1620.0 MB
[Epoch 7 / Train]: 100%|██████████| 1/1 [00:00<00:00, 2.09it/s, loss=84.8, lr=9e-5, throughput=8.3854 sample_per_sec, 7.7768 Tflops]
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 7 / Train]:
Loss = 84.75 | LR = 6e-05 | Throughput = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 7 / Train]:
GPU: allocated 420.46 MB, max allocated 1228.89 MB,
cached: 1620.0 MB, max cached: 1620.0 MB
[Epoch 8 / Train]: 100%|██████████| 1/1 [00:00<00:00, 2.03it/s, loss=84.7, lr=6e-5, throughput=8.117 sample_per_sec, 7.5279 Tflops]
[08/03/22 05:51:51] INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 8 / Train]:
Loss = 84.708 | LR = 3e-05 | Throughput = 0
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 8 / Train]:
GPU: allocated 543.87 MB, max allocated 1228.89 MB,
cached: 1620.0 MB, max cached: 1620.0 MB
[Epoch 9 / Train]: 100%|██████████| 1/1 [00:00<00:00, 2.02it/s, loss=75.6, lr=3e-5, throughput=8.0952 sample_per_sec, 7.5076 Tflops]
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/trainer/hooks/_log_hook.py:97
after_train_epoch
INFO colossalai - colossalai - INFO: [Epoch 9 / Train]:
Loss = 75.587 | LR = 0 | Throughput = 8.0952
INFO colossalai - colossalai - INFO:
/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site
-packages/colossalai/utils/memory.py:91
report_memory_usage
INFO colossalai - colossalai - INFO: [Epoch 9 / Train]:
GPU: allocated 544.37 MB, max allocated 1228.39 MB,
cached: 1620.0 MB, max cached: 1620.0 MB
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 218624) of binary: /home/gehao/anaconda3/envs/mlsys/bin/python
Traceback (most recent call last):
File "/home/gehao/anaconda3/envs/mlsys/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.10.0', 'console_scripts', 'torchrun')())
File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/gehao/anaconda3/envs/mlsys/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train_gpt.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-08-03_05:51:54
host : 28c67ac89ed8
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 218624)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 218624
=======================================================
Error: failed to run torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=localhost:60075 --rdzv_id=colossalai-default-job train_gpt.py --world_size 4 --config=gpt2_configs/gpt2_2d.py --from_torch on 127.0.0.1
Hi @Sakura-gh,
I believe the problem is caused by an early exit of one of the processes. Maybe you can add a synchronization such as torch.distributed.barrier() at the end of the code to make all processes exit at the same time.
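The suggestion above can be sketched as follows. This is a hedged illustration, not the actual train_gpt.py code; the function name is hypothetical, and the demo runs single-process on the gloo backend so it needs no GPUs.

```python
import os
import torch.distributed as dist

def shutdown_distributed():
    """Synchronize all ranks before exiting, so no process terminates
    early while others are still finishing; an early exit at the end
    of training is one common cause of SIGABRT / exitcode -6."""
    if dist.is_available() and dist.is_initialized():
        dist.barrier()                # every rank blocks here until all arrive
        dist.destroy_process_group()  # then tear down the process group cleanly

if __name__ == "__main__":
    # Single-process demo; under torchrun these would be set for you.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
    shutdown_distributed()
    print("clean exit")
```

In the real script, the call would go after trainer.fit() returns, before the process exits.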
It doesn't work
We have updated a lot since then. This issue was closed due to inactivity. Thanks. https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt
🐛 Describe the bug
I used https://github.com/hpcaitech/ColossalAI-Examples to start the GPT2 training example, but running train_gpt.py FAILED. Can anyone give some help? The code root:
ColossalAI-Examples/language/gpt
my script:
the script log:
Environment
Python: 3.9.12
Pytorch: 1.10.0
CUDA: 10.2
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89