hpcaitech / ColossalAI-Examples

Examples of training models with hybrid parallelism using ColossalAI
Apache License 2.0
334 stars 102 forks

Failed to run gpt2_3d example #99

Closed FJRFrancio closed 2 years ago

FJRFrancio commented 2 years ago

Dear developers,

I am trying to run the gpt2_3d example, but it fails. It looks like the model did not receive the correct batch size. I hope to get some advice.

Thanks.

Error

File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d

assert dim_size % world_size == 0, \

AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly.
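
For context, the assertion is just a divisibility check: before a 3D-parallel layer scatters a tensor across a sub-group, the dimension being split must be a multiple of that group's size, and here the per-process batch dimension has already shrunk to 1 while the group still has 2 ranks. A minimal sketch of the idea (my simplification, not ColossalAI's actual split_tensor_3d):

import torch

def split_along_dim(tensor: torch.Tensor, dim: int, world_size: int, rank: int) -> torch.Tensor:
    """Return this rank's shard of `tensor` along `dim` (simplified sketch)."""
    dim_size = tensor.size(dim)
    # The check that fails in the log: a size-1 dimension cannot be shared by 2 ranks.
    assert dim_size % world_size == 0, (
        f"The dimension {dim} to split, size ({dim_size}) is not a multiple of "
        f"world size ({world_size}), cannot split tensor evenly.")
    return tensor.chunk(world_size, dim=dim)[rank]

# Reproducing the reported condition: batch dimension already reduced to 1, group size 2.
x = torch.zeros(1, 1024, dtype=torch.long)
split_along_dim(x, dim=0, world_size=2, rank=0)  # raises AssertionError, as in the log

In general the global batch size has to stay divisible by every batch-dimension split the chosen parallel mode performs; per the reply below, pulling the latest code also resolves this for the stock config.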

Command

torchrun --standalone --nproc_per_node=8 train_gpt.py --config=gpt2_configs/gpt2_3d.py --from_torch

Environment

Error details

$ torchrun --standalone --nproc_per_node=8 ./train_gpt.py --config=./gpt2_configs/gpt2_3d.py  --from_torch
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/apex-0.1-py3.8-linux-x86_64.egg/apex/pyprof/__init__.py:5: FutureWarning: pyprof will be removed by the end of June, 2022
  warnings.warn("pyprof will be removed by the end of June, 2022", FutureWarning)
  [the same pyprof FutureWarning is printed once by each of the other 7 worker processes]
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
                    INFO     colossalai - colossalai - INFO: process rank 2 is bound to device 2
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 2, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1026,the default parallel seed is
                             ParallelMode.DATA.
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
                    INFO     colossalai - colossalai - INFO: process rank 3 is bound to device 3
                    INFO     colossalai - colossalai - INFO: process rank 7 is bound to device 7
                    INFO     colossalai - colossalai - INFO: process rank 1 is bound to device 1
[05/01/22 10:53:55] INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:509 set_device
                    INFO     colossalai - colossalai - INFO: process rank 4 is bound to device 4
                    INFO     colossalai - colossalai - INFO: process rank 5 is bound to device 5
                    INFO     colossalai - colossalai - INFO: process rank 0 is bound to device 0
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: process rank 6 is bound to device 6
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 3, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1027,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 7, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1031,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 1, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1025,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/context/parallel_context.py:545 set_seed
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 4, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1028,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 0, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1024,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 5, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1029,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO: initialized seed on rank 6, numpy: 1024, python random: 1024,
                             ParallelMode.DATA: 1024, ParallelMode.TENSOR: 1030,the default parallel seed is
                             ParallelMode.DATA.
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:109 launch
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, data parallel size: 1, pipeline parallel size: 1, tensor parallel size: 8
                    INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:45 main
                    INFO     colossalai - colossalai - INFO: Build data loader
                    INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:54 main
                    INFO     colossalai - colossalai - INFO: Build model
[05/01/22 10:54:01] INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:84 main
                    INFO     colossalai - colossalai - INFO: Build optimizer
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:240 initialize
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    INFO     colossalai - colossalai - INFO:
                             ========== Your Config ========
                             {'BATCH_SIZE': 4,
                              'NUM_EPOCHS': 60,
                              'SEQ_LEN': 1024,
                              'TENSOR_PARALLEL': 8,
                              'fp16': {'mode': <AMP_TYPE.NAIVE: 'naive'>},
                              'gpt2_small': <function gpt2_small at 0x7f32a53354c0>,
                              'loss': {'type': <class 'model_zoo.gpt.gpt.GPTLMLoss'>},
                              'model': {'checkpoint': True},
                              'optimizer': {'lr': 0.00015, 'weight_decay': 0.01},
                              'parallel': {'pipeline': 1, 'tensor': {'mode': '3d', 'size': 8}}}
                             ================================

                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:252 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:01] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    INFO     colossalai - colossalai - INFO: cuDNN benchmark = True, deterministic = False
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
                    WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:02] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:281 initialize
                    WARNING  colossalai - colossalai - WARNING: Initializing an non ZeRO model with optimizer class
[05/01/22 10:54:02] WARNING  colossalai - colossalai - WARNING:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/initialize.py:409 initialize
                    WARNING  colossalai - colossalai - WARNING: No PyTorch DDP or gradient handler is set up, please make
                             sure you do not need to all-reduce the gradients after a training step.
                    INFO     colossalai - colossalai - INFO: ./train_gpt_0.1.2.py:98 main
                    INFO     colossalai - colossalai - INFO: Init done, global batch size = 4
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LossHook for training, priority = 0
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LRSchedulerHook for training, priority = 1
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LogMetricByEpochHook for training, priority = 10
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using ThroughputHook for training, priority = 10
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LogMetricByStepHook for training, priority = 10
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:315 fit
                    INFO     colossalai - colossalai - INFO: Using LogMemoryByEpochHook for training, priority = 10
                    INFO     colossalai - colossalai - INFO:
                             /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py:319 fit
                    INFO     colossalai - colossalai - INFO: Lower value means higher priority for calling hook function
                    INFO     colossalai - colossalai - INFO: /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/utils/memory_utils/memory_monitor.py:63 report_memory_usage
                    INFO     colossalai - colossalai - INFO: Before-train: GPU: allocated 91.75 MB, max allocated 92.3 MB,
                             cached: 96.0 MB, max cached: 96.0 MB
[Epoch 0 / Train]:   0%|                                                                             | 0/5 [00:00<?, ?it/s]
[Interleaved tracebacks from all eight ranks follow; every rank fails with the same AssertionError raised by split_tensor_3d. One representative traceback:]
Traceback (most recent call last):
  File "./train_gpt_0.1.2.py", line 132, in <module>
    main()
  File "./train_gpt_0.1.2.py", line 120, in main
    trainer.fit(
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 334, in fit
    self._train_epoch(
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 185, in _train_epoch
    logits, label, loss = self.engine.execute_schedule(
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 198, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 49, in forward_backward_step
    output = self._call_engine(engine, data)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 105, in _call_engine
    return engine(**inputs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 183, in __call__
    return self.model(*args, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/amp/naive_amp/naive_amp.py", line 145, in forward
    out = self.model(*args, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 291, in forward
    x = self.embed(input_ids)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/model_zoo/gpt/gpt.py", line 50, in forward
    x = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
    result = forward_call(*input, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/colossalai_layer/_utils.py", line 38, in forward
    return self._forward_func(*args)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/layers.py", line 976, in forward
    input_ = split_tensor_3d(input_, 0, self.weight_parallel_mode)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/colossalai/nn/layer/parallel_3d/_operation.py", line 281, in split_tensor_3d
    assert dim_size % world_size == 0, \
AssertionError: The dimension 0 to split, size (1) is not a multiple of world size (2), cannot split tensor evenly
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: driver shutting down
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from query at /opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/cuda/CUDAEvent.h:95 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f0bb282b1bd in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x11a (0x7f0bf06ba6ea in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7f0bf06bccd0 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #3: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x145 (0x7f0bf06bdf65 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #4: <unknown function> + 0xc9039 (0x7f0c48562039 in /home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #5: <unknown function> + 0x7ea5 (0x7f0c6ecd8ea5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f0c6ea019fd in /lib64/libc.so.6)

[the same c10::CUDAError abort ("CUDA error: driver shutting down") and stack frames are printed by three more ranks]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 184844) of binary: /home/asc/.conda/envs/nlp/bin/python
Traceback (most recent call last):
  File "/home/asc/.conda/envs/nlp/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/asc/.conda/envs/nlp/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./train_gpt_0.1.2.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 184845)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 184845
[2]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 184846)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 184846
[3]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 184847)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 184847
[4]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 4 (local_rank: 4)
  exitcode  : -6 (pid: 184848)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 184848
[5]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 184849)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 184850)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 184851)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-05-01_10:54:05
  host      : localhost.localdomain
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 184844)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
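
For reference, the configuration that produced this run, reconstructed from the "========== Your Config ========" dump above (the import paths are my guess based on the class names shown; the actual gpt2_3d.py may differ):

from colossalai.amp import AMP_TYPE          # assumed location of AMP_TYPE
from model_zoo.gpt.gpt import GPTLMLoss      # path taken from the dump
from model_zoo.gpt.gpt import gpt2_small     # assumed; the dump only shows the function name

BATCH_SIZE = 4
NUM_EPOCHS = 60
SEQ_LEN = 1024
TENSOR_PARALLEL = 8

fp16 = dict(mode=AMP_TYPE.NAIVE)
loss = dict(type=GPTLMLoss)
model = dict(checkpoint=True)
optimizer = dict(lr=0.00015, weight_decay=0.01)
parallel = dict(pipeline=1, tensor=dict(mode='3d', size=8))
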
YuliangLiu0306 commented 2 years ago

Hi, I just ran the same test with the latest main branch code, and everything looks fine. I think the problem can be resolved simply by pulling the latest code.

gogogwwb commented 2 years ago

Hi, I just ran the same test with the latest main branch code, and everything looks fine. I think the problem can be resolved simply by pulling the latest code.

When I execute python setup.py install, this error occurs. [screenshot]

kurisusnowdeng commented 2 years ago

Hi, I just ran the same test with the latest main branch code, and everything looks fine. I think the problem can be resolved simply by pulling the latest code.

When I execute python setup.py install, this error occurs. [screenshot]

Hi, could you please share more of the installation output so that we can locate the problem?

FJRFrancio commented 2 years ago

Hi, I just ran the same test with the latest main branch code, and everything looks fine. I think the problem can be resolved simply by pulling the latest code.

I tried the newest code; it works well with the default config files. But something goes wrong when I change SEQ_LEN (in gpt2_2d.py) from 1024 to 2048. The error seems to happen at the all-reduce step.

[screenshot of the error]

gogogwwb commented 2 years ago

I am running with Docker; the Dockerfile is the same as this one: Dockerfile

FJRFrancio commented 2 years ago

Hi, I just ran the same test with the latest main branch code, and everything looks fine. I think the problem can be resolved simply by pulling the latest code.

I tried the newest code; it works well with the default config files. But something goes wrong when I change SEQ_LEN (in gpt2_2d.py) from 1024 to 2048. The error seems to happen at the all-reduce step.

[screenshot of the error]

The code does not work properly when SEQ_LEN > 1024.

FJRFrancio commented 2 years ago

Hi, I just ran the same test with the latest main branch code, and everything looks fine. I think the problem can be resolved simply by pulling the latest code.

I tried the newest code; it works well with the default config files. But something goes wrong when I change SEQ_LEN (in gpt2_2d.py) from 1024 to 2048. The error seems to happen at the all-reduce step. [screenshot of the error]

The code does not work properly when SEQ_LEN > 1024.

The model has a default parameter max_position_embeddings=1024, and I forgot to change it. Now everything works fine.
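
For anyone hitting the same thing: the learned position-embedding table defaults to 1024 entries, so any sequence longer than that indexes past the end of the table. A tiny sketch of the failure mode and the fix (the hidden size and the exact way max_position_embeddings is wired through the example are assumptions; the parameter name comes from the comment above):

import torch
import torch.nn as nn

SEQ_LEN = 2048
HIDDEN = 768  # hidden size is arbitrary for this illustration

# Failure mode: a 1024-entry position-embedding table indexed with positions 0..2047.
pos_emb = nn.Embedding(1024, HIDDEN)
position_ids = torch.arange(SEQ_LEN)
# pos_emb(position_ids)  # IndexError: index out of range in self (the single-GPU symptom;
#                        # under tensor parallelism the failure can surface elsewhere, e.g. at an all-reduce)

# Fix: make the table at least as long as the sequence, i.e. keep max_position_embeddings >= SEQ_LEN.
pos_emb = nn.Embedding(SEQ_LEN, HIDDEN)
out = pos_emb(position_ids)
print(out.shape)  # torch.Size([2048, 768])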