🐛 Describe the bug

I am trying to run train_with_cifar10.py from https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/data_parallel

My command:

colossalai run --nproc_per_node 2 train_with_cifar10.py --config config.py

I have 7 GPUs with 16 GB of memory each.

The error traceback is:
...
RuntimeError: CUDA out of memory. Tried to allocate 296.00 MiB (GPU 1; 15.78 GiB total capacity; 13.75 GiB already allocated; 232.19 MiB free; 13.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "train_with_cifar10.py", line 71, in <module>
    main()
  File "train_with_cifar10.py", line 62, in main
    trainer.fit(train_dataloader=train_dataloader,
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 321, in fit
    self._train_epoch(
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 181, in _train_epoch
    logits, label, loss = self.engine.execute_schedule(
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 201, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 78, in forward_backward_step
    output = self._call_engine(engine, data)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 109, in _call_engine
    return engine(inputs)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 186, in __call__
    return self.model(*args, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/amp/torch_amp/torch_amp.py", line 79, in forward
    return self.model(*args, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 465, in forward
    x = self.forward_features(x)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 454, in forward_features
    x = self.blocks(x)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 243, in forward
    x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/layers/mlp.py", line 29, in forward
    x = self.drop1(x)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
...
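Reading off the numbers in the OOM message, the failure is simple arithmetic: the 296 MiB request is larger than the ~232 MiB still free, because PyTorch's caching allocator already holds nearly the whole card. A quick sanity check (values copied from the message above):

```python
# Values copied from the OOM message above (GPU 1), in MiB.
MIB_PER_GIB = 1024
total_mib = 15.78 * MIB_PER_GIB      # total capacity
reserved_mib = 13.88 * MIB_PER_GIB   # reserved in total by PyTorch
free_mib = 232.19                    # free
request_mib = 296.00                 # size of the failed allocation

# The request exceeds the remaining free memory, so the allocation fails.
assert request_mib > free_mib

# Almost all of the card is already held by the caching allocator.
print(f"reserved: {reserved_mib / total_mib:.0%} of capacity")  # prints: reserved: 88% of capacity
```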
Environment
GPU:
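For reference, the knob the error message itself points at can be sketched as below. This is only an illustration: the 128 MiB split size is an assumed starting value, not a verified fix for this script, and the setting must take effect before torch performs its first CUDA allocation (e.g. at the very top of train_with_cifar10.py, or exported in the shell before `colossalai run`).

```python
import os

# Assumption: max_split_size_mb:128 is an illustrative value to reduce
# fragmentation when reserved memory is much larger than allocated memory.
# setdefault keeps any value already exported in the launching shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # prints: max_split_size_mb:128
```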