🐛 Describe the bug

I am trying to run train_with_cifar10.py from https://github.com/hpcaitech/ColossalAI-Examples/tree/main/image/vision_transformer/data_parallel

My command:

colossalai run --nproc_per_node 2 train_with_cifar10.py --config config.py

I have 7 GPUs with 16 GB of memory each.

The error traceback is:
...
RuntimeError: CUDA out of memory. Tried to allocate 296.00 MiB (GPU 1; 15.78 GiB total capacity; 13.75 GiB already allocated; 232.19 MiB free; 13.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "train_with_cifar10.py", line 71, in <module>
    main()
  File "train_with_cifar10.py", line 62, in main
    trainer.fit(train_dataloader=train_dataloader,
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 321, in fit
    self._train_epoch(
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/trainer/_trainer.py", line 181, in _train_epoch
    logits, label, loss = self.engine.execute_schedule(
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 201, in execute_schedule
    output, label, loss = self._schedule.forward_backward_step(self, data_iter, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/schedule/_non_pipeline_schedule.py", line 78, in forward_backward_step
    output = self._call_engine(engine, data)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/schedule/_base_schedule.py", line 109, in _call_engine
    return engine(inputs)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/engine/_base_engine.py", line 186, in __call__
    return self.model(*args, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/colossalai/amp/torch_amp/torch_amp.py", line 79, in forward
    return self.model(*args, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 465, in forward
    x = self.forward_features(x)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 454, in forward_features
    x = self.blocks(x)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/vision_transformer.py", line 243, in forward
    x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/timm/models/layers/mlp.py", line 29, in forward
    x = self.drop1(x)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/wfh/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 1252, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
...
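Reading off the numbers in the OOM message, the failure is simple arithmetic: the 296 MiB request is larger than the ~232 MiB still free, because PyTorch's caching allocator already holds nearly the whole card. A quick sanity check (values copied from the message above):

```python
# Values copied from the OOM message above (GPU 1), in MiB.
MIB_PER_GIB = 1024
total_mib = 15.78 * MIB_PER_GIB      # total capacity
reserved_mib = 13.88 * MIB_PER_GIB   # reserved in total by PyTorch
free_mib = 232.19                    # free
request_mib = 296.00                 # size of the failed allocation

# The request exceeds the remaining free memory, so the allocation fails.
assert request_mib > free_mib

# Almost all of the card is already held by the caching allocator.
print(f"reserved: {reserved_mib / total_mib:.0%} of capacity")  # prints: reserved: 88% of capacity
```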
Environment
GPU:
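For reference, the knob the error message itself points at can be sketched as below. This is only an illustration: the 128 MiB split size is an assumed starting value, not a verified fix for this script, and the setting must take effect before torch performs its first CUDA allocation (e.g. at the very top of train_with_cifar10.py, or exported in the shell before `colossalai run`).

```python
import os

# Assumption: max_split_size_mb:128 is an illustrative value to reduce
# fragmentation when reserved memory is much larger than allocated memory.
# setdefault keeps any value already exported in the launching shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])  # prints: max_split_size_mb:128
```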