Open · songyuc opened this issue 2 years ago
Hi, could you provide your training code for us to reproduce this bug? Besides, could you double-check your dataset settings?
I have tried our code with a simple change of model from resnet to shufflenet. It takes about 32521 MiB with BATCH_SIZE = 16384, and no OOM occurred.
Hi @songyuc, you can uninstall your current colossalai and install our latest version with:
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI
# install dependency
pip install -r requirements/requirements.txt
# install colossalai
pip install .
There was a bug in a previous release that took up extra GPU memory. With our latest version, BATCH_SIZE=16384 only takes about 10605 MiB. Hope this solves your issue.
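As a rough sanity check on the two figures reported in this thread (about 32521 MiB before the fix and about 10605 MiB after, both at BATCH_SIZE = 16384), the relative memory saving can be computed directly:

```python
# Peak GPU memory figures reported in this thread for BATCH_SIZE = 16384
before_fix_mib = 32521  # older colossalai release (extra-memory bug)
after_fix_mib = 10605   # latest version installed from source

saved_mib = before_fix_mib - after_fix_mib
saving_ratio = saved_mib / before_fix_mib

print(f"Memory saved: {saved_mib} MiB ({saving_ratio:.0%})")
# → Memory saved: 21916 MiB (67%)
```

So the fix roughly cuts peak GPU memory to a third of what the buggy release used.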
Thank you for the guide! I will try it later.
🐛 Describe the bug
models.shufflenet_v2_x1_0 can be trained with BATCH_SIZE = 16384, but it cannot be run successfully with ColossalAI. The details are below:

Environment
CUDA: 11.4