huawei-noah / Efficient-AI-Backbones

Efficient AI Backbones including GhostNet, TNT and MLP, developed by Huawei Noah's Ark Lab.

Batch size in ViG-Ti #262

Closed IsmaelElsharkawi closed 6 months ago

IsmaelElsharkawi commented 6 months ago

Hi @iamhankai,

I'm really impressed by the work in Vision Graph Neural Networks. However, when trying to reproduce your work using the same code and setup (8 V100 with 32GB each), I get an out-of-memory error when I set the batch size to 1024.

Here's the command I use:

```shell
python -m torch.distributed.launch --nproc_per_node=8 train.py /ImageNet/ \
  --model vig_ti_224_gelu --sched cosine --epochs 300 --opt adamw \
  --warmup-lr 1e-6 --mixup .8 --cutmix 1.0 --model-ema --model-ema-decay 0.99996 \
  --aa rand-m9-mstd0.5-inc1 --color-jitter 0.4 --warmup-epochs 20 --opt-eps 1e-8 \
  --repeated-aug --remode pixel --reprob 0.25 --amp --lr 2e-3 --weight-decay .05 \
  --drop 0 --drop-path .1 -b 1024 --output /outputDirectory/ --num-classes 1000
```

and here's the error I run into:

```
File "/raid/ismail2/miniconda3/envs/ViGEnv/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
File "/raid/ismail2/miniconda3/envs/ViGEnv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
File "/raid/ismail2/vig_pytorch/gcn_lib/torch_vertex.py", line 176, in forward
    x = self.graph_conv(x, relative_pos)
File "/raid/ismail2/miniconda3/envs/ViGEnv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
File "/raid/ismail2/vig_pytorch/gcn_lib/torch_vertex.py", line 129, in forward
    x = super(DyGraphConv2d, self).forward(x, edge_index, y)
File "/raid/ismail2/vig_pytorch/gcn_lib/torch_vertex.py", line 106, in forward
    return self.gconv(x, edge_index, y)
File "/raid/ismail2/miniconda3/envs/ViGEnv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
File "/raid/ismail2/vig_pytorch/gcn_lib/torch_vertex.py", line 22, in forward
    x_i = batched_index_select(x, edge_index[1])
File "/raid/ismail2/vig_pytorch/gcn_lib/torch_nn.py", line 101, in batched_index_select
    feature = feature.view(batch_size, num_vertices, k, num_dims).permute(0, 3, 1, 2).contiguous()
RuntimeError: CUDA out of memory. Tried to allocate 1.08 GiB (GPU 6; 31.75 GiB total capacity; 27.88 GiB already allocated; 748.94 MiB free; 29.47 GiB reserved in total by PyTorch)
```
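For context, the failing allocation is the gathered neighbor-feature tensor of shape `(batch_size, num_vertices, k, num_dims)` materialized in `batched_index_select`. A rough back-of-the-envelope sketch of its footprint, using illustrative ViG-Ti-like shapes (196 patch vertices, 9 neighbors, 192 channels are assumptions, not values taken from the log):

```python
def tensor_bytes(shape, bytes_per_elem=4):
    """Rough memory footprint of a dense tensor (fp32 by default)."""
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem

# Hypothetical shapes: per-GPU batch, vertices (14x14 patches), k neighbors, channels.
batch_size, num_vertices, k, num_dims = 1024, 196, 9, 192
gib = tensor_bytes((batch_size, num_vertices, k, num_dims)) / 2**30
print(f"{gib:.2f} GiB")  # prints: 1.29 GiB
```

With a per-GPU batch of 1024, a single one of these intermediate tensors is already on the gibibyte scale, and one is created per graph-conv block, so the 32 GiB V100 fills up quickly.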

It works just fine with a batch size of 512, but not with 1024.

Can you please help with that? I would really appreciate it.

IsmaelElsharkawi commented 6 months ago

I found my mistake: `-b` is the per-GPU batch size, so the effective batch size is 128 × 8 (8 GPUs) = 1024. Thanks anyway :)
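For anyone hitting the same error: under `torch.distributed.launch`, each of the `--nproc_per_node` processes loads its own batches, so the global (effective) batch size is the per-GPU batch times the number of processes. A minimal sketch of the arithmetic (the function name is illustrative, not part of the repo):

```python
def effective_batch_size(per_gpu_batch, num_gpus):
    """Global batch size under data-parallel training: each of the
    num_gpus processes sees per_gpu_batch samples per optimizer step."""
    return per_gpu_batch * num_gpus

# To reproduce the paper's 1024 global batch on 8 GPUs, pass -b 128:
assert effective_batch_size(128, 8) == 1024

# Passing -b 1024 instead yields a global batch of 8192, hence the OOM:
assert effective_batch_size(1024, 8) == 8192
```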