Hi @iamhankai,
I'm really impressed by the work on Vision Graph Neural Networks. However, when trying to reproduce it with the same code and setup (8 V100 GPUs with 32 GB each), I get an out-of-memory error when I set the batch size to 1024.
Here's the command I use:
```sh
python -m torch.distributed.launch --nproc_per_node=8 train.py /ImageNet/ \
    --model vig_ti_224_gelu --sched cosine --epochs 300 --opt adamw \
    --warmup-lr 1e-6 --mixup .8 --cutmix 1.0 --model-ema --model-ema-decay 0.99996 \
    --aa rand-m9-mstd0.5-inc1 --color-jitter 0.4 --warmup-epochs 20 --opt-eps 1e-8 \
    --repeated-aug --remode pixel --reprob 0.25 --amp --lr 2e-3 --weight-decay .05 \
    --drop 0 --drop-path .1 -b 1024 --output /outputDirectory/ --num-classes 1000
```
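For what it's worth, my understanding is that `-b` in timm-style training scripts is the per-process (per-GPU) batch size, so the effective global batch here may be far larger than 1024. A quick sanity check of that reading (the per-GPU interpretation is my assumption, and the variable names are mine):

```python
# Assumption: -b sets the per-process (per-GPU) batch size, as in timm's
# train.py; please correct me if this fork treats it as the global batch.
n_procs = 8           # --nproc_per_node=8
per_gpu_batch = 1024  # -b 1024
global_batch = n_procs * per_gpu_batch
print(global_batch)   # 8192 images per optimizer step, if the assumption holds
```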
Here's the error I run into:
File "/raid/ismail2/miniconda3/envs/ViGEnv/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward input = module(input) File "/raid/ismail2/miniconda3/envs/ViGEnv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "/raid/ismail2/vig_pytorch/gcn_lib/torch_vertex.py", line 176, in forward x = self.graph_conv(x, relative_pos) File "/raid/ismail2/miniconda3/envs/ViGEnv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "/raid/ismail2/vig_pytorch/gcn_lib/torch_vertex.py", line 129, in forward x = super(DyGraphConv2d, self).forward(x, edge_index, y) File "/raid/ismail2/vig_pytorch/gcn_lib/torch_vertex.py", line 106, in forward return self.gconv(x, edge_index, y) File "/raid/ismail2/miniconda3/envs/ViGEnv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "/raid/ismail2/vig_pytorch/gcn_lib/torch_vertex.py", line 22, in forward x_i = batched_index_select(x, edge_index[1]) File "/raid/ismail2/vig_pytorch/gcn_lib/torch_nn.py", line 101, in batched_index_select feature = feature.view(batch_size, num_vertices, k, num_dims).permute(0, 3, 1, 2).contiguous() RuntimeError: CUDA out of memory. Tried to allocate 1.08 GiB (GPU 6; 31.75 GiB total capacity; 27.88 GiB already allocated; 748.94 MiB free; 29.47 GiB reserved in total by PyTorch)
It works just fine with batch size 512, but not with 1024.
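As a stopgap, I believe gradient accumulation could emulate the 1024 batch while only fitting 512 in memory at a time. A minimal sketch assuming a plain PyTorch training loop (`model`, `loader`, `optimizer`, and `criterion` are placeholders, not objects from this repo's train.py, which may handle accumulation differently or not at all):

```python
# Run micro-batches of 512 but step the optimizer once per 1024 samples.
accum_steps = 2  # 2 x 512 = effective batch of 1024

optimizer.zero_grad()
for i, (images, targets) in enumerate(loader):
    outputs = model(images.cuda())
    # Scale the loss so the accumulated gradients average over the full 1024.
    loss = criterion(outputs, targets.cuda()) / accum_steps
    loss.backward()  # gradients accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```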
Could you please help with this? I would really appreciate it.