Memory errors?

mattloose commented 2 months ago

Hi,

I'm trying to run Herro using an A100 card. Not clear why it is running out of memory. I was running with -b 128 so I've dropped that down.

I'm running with singularity. Any suggestions?

Thanks

[00:01:26] Processing 1/? batch _ [>---------------------------------------] 93/90774
[W manager.cpp:340] Warning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason. To debug try disable codegen fallback path via setting the env variable export PYTORCH_NVFUSER_DISABLE=fallback (function runCudaFusionGroup)
thread '' panicked at src/inference.rs:172:64:
called Result::unwrap() on an Err value: Torch("The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/torch/model.py", line 36, in fallback_cuda_fuser
    x0 = torch.permute(x, [0, 3, 1, 2])
    qn = self.qn
    sliced_sequences_concatenated = (qn).forward(x0, target_positions, lengths, )
    ~~~ <--- HERE
    fc2 = self.fc2
    _1 = (fc2).forward(sliced_sequences_concatenated, )
  File "code/torch/transformer.py", line 16, in forward
    _0 = torch.torch.nn.utils.rnn.pad_sequence
    context_read = self.context_read
    x0 = (context_read).forward(x, )
    ~~~~~ <--- HERE
    context_pos = self.context_pos
    x1 = (context_pos).forward(x0, )
  File "code/torch/torch/nn/modules/container.py", line 15, in forward
    _2 = getattr(self, "2")
    input0 = (_0).forward(input, )
    input1 = (_1).forward(input0, )
    ~~~ <--- HERE
    return (_2).forward(input1, )
    def len(self: torch.torch.nn.modules.container.Sequential) -> int:
  File "code/torch/torch/nn/modules/batchnorm.py", line 35, in forward
    weight = self.weight
    bias = self.bias
    _3 = _0(input, running_mean, running_var, weight, bias, bn_training, 0.10000000000000001, 1.0000000000000001e-05, )
    ~~ <--- HERE
    return _3
    def _check_input_dim(self: torch.torch.nn.modules.batchnorm.BatchNorm2d,
  File "code/torch/torch/nn/functional.py", line 52, in batch_norm
    else:
    pass
    _6 = torch.batch_norm(input, weight, bias, running_mean, running_var, training, momentum, eps, True)
    ~~~~ <--- HERE
    return _6
def relu(input: Tensor,

Traceback of TorchScript, original code (most recent call last):
  File "/raid/scratch/stanojevicd/projects/haec-BigBird/model.py", line 157, in fallback_cuda_fuser
    sliced_sequences_concatenated = torch.cat(encoded)'''
    x = x.permute((0, 3, 1, 2))
    sliced_sequences_concatenated = self.qn(x, target_positions, lengths)
    ~~~ <--- HERE

    # list of tensors of shape (selected_token_number, 1) -> (selected_token_number)
  File "/raid/scratch/stanojevicd/projects/haec-BigBird/transformer.py", line 36, in forward
    def forward(self, x: Tensor, target_positions: List[Tensor],
    lengths: Tensor) -> Tensor:
    x = self.context_read(x) # [B, I, L, R] -> [B, 128, L, R]
    ~~~~~ <--- HERE
    x = self.context_pos(x) # [B, 128, L, R] -> [B, 256, L, 1]
    x = x.squeeze(-1).transpose(1, 2) # [B, L, 256]
  File "/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/modules/container.py", line 215, in forward
    def forward(self, input):
    for module in self:
    input = module(input)
    ~~ <--- HERE
    return input
  File "/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    used for normalization (i.e. in eval mode when buffers are not None).
    """
    return F.batch_norm(
    ~~~~ <--- HERE
    input,
    # If buffers are not to be tracked, ensure that they won't be updated
  File "/home/stanojevicd/miniforge3/envs/haec/lib/python3.11/site-packages/torch/nn/functional.py", line 2478, in batch_norm
    _verify_batch_size(input.size())

    return torch.batch_norm(
    ~~~~ <--- HERE
    input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
    )
RuntimeError: CUDA out of memory. Tried to allocate 4.64 GiB (GPU 0; 9.50 GiB total capacity; 4.74 GiB already allocated; 1.93 GiB free; 7.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

")
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Aborted (core dumped)

mattloose commented 2 months ago

To follow up, lowering the batch size "fixes" this to some degree. But I've had to lower the batch size to 6! And I'm on an A100...

Recommended batch size is 64 for GPUs with 40 GB (possibly also for 32 GB) of VRAM and 128 for GPUs with 80 GB of VRAM.
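
The batch-size guidance quoted above can be written as a small lookup. This is only a sketch: `recommended_batch_size` is a hypothetical helper, not part of Herro, and it assumes that anything between the stated tiers falls back to the lower tier. The guidance says nothing about cards below 32 GB, so the function returns `None` there rather than guessing.

```python
from typing import Optional


def recommended_batch_size(vram_gb: float) -> Optional[int]:
    """Map visible GPU memory (GB) to the batch size from the quoted guidance.

    Hypothetical helper for illustration; the thresholds come from the
    sentence quoted in the thread, the in-between behaviour is an assumption.
    """
    if vram_gb >= 80:
        return 128
    if vram_gb >= 32:
        # 40 GB is stated; 32 GB is only "possibly" covered by the guidance.
        return 64
    # No recommendation is given for smaller amounts of VRAM.
    return None
```

With only about 9.5 GiB visible to CUDA, as in the error above, the lookup returns `None`, which is consistent with having to drop the batch size far below 64.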

dominikstanojevic commented 1 month ago

Hi,

what is the VRAM on your A100 card? If I'm not mistaken, CUDA is reporting only 9.5 GiB for some reason.

Tried to allocate 4.64 GiB (GPU 0; 9.50 GiB total capacity; 4.74 GiB already allocated; 1.93 GiB free; 7.06 GiB reserved in total by PyTorch)
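
The check made here by eye (reading the total capacity out of the error) can be sketched in Python. The helper names `parse_oom_message` and `capacity_looks_limited` are hypothetical, the pattern only covers the GiB-formatted message shown in this thread, and the "less than half of what you expected" threshold is an arbitrary assumption.

```python
import re
from typing import Dict, Optional

# Matches the memory figures in the PyTorch OOM line quoted in this thread.
# Assumption: all figures are printed in GiB, as they are here.
_OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+) GiB "
    r"\(GPU (?P<gpu>\d+); "
    r"(?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r"(?P<free>[\d.]+) GiB free; "
    r"(?P<reserved>[\d.]+) GiB reserved in total by PyTorch\)"
)


def parse_oom_message(message: str) -> Optional[Dict[str, float]]:
    """Return the GiB figures from an OOM message, or None if it does not match."""
    match = _OOM_PATTERN.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    gpu_index = int(fields.pop("gpu"))
    result = {name: float(value) for name, value in fields.items()}
    result["gpu"] = gpu_index
    return result


def capacity_looks_limited(total_gib: float, expected_gib: float) -> bool:
    """Flag a device that exposes less than half the memory you expected.

    The 0.5 factor is an illustrative threshold, not a PyTorch or Herro rule.
    """
    return total_gib < 0.5 * expected_gib


message = (
    "Tried to allocate 4.64 GiB (GPU 0; 9.50 GiB total capacity; "
    "4.74 GiB already allocated; 1.93 GiB free; "
    "7.06 GiB reserved in total by PyTorch)"
)
info = parse_oom_message(message)
```

For the message above, `info["total"]` is 9.5, far below what an A100 with 40 GB or 80 GB should expose. On a node with a working PyTorch CUDA install, `torch.cuda.get_device_properties(0).total_memory` (in bytes) gives the same figure without waiting for a crash.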

mattloose commented 1 month ago

Sorry - I missed this.

You are right - the card on that node was limited due to the marvellous way our HPC is configured...

lbcb-sci / herro

Memory errors? #25