karpathy / build-nanogpt

Video+code lecture on building nanoGPT from scratch

Executing with 1 GPU raises "OutOfMemoryError", executing with 2 GPUs raises "RuntimeError: CUDA error: invalid device ordinal" #41

Closed · nmerkle closed this issue 4 months ago

nmerkle commented 4 months ago

Hi,

I have tried to implement GPT-2 from scratch following the video tutorial. However, when I try to execute the code on 2 GPUs with:

torchrun --standalone --nproc_per_node=2 GPT.py
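For context, the device setup in my GPT.py follows the lecture and looks roughly like this (a sketch from memory; the exact lines in my script may differ slightly):

import os
import torch
from torch.distributed import init_process_group

ddp = int(os.environ.get("RANK", -1)) != -1   # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE
if ddp:
    init_process_group(backend="nccl")
    ddp_rank = int(os.environ["RANK"])
    ddp_local_rank = int(os.environ["LOCAL_RANK"])
    ddp_world_size = int(os.environ["WORLD_SIZE"])
    device = f"cuda:{ddp_local_rank}"
    torch.cuda.set_device(device)             # this is the call that fails on rank 1
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device: {device}")
print(f"Device Count: {torch.cuda.device_count()}")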

My program fails with the following error message:

Device: cuda:1

Device Count: 1

[rank1]: Traceback (most recent call last):

[rank1]:   File "/my_transformer/GPT.py", line 238, in <module>

[rank1]:     torch.cuda.set_device(device)

[rank1]:   File "/.local/lib/python3.9/site-packages/torch/cuda/__init__.py", line 399, in set_device

[rank1]:     torch._C._cuda_setDevice(device)

[rank1]: RuntimeError: CUDA error: invalid device ordinal

[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

[rank1]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Device: cuda:0

Device Count: 1

Master-Process: True

Total desired batch size: 524288

Calculated gradient accumulation steps: 16.

loaded 338025 tokens.

W0626 10:12:11.821799 22703772874560 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 2031187 closing signal SIGTERM

E0626 10:12:11.853472 22703772874560 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 2031188) of binary: ~/my_transformer/.venv/bin/python3.9

Traceback (most recent call last):

  File "~/my_transformer/.venv/bin/torchrun", line 8, in <module>

    sys.exit(main())

  File "/.local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper

    return f(*args, **kwargs)

  File "~/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 879, in main

    run(args)

  File "/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run

    elastic_launch(

  File "~/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__

    return launch_agent(self._config, self._entrypoint, list(args))

  File "~/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent

    raise ChildFailedError(

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================

GPT.py FAILED

------------------------------------------------------------

Failures:

  <NO_OTHER_FAILURES>

------------------------------------------------------------

Root Cause (first observed failure):

[0]:

  time      : 2024-06-26_10:12:11

  host      : haicn01.localdomain

  rank      : 1 (local_rank: 1)

  exitcode  : 1 (pid: 2031188)

  error_file: <N/A>

  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

If I execute with just 1 GPU, I get another error:

[rank0]: OutOfMemoryError: CUDA out of memory. Tried to allocate 786.00 MiB. GPU

Any ideas what the reason could be? I followed the video tutorial exactly and also checked the code in the repository. I should have enough memory; nvidia-smi gives the following output:

Wed Jun 26 10:51:56 2024      

+-----------------------------------------------------------------------------------------+

| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |

|-----------------------------------------+------------------------+----------------------+

| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |

| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |

|                                         |                        |               MIG M. |

|=========================================+========================+======================|

|   0  NVIDIA A100-SXM4-40GB          On  |   00000000:CA:00.0 Off |                   On |

| N/A   55C    P0            165W /  400W |     612MiB /  40960MiB |     N/A      Default |

|                                         |                        |              Enabled |

+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+

| MIG devices:                                                                            |

+------------------+----------------------------------+-----------+-----------------------+

| GPU  GI  CI  MIG |                     Memory-Usage |        Vol|      Shared           |

|      ID  ID  Dev |                       BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |

|                  |                                  |        ECC|                       |

|==================+==================================+===========+=======================|

|  0    8   0   0  |              12MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |

|                  |                 0MiB /  8191MiB  |           |                       |

+------------------+----------------------------------+-----------+-----------------------+

|  0    9   0   1  |              12MiB /  4864MiB    | 14      0 |  1   0    0    0    0 |

|                  |                 0MiB /  8191MiB  |           |                       |

+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+

| Processes:                                                                              |

|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |

|        ID   ID                                                               Usage      |

|=========================================================================================|

|  No running processes found                                                             |

+-----------------------------------------------------------------------------------------+

Thanks in advance.

andytwigg commented 4 months ago

IIUC, @karpathy used an A100-80GB, but you seem to have a 40GB card. Have you tried reducing the batch size B to, say, 16 or 32? https://github.com/karpathy/build-nanogpt/blob/master/train_gpt2.py#L325
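For reference, the relevant knobs tie together roughly like this (a sketch based on train_gpt2.py; exact variable names may differ):

# Sketch of the batch-size arithmetic around train_gpt2.py#L325 (names approximate).
total_batch_size = 524288   # desired tokens per optimizer step (2**19)
B = 16                      # micro-batch size; this is what to lower on a smaller GPU
T = 1024                    # sequence length
ddp_world_size = 1          # number of DDP processes (GPUs)

assert total_batch_size % (B * T * ddp_world_size) == 0
grad_accum_steps = total_batch_size // (B * T * ddp_world_size)
print(f"calculated gradient accumulation steps: {grad_accum_steps}")
# Lowering B only raises grad_accum_steps; the effective batch of 524288 tokens per step is unchanged.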

nmerkle commented 4 months ago

@andytwigg Thank you for your answer. I went back to a single 40 GB GPU and decreased the batch size to 2, and then I got another error:

[rank0]:     x, y = data_loader.next_batch()
[rank0]:   File "~/my_transformer/GPT.py", line 215, in next_batch
[rank0]:     x = (buf[:-1]).view(B,T)
[rank0]: RuntimeError: shape '[2, 1024]' is invalid for input of size 104

I think the problem now is that in the next_batch() function (see https://github.com/karpathy/build-nanogpt/blob/master/train_gpt2.py#L243), the reshape fails once the end of the token buffer is reached: the slice needs B*T+1 = 2*1024+1 = 2049 tokens, but at that point only 104 remain. The code runs for a while and then raises the error above:

buf = self.tokens[self.current_position : self.current_position+B*T+1]
x = (buf[:-1]).view(B, T)

Any idea how to address this? I was thinking of checking with the modulo (%) operator whether the remaining tokens are divisible by (B*T+1), but that feels like a quick and dirty solution. Any other suggestions? I am wondering why it works in the tutorial; I guess I must have missed something.
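For comparison, the loader in the repo's train_gpt2.py appears to avoid this by wrapping the read position around before a partial batch can occur, roughly like this (a sketch, not my code; see the linked line for the exact version):

def next_batch(self):
    B, T = self.B, self.T
    # grab B*T+1 tokens so that inputs and the shifted targets both have B*T elements
    buf = self.tokens[self.current_position : self.current_position + B * T + 1]
    x = (buf[:-1]).view(B, T)   # inputs
    y = (buf[1:]).view(B, T)    # targets, shifted by one token
    # advance the position; each DDP process strides over the data
    self.current_position += B * T * self.num_processes
    # if the next batch would run past the end of the tokens, wrap back to this rank's start
    if self.current_position + (B * T * self.num_processes + 1) > len(self.tokens):
        self.current_position = B * T * self.process_rank
    return x, y

That way the slice always contains exactly B*T+1 tokens and the view never sees a short remainder like 104.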