karpathy / llm.c

LLM training in simple, raw C/CUDA
MIT License

MPI run with 8 GPUs fails #727

Open msharmavikram opened 1 month ago

msharmavikram commented 1 month ago
mpirun -np 8 ./train_gpt2cu
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 32768                                              |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA A100-SXM4-80GB                              |
| peak TFlops           | 312.0                                              |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10951] *** Process received signal ***
[149-130-218-240:10951] Signal: Aborted (6)
[149-130-218-240:10951] Signal code:  (-6)
[149-130-218-240:10951] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe612442520]
[149-130-218-240:10951] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe6124969fc]
[149-130-218-240:10951] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe612442476]
[149-130-218-240:10951] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe6124287f3]
[149-130-218-240:10951] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe61242871b]
[149-130-218-240:10951] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe612439e96]
[149-130-218-240:10951] [ 6] ./train_gpt2cu(+0x17762)[0x55f5ea98f762]
[149-130-218-240:10951] [ 7] ./train_gpt2cu(+0xf120)[0x55f5ea987120]
[149-130-218-240:10951] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe612429d90]
[149-130-218-240:10951] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe612429e40]
[149-130-218-240:10951] [10] ./train_gpt2cu(+0x13275)[0x55f5ea98b275]
[149-130-218-240:10951] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10949] *** Process received signal ***
[149-130-218-240:10949] Signal: Aborted (6)
[149-130-218-240:10949] Signal code:  (-6)
[149-130-218-240:10949] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4969642520]
[149-130-218-240:10949] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f49696969fc]
[149-130-218-240:10949] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4969642476]
[149-130-218-240:10949] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f49696287f3]
[149-130-218-240:10949] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f496962871b]
[149-130-218-240:10949] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4969639e96]
[149-130-218-240:10949] [ 6] ./train_gpt2cu(+0x17762)[0x55756a4e6762]
[149-130-218-240:10949] [ 7] ./train_gpt2cu(+0xf120)[0x55756a4de120]
[149-130-218-240:10949] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4969629d90]
[149-130-218-240:10949] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4969629e40]
[149-130-218-240:10949] [10] ./train_gpt2cu(+0x13275)[0x55756a4e2275]
[149-130-218-240:10949] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10947] *** Process received signal ***
[149-130-218-240:10947] Signal: Aborted (6)
[149-130-218-240:10947] Signal code:  (-6)
[149-130-218-240:10947] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fd0d6042520]
[149-130-218-240:10947] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fd0d60969fc]
[149-130-218-240:10947] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fd0d6042476]
[149-130-218-240:10947] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fd0d60287f3]
[149-130-218-240:10947] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fd0d602871b]
[149-130-218-240:10947] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fd0d6039e96]
[149-130-218-240:10947] [ 6] ./train_gpt2cu(+0x17762)[0x55b68d44b762]
[149-130-218-240:10947] [ 7] ./train_gpt2cu(+0xf120)[0x55b68d443120]
[149-130-218-240:10947] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd0d6029d90]
[149-130-218-240:10947] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd0d6029e40]
[149-130-218-240:10947] [10] ./train_gpt2cu(+0x13275)[0x55b68d447275]
[149-130-218-240:10947] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10948] *** Process received signal ***
[149-130-218-240:10948] Signal: Aborted (6)
[149-130-218-240:10948] Signal code:  (-6)
[149-130-218-240:10948] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fcbac242520]
[149-130-218-240:10948] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fcbac2969fc]
[149-130-218-240:10948] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fcbac242476]
[149-130-218-240:10948] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fcbac2287f3]
[149-130-218-240:10948] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fcbac22871b]
[149-130-218-240:10948] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fcbac239e96]
[149-130-218-240:10948] [ 6] ./train_gpt2cu(+0x17762)[0x55c4774ce762]
[149-130-218-240:10948] [ 7] ./train_gpt2cu(+0xf120)[0x55c4774c6120]
[149-130-218-240:10948] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fcbac229d90]
[149-130-218-240:10948] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fcbac229e40]
[149-130-218-240:10948] [10] ./train_gpt2cu(+0x13275)[0x55c4774ca275]
[149-130-218-240:10948] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10950] *** Process received signal ***
[149-130-218-240:10950] Signal: Aborted (6)
[149-130-218-240:10950] Signal code:  (-6)
[149-130-218-240:10950] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7faae5a42520]
[149-130-218-240:10950] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7faae5a969fc]
[149-130-218-240:10950] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7faae5a42476]
[149-130-218-240:10950] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7faae5a287f3]
[149-130-218-240:10950] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7faae5a2871b]
[149-130-218-240:10950] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7faae5a39e96]
[149-130-218-240:10950] [ 6] ./train_gpt2cu(+0x17762)[0x562edaec8762]
[149-130-218-240:10950] [ 7] ./train_gpt2cu(+0xf120)[0x562edaec0120]
[149-130-218-240:10950] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7faae5a29d90]
[149-130-218-240:10950] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7faae5a29e40]
[149-130-218-240:10950] [10] ./train_gpt2cu(+0x13275)[0x562edaec4275]
[149-130-218-240:10950] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10945] *** Process received signal ***
[149-130-218-240:10945] Signal: Aborted (6)
[149-130-218-240:10945] Signal code:  (-6)
[149-130-218-240:10945] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe034642520]
[149-130-218-240:10945] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe0346969fc]
[149-130-218-240:10945] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe034642476]
[149-130-218-240:10945] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe0346287f3]
[149-130-218-240:10945] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe03462871b]
[149-130-218-240:10945] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe034639e96]
[149-130-218-240:10945] [ 6] ./train_gpt2cu(+0x17762)[0x561977d15762]
[149-130-218-240:10945] [ 7] ./train_gpt2cu(+0xf120)[0x561977d0d120]
[149-130-218-240:10945] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe034629d90]
[149-130-218-240:10945] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe034629e40]
[149-130-218-240:10945] [10] ./train_gpt2cu(+0x13275)[0x561977d11275]
[149-130-218-240:10945] *** End of error message ***
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10946] *** Process received signal ***
[149-130-218-240:10946] Signal: Aborted (6)
[149-130-218-240:10946] Signal code:  (-6)
[149-130-218-240:10946] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4bd8842520]
[149-130-218-240:10946] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f4bd88969fc]
[149-130-218-240:10946] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4bd8842476]
[149-130-218-240:10946] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f4bd88287f3]
[149-130-218-240:10946] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f4bd882871b]
[149-130-218-240:10946] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4bd8839e96]
[149-130-218-240:10946] [ 6] ./train_gpt2cu(+0x17762)[0x5637c07ba762]
[149-130-218-240:10946] [ 7] ./train_gpt2cu(+0xf120)[0x5637c07b2120]
[149-130-218-240:10946] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4bd8829d90]
[149-130-218-240:10946] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4bd8829e40]
[149-130-218-240:10946] [10] ./train_gpt2cu(+0x13275)[0x5637c07b6275]
[149-130-218-240:10946] *** End of error message ***
| weight init method    | gpt2_124M_bf16.bin                                 |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10944] *** Process received signal ***
[149-130-218-240:10944] Signal: Aborted (6)
[149-130-218-240:10944] Signal code:  (-6)
[149-130-218-240:10944] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f21acc42520]
[149-130-218-240:10944] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f21acc969fc]
[149-130-218-240:10944] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f21acc42476]
[149-130-218-240:10944] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f21acc287f3]
[149-130-218-240:10944] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f21acc2871b]
[149-130-218-240:10944] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f21acc39e96]
[149-130-218-240:10944] [ 6] ./train_gpt2cu(+0x17762)[0x55d509142762]
[149-130-218-240:10944] [ 7] ./train_gpt2cu(+0xf120)[0x55d50913a120]
[149-130-218-240:10944] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f21acc29d90]
[149-130-218-240:10944] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f21acc29e40]
[149-130-218-240:10944] [10] ./train_gpt2cu(+0x13275)[0x55d50913e275]
[149-130-218-240:10944] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node 149-130-218-240 exited on signal 6 (Aborted).

MPI runs with 4 or 6 GPUs work just fine.

msharmavikram commented 1 month ago

I am running this on CUDA 12.2, without cuDNN, on Lambda Labs cloud.