Open msharmavikram opened 1 month ago
mpirun -np 8 ./train_gpt2cu
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/tinyshakespeare/tiny_shakespeare_train.bin |
| val data pattern      | dev/data/tinyshakespeare/tiny_shakespeare_val.bin  |
| output log dir        | NULL                                               |
| checkpoint_every      | 0                                                  |
| resume                | 0                                                  |
| micro batch size B    | 4                                                  |
| sequence length T     | 1024                                               |
| total batch size      | 32768                                              |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 3.000000e-04                                       |
| warmup iterations     | 0                                                  |
| final LR fraction     | 1.000000e+00                                       |
| weight decay          | 0.000000e+00                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 20                                                 |
| val_max_steps         | 20                                                 |
| sample_every          | 20                                                 |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 1                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA A100-SXM4-80GB                              |
| peak TFlops           | 312.0                                              |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed.
[149-130-218-240:10951] *** Process received signal *** [149-130-218-240:10951] Signal: Aborted (6) [149-130-218-240:10951] Signal code: (-6) [149-130-218-240:10951] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe612442520] [149-130-218-240:10951] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe6124969fc] [149-130-218-240:10951] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe612442476] [149-130-218-240:10951] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe6124287f3] [149-130-218-240:10951] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe61242871b] [149-130-218-240:10951] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe612439e96] [149-130-218-240:10951] [ 6] ./train_gpt2cu(+0x17762)[0x55f5ea98f762] [149-130-218-240:10951] [ 7] ./train_gpt2cu(+0xf120)[0x55f5ea987120] [149-130-218-240:10951] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe612429d90] [149-130-218-240:10951] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe612429e40] [149-130-218-240:10951] [10] ./train_gpt2cu(+0x13275)[0x55f5ea98b275] [149-130-218-240:10951] *** End of error message *** train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed. 
[149-130-218-240:10949] *** Process received signal *** [149-130-218-240:10949] Signal: Aborted (6) [149-130-218-240:10949] Signal code: (-6) [149-130-218-240:10949] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4969642520] [149-130-218-240:10949] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f49696969fc] [149-130-218-240:10949] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4969642476] [149-130-218-240:10949] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f49696287f3] [149-130-218-240:10949] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f496962871b] [149-130-218-240:10949] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4969639e96] [149-130-218-240:10949] [ 6] ./train_gpt2cu(+0x17762)[0x55756a4e6762] [149-130-218-240:10949] [ 7] ./train_gpt2cu(+0xf120)[0x55756a4de120] [149-130-218-240:10949] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4969629d90] [149-130-218-240:10949] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4969629e40] [149-130-218-240:10949] [10] ./train_gpt2cu(+0x13275)[0x55756a4e2275] [149-130-218-240:10949] *** End of error message *** train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed. 
[149-130-218-240:10947] *** Process received signal *** [149-130-218-240:10947] Signal: Aborted (6) [149-130-218-240:10947] Signal code: (-6) [149-130-218-240:10947] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fd0d6042520] [149-130-218-240:10947] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fd0d60969fc] [149-130-218-240:10947] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fd0d6042476] [149-130-218-240:10947] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fd0d60287f3] [149-130-218-240:10947] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fd0d602871b] [149-130-218-240:10947] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fd0d6039e96] [149-130-218-240:10947] [ 6] ./train_gpt2cu(+0x17762)[0x55b68d44b762] [149-130-218-240:10947] [ 7] ./train_gpt2cu(+0xf120)[0x55b68d443120] [149-130-218-240:10947] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fd0d6029d90] [149-130-218-240:10947] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fd0d6029e40] [149-130-218-240:10947] [10] ./train_gpt2cu(+0x13275)[0x55b68d447275] [149-130-218-240:10947] *** End of error message *** train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed. 
[149-130-218-240:10948] *** Process received signal *** [149-130-218-240:10948] Signal: Aborted (6) [149-130-218-240:10948] Signal code: (-6) [149-130-218-240:10948] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fcbac242520] [149-130-218-240:10948] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fcbac2969fc] [149-130-218-240:10948] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fcbac242476] [149-130-218-240:10948] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fcbac2287f3] [149-130-218-240:10948] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fcbac22871b] [149-130-218-240:10948] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fcbac239e96] [149-130-218-240:10948] [ 6] ./train_gpt2cu(+0x17762)[0x55c4774ce762] [149-130-218-240:10948] [ 7] ./train_gpt2cu(+0xf120)[0x55c4774c6120] [149-130-218-240:10948] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fcbac229d90] [149-130-218-240:10948] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fcbac229e40] [149-130-218-240:10948] [10] ./train_gpt2cu(+0x13275)[0x55c4774ca275] [149-130-218-240:10948] *** End of error message *** train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed. 
[149-130-218-240:10950] *** Process received signal *** [149-130-218-240:10950] Signal: Aborted (6) [149-130-218-240:10950] Signal code: (-6) [149-130-218-240:10950] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7faae5a42520] [149-130-218-240:10950] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7faae5a969fc] [149-130-218-240:10950] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7faae5a42476] [149-130-218-240:10950] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7faae5a287f3] [149-130-218-240:10950] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7faae5a2871b] [149-130-218-240:10950] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7faae5a39e96] [149-130-218-240:10950] [ 6] ./train_gpt2cu(+0x17762)[0x562edaec8762] [149-130-218-240:10950] [ 7] ./train_gpt2cu(+0xf120)[0x562edaec0120] [149-130-218-240:10950] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7faae5a29d90] [149-130-218-240:10950] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7faae5a29e40] [149-130-218-240:10950] [10] ./train_gpt2cu(+0x13275)[0x562edaec4275] [149-130-218-240:10950] *** End of error message *** train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed. 
[149-130-218-240:10945] *** Process received signal *** [149-130-218-240:10945] Signal: Aborted (6) [149-130-218-240:10945] Signal code: (-6) [149-130-218-240:10945] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fe034642520] [149-130-218-240:10945] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fe0346969fc] [149-130-218-240:10945] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fe034642476] [149-130-218-240:10945] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fe0346287f3] [149-130-218-240:10945] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7fe03462871b] [149-130-218-240:10945] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7fe034639e96] [149-130-218-240:10945] [ 6] ./train_gpt2cu(+0x17762)[0x561977d15762] [149-130-218-240:10945] [ 7] ./train_gpt2cu(+0xf120)[0x561977d0d120] [149-130-218-240:10945] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7fe034629d90] [149-130-218-240:10945] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7fe034629e40] [149-130-218-240:10945] [10] ./train_gpt2cu(+0x13275)[0x561977d11275] [149-130-218-240:10945] *** End of error message *** train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed. 
[149-130-218-240:10946] *** Process received signal *** [149-130-218-240:10946] Signal: Aborted (6) [149-130-218-240:10946] Signal code: (-6) [149-130-218-240:10946] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f4bd8842520] [149-130-218-240:10946] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f4bd88969fc] [149-130-218-240:10946] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f4bd8842476] [149-130-218-240:10946] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f4bd88287f3] [149-130-218-240:10946] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f4bd882871b] [149-130-218-240:10946] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f4bd8839e96] [149-130-218-240:10946] [ 6] ./train_gpt2cu(+0x17762)[0x5637c07ba762] [149-130-218-240:10946] [ 7] ./train_gpt2cu(+0xf120)[0x5637c07b2120] [149-130-218-240:10946] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f4bd8829d90] [149-130-218-240:10946] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f4bd8829e40] [149-130-218-240:10946] [10] ./train_gpt2cu(+0x13275)[0x5637c07b6275] [149-130-218-240:10946] *** End of error message *** | weight init method | gpt2_124M_bf16.bin | | max_sequence_length T | 1024 | | vocab_size V | 50257 | | padded_vocab_size Vp | 50304 | | num_layers L | 12 | | num_heads NH | 12 | | channels C | 768 | | num_parameters | 124475904 | +-----------------------+----------------------------------------------------+ train_gpt2cu: llmc/dataloader.h:186: void dataloader_init(DataLoader*, const char*, size_t, size_t, int, int, int): Assertion `shard_ntok >= (int64_t) (num_processes * B * T + 1)' failed. 
[149-130-218-240:10944] *** Process received signal *** [149-130-218-240:10944] Signal: Aborted (6) [149-130-218-240:10944] Signal code: (-6) [149-130-218-240:10944] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f21acc42520] [149-130-218-240:10944] [ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f21acc969fc] [149-130-218-240:10944] [ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f21acc42476] [149-130-218-240:10944] [ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f21acc287f3] [149-130-218-240:10944] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2871b)[0x7f21acc2871b] [149-130-218-240:10944] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f21acc39e96] [149-130-218-240:10944] [ 6] ./train_gpt2cu(+0x17762)[0x55d509142762] [149-130-218-240:10944] [ 7] ./train_gpt2cu(+0xf120)[0x55d50913a120] [149-130-218-240:10944] [ 8] /lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f21acc29d90] [149-130-218-240:10944] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f21acc29e40] [149-130-218-240:10944] [10] ./train_gpt2cu(+0x13275)[0x55d50913e275] [149-130-218-240:10944] *** End of error message *** -------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun noticed that process rank 7 with PID 0 on node 149-130-218-240 exited on signal 6 (Aborted).
MPI runs with 4 or 6 GPUs work just fine.
I am running this on CUDA 12.2, without cuDNN, on Lambda Labs cloud.
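For reference, the failing assertion requires every shard to hold at least `num_processes * B * T + 1` tokens. Below is a minimal sketch of that arithmetic using `B = 4` and `T = 1024` from the log above. The helper name `min_shard_tokens` is hypothetical and is not part of llm.c; the actual shard sizes are not shown in the log, so this only computes the thresholds.

```python
def min_shard_tokens(num_processes: int, B: int, T: int) -> int:
    """Smallest shard size (in tokens) that satisfies the assertion
    shard_ntok >= num_processes * B * T + 1 from the log."""
    return num_processes * B * T + 1


# B and T are taken from the parameter table printed by train_gpt2cu.
B, T = 4, 1024

for n in (4, 6, 8):
    print(f"{n} processes -> shard needs >= {min_shard_tokens(n, B, T)} tokens")
```

Since 6 processes pass and 8 fail, this would be consistent with one of the tiny_shakespeare shards having at least 24577 but fewer than 32769 tokens. That is an inference from the thresholds, not something confirmed by the output above.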