I failed to train the LLaMA2 model with the TP=1 strategy using 8 H800 GPUs. However, it runs successfully when TP >=2. I have not found an effective solution yet and hope to get some help from the developers. Thank you! The specific error and environment details are as follows:

error:

training ... (min, max) time across ranks (ms): model-and-optimizer-setup ......................: (15387.72, 17077.63) train/valid/test-data-iterators-setup ..........: (5206.13, 5429.87) [before the start of training step] datetime: 2024-07-19 16:09:07 /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [97,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [98,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [99,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [100,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [101,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [102,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [103,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [104,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [105,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [106,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [107,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [108,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [109,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [110,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [111,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [112,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [113,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [114,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [115,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [116,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [117,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [118,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [119,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [120,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [121,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [122,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [123,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [124,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [125,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [126,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [295,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed. /opt/conda/conda-bld/pytorch_1695392067780/work/aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [294,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed. Traceback (most recent call last): File "/data1/shuyang/megatron_ds/tmp/Megatron-DeepSpeed/pretrain_gpt.py", line 356, in pretrain(train_valid_test_datasets_provider, File "/data1/shuyang/megatron_ds/tmp/Megatron-DeepSpeed/megatron/training.py", line 227, in pretrain iteration = train(forward_step_func, File "/data1/shuyang/megatron_ds/tmp/Megatron-DeepSpeed/megatron/training.py", line 1211, in train train_step(forward_step_func, File "/data1/shuyang/megatron_ds/tmp/Megatron-DeepSpeed/megatron/training.py", line 670, in train_step loss = model[0].train_batch(data_iter=data_iterator) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 377, in train_batch self._exec_schedule(sched) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1434, in _exec_schedule self._exec_instr(cmd.kwargs) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 724, in _exec_forward_pass outputs = super().forward(inputs) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn ret_val = func(*args, *kwargs) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward loss = self.module(inputs, kwargs) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, kwargs) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, *kwargs) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 351, in forward x = func(forward_input) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 344, in exec_func inputs = layer(inputs) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(args, kwargs) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(*args, kwargs) File "/data1/shuyang/megatron_ds/tmp/Megatron-DeepSpeed/megatron/model/language_model.py", line 351, in forward embeddings = super().forward(input_ids, position_ids, tokentype_ids=tokentype_ids) File "/data1/shuyang/megatron_ds/tmp/Megatron-DeepSpeed/megatron/model/language_model.py", line 234, in forward words_embeddings = self.word_embeddings(input_ids) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl return self._call_impl(*args, *kwargs) File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl return forward_call(args, kwargs) File "/data1/shuyang/megatron_ds/tmp/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py", line 204, in forward output_parallel = F.embedding(masked_input, self.weight, File "/data1/miniconda3/envs/shuyang_py10/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: CUDA error: device-side assert triggered Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

startup script

!/bin/bash

This example script is contributed by external user https://github.com/nrailgun

set -ex

######################################

Change the below configurations here

BASE_PATH=./tmp DS_CONFIG=${BASE_PATH}/deepspeed.json DATASET=../../data/my-gpt2_text_document CHECKPOINT_PATH=./tmp TOKENIZER_PATH=./tmp/tokenizer.model # offical llama tokenizer.model

TP=1 PP=2 ZERO_STAGE=1

GPUS_PER_NODE=8 MASTER_ADDR=127.0.0.1 MASTER_PORT=6002 NNODES=1 NODE_RANK=0

HIDDEN_SIZE=4096 # e.g. llama-13b: 5120 FFN_HIDDEN_SIZE=11008 # e.g. llama-13b: 13824 NUM_LAYERS=32 # e.g. llama-13b: 40 NUM_HEADS=32 # e.g. llama-13b: 40 SEQ_LENGTH=4096 NUM_KV_HEADS=32 # llama2 70B uses GQA

MICRO_BATCH_SIZE=1 GLOBAL_BATCH_SIZE=24 # e.g. llama: 4M tokens TRAIN_STEPS=2 # e.g. llama: 1T tokens / 4M tokens_per_batch = 250000 steps LR=3e-4 MIN_LR=3e-5 LR_WARMUP_STEPS=1 WEIGHT_DECAY=0.1 GRAD_CLIP=1

Activation checkpointing saves GPU memory, but reduces training speed

activation_checkpoint="true"

activation_checkpoint="false"

Below configuration required for llama model as per llama paper

--no-query-key-layer-scaling \

--attention-dropout 0 \

--hidden-dropout 0 \

--use-rotary-position-embeddings \

--untie-embeddings-and-output-weights \

--swiglu \

--normalization rmsnorm \

--disable-bias-linear \

######################################

cat < $DS_CONFIG { "train_batch_size" : $GLOBAL_BATCH_SIZE, "train_micro_batch_size_per_gpu": $MICRO_BATCH_SIZE, "steps_per_print": 1, "zero_optimization": { "stage": $ZERO_STAGE }, "fp16": { "enabled": true } } EOT

ds_args="" ds_args=" --deepspeed ${ds_args}" ds_args=" --deepspeed_config=$DS_CONFIG ${ds_args}" ds_args=" --zero-stage=$ZERO_STAGE ${ds_args}"

if [ "${activation_checkpoint}" = "true" ]; then ds_args="--deepspeed-activation-checkpointing ${ds_args}"

old argument for recomputing the transformer layer

ds_args="--checkpoint-activations ${ds_args}"

new argument for recomputing the transformer layer

ds_args="--recompute-granularity full --recompute-method uniform ${ds_args}"

new argument for recomputing only the attention layer

ds_args="--recompute-granularity selective ${ds_args}" fi

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

torchrun $DISTRIBUTED_ARGS \ pretrain_gpt.py \ --tensor-model-parallel-size $TP \ --pipeline-model-parallel-size $PP \ --num-layers $NUM_LAYERS \ --hidden-size $HIDDEN_SIZE \ --ffn-hidden-size $FFN_HIDDEN_SIZE \ --num-attention-heads $NUM_HEADS \ --micro-batch-size $MICRO_BATCH_SIZE \ --global-batch-size $GLOBAL_BATCH_SIZE \ --seq-length $SEQ_LENGTH \ --max-position-embeddings $SEQ_LENGTH \ --train-iters $TRAIN_STEPS \ --data-path $DATASET \ --data-impl mmap \ --tokenizer-type GPTSentencePieceTokenizer \ --tokenizer-model $TOKENIZER_PATH \ --split 949,50,1 \ --distributed-backend nccl \ --lr $LR \ --lr-decay-style cosine \ --min-lr $MIN_LR \ --weight-decay $WEIGHT_DECAY \ --clip-grad $GRAD_CLIP \ --lr-warmup-iters $LR_WARMUP_STEPS \ --optimizer adam \ --adam-beta1 0.9 \ --adam-beta2 0.95 \ --log-interval 1 \ --save-interval 10000 \ --eval-interval 1000 \ --eval-iters 0 \ --fp16 \ --no-query-key-layer-scaling \ --attention-dropout 0 \ --hidden-dropout 0 \ --use-rotary-position-embeddings \ --untie-embeddings-and-output-weights \ --swiglu \ --normalization rmsnorm \ --disable-bias-linear \ --num-key-value-heads $NUM_KV_HEADS \ $ds_args

Dependency version

Megatron: git_hash=7eb36a1 (main branch) torch version 2.1.0
torch cuda version 12.1 nvcc version 12.0 deepspeed info ................... 0.14.5+78c6c449, 78c6c449, master deepspeed-kernels 0.0.1.dev1698255861 apex 0.1 cmake 3.29.5.1 einops 0.8.0 huggingface-hub 0.23.4 identify 2.5.36 Jinja2 3.1.4 mpi4py 3.1.6 nltk 3.8.1 sentencepiece 0.2.0 six 1.16.0 tokenizers 0.19.1 transformers 4.41.2

microsoft / Megatron-DeepSpeed

Bug: TP=1, pretrain_llama2_distributed failed on H800 gpus! #425