Closed: anon-nlp-dev closed this 1 year ago
For example, the following lines in training.py only show the total number of parameters of the underlying dense model, not counting the extra parameters in each MoE layer:
Consider using DeepSpeed FLOPS profiler where MoE parameter count has been added recently: https://github.com/microsoft/DeepSpeed/pull/3720
@clumsy Thank you for responding to my question. However, it seems the code changes made in the PR you mentioned no longer work. I guess it is because of this particular logic:
When I tried to debug and see which `group_name` each parameter has, I got this:
`AttributeError: 'Parameter' object has no attribute 'group_name'`
Which DeepSpeed version are you using, @anon-nlp-dev? `group_name` is accessed only after a `getattr` check, so there should be no `AttributeError`.
Can you also please attach the traceback?
@clumsy I am using DeepSpeed version 0.10.0. The `AttributeError` is not something the source code throws on its own; I got it when I manually tried to print `param.group_name`. You are right, I should have used `getattr` instead.
I reran the experiment with the `getattr` change and got an empty string for `group_name` in every `param` instance.
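For reference, the check reduces to this pattern (a minimal standalone sketch; a plain object without MoE tagging stands in for an untagged parameter):

```python
# Minimal sketch: a parameter that was never tagged by the MoE code has no
# group_name attribute, so getattr with a default returns the fallback
# instead of raising AttributeError.
param = object()  # stand-in for an untagged torch.nn.Parameter

print(hasattr(param, "group_name"))             # → False
print(repr(getattr(param, "group_name", "")))   # → '' (no AttributeError)
```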
I used the following command:
```shell
python -u -m torch.distributed.run --nproc_per_node 4 --nnodes 1 --rdzv_id=12345 --rdzv_endpoint 0.0.0.0:6000 --rdzv_backend c10d --max_restarts 0 --tee 3 pretrain_gpt.py --override-opt_param-scheduler --adam-beta1 0.9 --adam-beta2 0.95 --tensor-model-parallel-size 1 --moe-expert-parallel-size 4 --num-experts 128 --moe-loss-coeff 0.01 --moe-train-capacity-factor 1.0 --moe-eval-capacity-factor 1.0 --moe-min-capacity 4 --init-method-std 0.01 --lr-decay-tokens 300000000000 --lr-warmup-tokens 375000000 --micro-batch-size 2 --exit-duration-in-mins 30000000 --global-batch-size 8 --num-layers 24 --hidden-size 1024 --num-attention-heads 16 --seq-length 2048 --max-position-embeddings 2048 --train-tokens 300000000000 --train-iters 10 --lr 2.0e-4 --min-lr 2e-06 --lr-decay-style cosine --split 98,2,0 --log-interval 5 --eval-interval 100 --eval-iters 50 --save-interval 100 --weight-decay 0.1 --clip-grad 1.0 --hysteresis 2 --num-workers 0 --fp16 --load output/checkpoint/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true --save output/checkpoint/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true --tensorboard-queue-size 1 --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --tensorboard-dir output/tensorboard/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true_n2dgx01_2023.08.02-17.39.02 --checkpoint-activations --create-moe-param-group --vocab-file data/gpt2-vocab.json --merge-file data/gpt2-merges.txt --data-path data/meg-gpt2-oscar-en-10k_text_document --data-impl mmap --deepspeed --deepspeed_config ds_config_gpt_gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true.json --pipeline-model-parallel-size 1 --no-pipeline-parallel --deepspeed-activation-checkpointing
```
with the following deepspeed config:
```json
{
  "train_batch_size": 8,
  "train_micro_batch_size_per_gpu": 2,
  "steps_per_print": 5,
  "zero_optimization": {
    "stage": 2
  },
  "gradient_clipping": 1.0,
  "prescale_gradients": false,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 11
  },
  "bf16": {
    "enabled": false
  },
  "curriculum_learning": {
    "enabled": false,
    "curriculum_type": "seqlen",
    "min_difficulty": 80,
    "max_difficulty": 2048,
    "schedule_type": "fixed_linear",
    "schedule_config": {
      "total_curriculum_step": 7048872,
      "difficulty_step": 8
    }
  },
  "wall_clock_breakdown": true,
  "flops_profiler": {
    "enabled": true,
    "profile_step": 5,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
```
@anon-nlp-dev that's probably because the parameters you checked are non-expert. Expert parameters have `group_name` set in here.
Does the FLOPS profiler print the expert parameter details for you?
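For illustration, a hedged sketch (not DeepSpeed's actual implementation) of how a parameter count could split expert from non-expert parameters using that `group_name` tag; `FakeParam` and the group-name string are made-up stand-ins:

```python
class FakeParam:
    """Stand-in for torch.nn.Parameter; only expert params carry group_name."""
    def __init__(self, numel, group_name=None):
        self._numel = numel
        if group_name is not None:
            self.group_name = group_name  # set only on expert parameters
    def numel(self):
        return self._numel

def count_params(params):
    # Expert parameters are identified by a non-empty group_name attribute;
    # everything else is counted as part of the dense model.
    expert = sum(p.numel() for p in params if getattr(p, "group_name", ""))
    dense = sum(p.numel() for p in params if not getattr(p, "group_name", ""))
    return dense, expert

params = [FakeParam(72), FakeParam(72, "ep_size_4")]  # hypothetical tag
print(count_params(params))  # → (72, 72)
```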
@clumsy The FLOPS profiler doesn't print the expert parameters. It prints this:
[default0]:-------------------------- DeepSpeed Flops Profiler --------------------------
[default0]:Profile Summary at step 6:
[default0]:Notations:
[default0]:data parallel size (dp_size), model parallel size(mp_size),
[default0]:number of parameters (params), number of multiply-accumulate operations(MACs),
[default0]:number of floating-point operations (flops), floating-point operations per second (FLOPS),
[default0]:fwd latency (forward propagation latency), bwd latency (backward propagation latency),
[default0]:step (weights update latency), iter latency (sum of fwd, bwd and step latency)
[default0]:
[default0]:world size: 4
[default0]:data parallel size: 4
[default0]:model parallel size: 1
[default0]:batch size per GPU: 2
[default0]:params per gpu: 355.92 M
[default0]:params of model = params per GPU * mp_size: 355.92 M
[default0]:fwd MACs per GPU: 1860.26 GMACs
[default0]:fwd flops per GPU: 3723.74 G
[default0]:fwd flops of model = fwd flops per GPU * mp_size: 3723.74 G
[default0]:fwd latency: 457.15 ms
[default0]:fwd FLOPS per GPU = fwd flops per GPU / fwd latency: 8.15 TFLOPS
[default0]:bwd latency: 2.93 s
[default0]:bwd FLOPS per GPU = 2.0 * fwd flops per GPU / bwd latency: 2.54 TFLOPS
[default0]:fwd+bwd FLOPS per GPU = 3.0 * fwd flops per GPU / (fwd+bwd latency): 3.3 TFLOPS
[default0]:step latency: 244.41 ms
[default0]:iter latency: 3.63 s
[default0]:FLOPS per GPU = 3.0 * fwd flops per GPU / iter latency: 3.08 TFLOPS
[default0]:samples/second: 2.20
[default0]:
[default0]:----------------------------- Aggregated Profile per GPU -----------------------------
[default0]:Top 1 modules in terms of params, MACs or fwd latency at different model depths:
[default0]:depth 0:
[default0]: params - {'GPTModel': '355.92 M'}
[default0]: MACs - {'GPTModel': '1860.26 GMACs'}
[default0]: fwd latency - {'GPTModel': '457.02 ms'}
[default0]:depth 1:
[default0]: params - {'TransformerLanguageModel': '355.92 M'}
[default0]: MACs - {'TransformerLanguageModel': '1649.27 GMACs'}
[default0]: fwd latency - {'TransformerLanguageModel': '448.68 ms'}
[default0]:depth 2:
[default0]: params - {'ParallelTransformer': '302.31 M'}
[default0]: MACs - {'ParallelTransformer': '1649.27 GMACs'}
[default0]: fwd latency - {'ParallelTransformer': '448.08 ms'}
[default0]:depth 3:
[default0]: params - {'ModuleList': '302.31 M'}
[default0]: MACs - {'ModuleList': '1649.27 GMACs'}
[default0]: fwd latency - {'ModuleList': '446.24 ms'}
[default0]:depth 4:
[default0]: params - {'ParallelTransformerLayer': '302.31 M'}
[default0]: MACs - {'ParallelTransformerLayer': '1649.27 GMACs'}
[default0]: fwd latency - {'ParallelTransformerLayer': '446.24 ms'}
[default0]:depth 5:
[default0]: params - {'ParallelMLP': '201.45 M'}
[default0]: MACs - {'ParallelAttention': '824.63 GMACs'}
[default0]: fwd latency - {'ParallelAttention': '271.74 ms'}
[default0]:depth 6:
[default0]: params - {'ColumnParallelLinear': '176.33 M'}
[default0]: MACs - {'ColumnParallelLinear': '721.55 GMACs'}
[default0]: fwd latency - {'CoreAttention': '148.9 ms'}
[default0]:
[default0]:------------------------------ Detailed Profile per GPU ------------------------------
[default0]:Each module profile is listed after its name in the following order:
[default0]:params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS
[default0]:
[default0]:Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make u$
[default0]:2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
[default0]:3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepS$
[default0]:
[default0]:GPTModel(
[default0]: 355.92 M, 100.00% Params, 1860.26 GMACs, 100.00% MACs, 457.02 ms, 100.00% latency, 8.15 TFLOPS,
[default0]: (language_model): TransformerLanguageModel(
[default0]: 355.92 M, 100.00% Params, 1649.27 GMACs, 88.66% MACs, 448.68 ms, 98.18% latency, 7.36 TFLOPS,
[default0]: (embedding): Embedding(
[default0]: 53.61 M, 15.06% Params, 0 MACs, 0.00% MACs, 532.15 us, 0.12% latency, 0.0 FLOPS,
[default0]: (word_embeddings): VocabParallelEmbedding(51.51 M, 14.47% Params, 0 MACs, 0.00% MACs, 163.79 us, 0.04% latency, 0.0 FLOPS, )
[default0]: (position_embeddings): Embedding(2.1 M, 0.59% Params, 0 MACs, 0.00% MACs, 123.74 us, 0.03% latency, 0.0 FLOPS, 2048, 1024)
[default0]: (embedding_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 79.15 us, 0.02% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (encoder): ParallelTransformer(
[default0]: 302.31 M, 84.94% Params, 1649.27 GMACs, 88.66% MACs, 448.08 ms, 98.04% latency, 7.37 TFLOPS,
[default0]: (layers): ModuleList(
[default0]: 302.31 M, 84.94% Params, 1649.27 GMACs, 88.66% MACs, 446.24 ms, 97.64% latency, 7.4 TFLOPS,
[default0]: (0): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 61.73 ms, 13.51% latency, 2.23 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 117.3 us, 0.03% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 60.15 ms, 13.16% latency, 1.14 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 354.53 us, 0.08% latency, 72.69 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 59.5 ms, 13.02% latency, 579.76 GFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 58.0 ms, 12.69% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 634.91 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 190.5 us, 0.04% latency, 45.09 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 92.51 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 960.35 us, 0.21% latency, 71.56 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 254.39 us, 0.06% latency, 135.07 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 305.18 us, 0.07% latency, 112.59 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (1): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.8 ms, 0.83% latency, 36.19 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.92 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.38 ms, 0.52% latency, 28.92 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 302.79 us, 0.07% latency, 85.11 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.01 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 437.74 us, 0.10% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 594.14 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 156.88 us, 0.03% latency, 54.76 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 84.64 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 922.44 us, 0.20% latency, 74.5 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 250.1 us, 0.05% latency, 137.38 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.49 us, 0.06% latency, 117.07 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (2): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 4.03 ms, 0.88% latency, 34.17 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 84.4 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.6 ms, 0.57% latency, 26.53 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.74 us, 0.07% latency, 86.26 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 2.04 ms, 0.45% latency, 16.9 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.29 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 811.58 us, 0.18% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 156.16 us, 0.03% latency, 55.01 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 922.2 us, 0.20% latency, 74.52 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 250.34 us, 0.05% latency, 137.25 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 294.45 us, 0.06% latency, 116.69 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (3): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.78 ms, 0.83% latency, 36.35 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.3 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.37 ms, 0.52% latency, 29.08 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.26 us, 0.07% latency, 86.4 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.82 ms, 0.40% latency, 18.98 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.77 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 613.69 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 153.54 us, 0.03% latency, 55.95 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 91.55 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 918.39 us, 0.20% latency, 74.83 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.39 us, 0.05% latency, 137.78 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.02 us, 0.06% latency, 117.26 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (4): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.8 ms, 0.83% latency, 36.22 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.73 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.37 ms, 0.52% latency, 29.05 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 304.94 us, 0.07% latency, 84.51 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.06 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.53 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 600.58 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 156.4 us, 0.03% latency, 54.92 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.21 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 927.45 us, 0.20% latency, 74.1 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 251.77 us, 0.06% latency, 136.47 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.3 us, 0.06% latency, 117.55 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (5): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 50.69 ms, 11.09% latency, 2.71 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.25 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 49.22 ms, 10.77% latency, 1.4 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 297.07 us, 0.07% latency, 86.75 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 48.64 ms, 10.64% latency, 709.15 GFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 47.18 ms, 10.32% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 634.91 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 175.48 us, 0.04% latency, 48.95 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 90.36 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 943.42 us, 0.21% latency, 72.84 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 258.45 us, 0.06% latency, 132.95 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 298.26 us, 0.07% latency, 115.2 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (6): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.81 ms, 0.83% latency, 36.13 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.02 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.39 ms, 0.52% latency, 28.77 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 309.47 us, 0.07% latency, 83.27 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.82 ms, 0.40% latency, 18.93 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 432.01 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 604.87 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.21 us, 0.03% latency, 55.34 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 921.73 us, 0.20% latency, 74.56 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.86 us, 0.05% latency, 137.51 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.54 us, 0.06% latency, 117.45 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (7): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.83 ms, 0.84% latency, 35.96 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.82 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.42 ms, 0.53% latency, 28.49 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 302.55 us, 0.07% latency, 85.17 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.86 ms, 0.41% latency, 18.57 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 424.86 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 633.48 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 157.59 us, 0.03% latency, 54.51 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 916.72 us, 0.20% latency, 74.96 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 247.72 us, 0.05% latency, 138.71 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 291.82 us, 0.06% latency, 117.74 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (8): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 32.06 ms, 7.02% latency, 4.29 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.25 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 30.61 ms, 6.70% latency, 2.25 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 28.44 ms, 6.22% latency, 906.14 GFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.89 ms, 0.41% latency, 18.23 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 439.64 us, 0.10% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 644.45 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 168.8 us, 0.04% latency, 50.89 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 86.78 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 930.31 us, 0.20% latency, 73.87 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 250.1 us, 0.05% latency, 137.38 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 295.4 us, 0.06% latency, 116.32 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (9): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.85 ms, 0.84% latency, 35.72 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.21 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.44 ms, 0.53% latency, 28.27 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 304.46 us, 0.07% latency, 84.64 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.87 ms, 0.41% latency, 18.41 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 435.59 us, 0.10% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 659.94 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.93 us, 0.03% latency, 55.09 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.21 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 923.4 us, 0.20% latency, 74.42 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.86 us, 0.05% latency, 137.51 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.97 us, 0.06% latency, 116.88 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (10): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 53.96 ms, 11.81% latency, 2.55 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.82 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.41 ms, 0.53% latency, 28.53 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 300.65 us, 0.07% latency, 85.71 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.85 ms, 0.40% latency, 18.69 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 425.34 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 639.44 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.45 us, 0.03% latency, 55.26 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.54 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 51.03 ms, 11.17% latency, 1.35 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.15 us, 0.05% latency, 137.91 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 351.91 us, 0.08% latency, 97.64 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (11): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.93 ms, 0.86% latency, 35.04 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 88.93 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.48 ms, 0.54% latency, 27.78 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 310.42 us, 0.07% latency, 83.02 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.9 ms, 0.42% latency, 18.11 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 444.65 us, 0.10% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 643.49 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 156.64 us, 0.03% latency, 54.84 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.92 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 928.64 us, 0.20% latency, 74.0 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 252.49 us, 0.06% latency, 136.09 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.54 us, 0.06% latency, 117.45 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (12): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.81 ms, 0.83% latency, 36.15 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.3 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.4 ms, 0.52% latency, 28.72 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 301.36 us, 0.07% latency, 85.51 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.83 ms, 0.40% latency, 18.83 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.05 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 607.73 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.93 us, 0.03% latency, 55.09 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.49 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 918.15 us, 0.20% latency, 74.85 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 250.1 us, 0.05% latency, 137.38 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.02 us, 0.06% latency, 117.26 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (13): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 25.87 ms, 5.66% latency, 5.32 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 92.51 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 24.41 ms, 5.34% latency, 2.82 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 318.29 us, 0.07% latency, 80.96 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.8 ms, 0.39% latency, 19.12 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 423.67 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 608.21 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 22.19 ms, 4.85% latency, 387.19 GFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 85.35 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 933.89 us, 0.20% latency, 73.58 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 253.44 us, 0.06% latency, 135.57 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.49 us, 0.06% latency, 117.07 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (14): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.78 ms, 0.83% latency, 36.35 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.25 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.37 ms, 0.52% latency, 29.1 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 303.98 us, 0.07% latency, 84.77 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.1 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.29 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 594.38 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 153.78 us, 0.03% latency, 55.86 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.45 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 918.39 us, 0.20% latency, 74.83 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 251.29 us, 0.05% latency, 136.73 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.3 us, 0.06% latency, 117.55 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (15): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.91 ms, 0.86% latency, 35.14 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.97 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.5 ms, 0.55% latency, 27.53 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.02 us, 0.07% latency, 86.47 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.95 ms, 0.43% latency, 17.72 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 435.83 us, 0.10% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 618.7 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.93 us, 0.03% latency, 55.09 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 84.16 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 912.67 us, 0.20% latency, 75.3 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 248.91 us, 0.05% latency, 138.04 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 290.87 us, 0.06% latency, 118.13 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (16): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.79 ms, 0.83% latency, 36.27 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.78 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.38 ms, 0.52% latency, 28.98 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 297.78 us, 0.07% latency, 86.54 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.82 ms, 0.40% latency, 18.91 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 425.34 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 623.7 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 154.26 us, 0.03% latency, 55.69 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 84.88 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 923.63 us, 0.20% latency, 74.4 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 248.67 us, 0.05% latency, 138.17 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 302.79 us, 0.07% latency, 113.48 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (17): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 73.86 ms, 16.16% latency, 1.86 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.35 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.4 ms, 0.53% latency, 28.64 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.5 us, 0.07% latency, 86.33 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.85 ms, 0.40% latency, 18.67 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 424.62 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 638.72 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 158.07 us, 0.03% latency, 54.34 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.92 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 917.2 us, 0.20% latency, 74.92 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 251.05 us, 0.05% latency, 136.86 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 290.63 us, 0.06% latency, 118.22 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (18): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 4.02 ms, 0.88% latency, 34.26 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 105.86 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.5 ms, 0.55% latency, 27.49 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 340.22 us, 0.07% latency, 75.74 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.88 ms, 0.41% latency, 18.3 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 442.27 us, 0.10% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 638.25 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 169.28 us, 0.04% latency, 50.74 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 91.55 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 946.28 us, 0.21% latency, 72.62 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.39 us, 0.05% latency, 137.78 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 311.37 us, 0.07% latency, 110.35 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (19): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.78 ms, 0.83% latency, 36.44 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.49 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.36 ms, 0.52% latency, 29.14 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 299.22 us, 0.07% latency, 86.12 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.06 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.05 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 603.68 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.69 us, 0.03% latency, 55.17 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 914.57 us, 0.20% latency, 75.14 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.62 us, 0.05% latency, 137.65 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 290.16 us, 0.06% latency, 118.42 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (20): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 23.72 ms, 5.19% latency, 5.8 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.59 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.38 ms, 0.52% latency, 28.94 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 299.93 us, 0.07% latency, 85.92 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.83 ms, 0.40% latency, 18.87 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 425.58 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 630.14 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 153.54 us, 0.03% latency, 55.95 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 85.12 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 20.85 ms, 4.56% latency, 3.3 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 20.17 ms, 4.41% latency, 1.7 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 298.74 us, 0.07% latency, 115.02 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (21): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.77 ms, 0.83% latency, 36.45 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.78 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.36 ms, 0.52% latency, 29.15 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 299.69 us, 0.07% latency, 85.99 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.07 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.29 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 595.57 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.21 us, 0.03% latency, 55.34 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 913.14 us, 0.20% latency, 75.26 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 248.91 us, 0.05% latency, 138.04 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.06 us, 0.06% latency, 117.65 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (22): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.8 ms, 0.83% latency, 36.22 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.45 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.39 ms, 0.52% latency, 28.76 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.98 us, 0.07% latency, 86.19 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.84 ms, 0.40% latency, 18.76 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 432.49 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 617.27 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 159.03 us, 0.03% latency, 54.02 TFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.21 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 914.1 us, 0.20% latency, 75.18 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.62 us, 0.05% latency, 137.65 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 290.39 us, 0.06% latency, 118.32 TFLOPS, )
[default0]: )
[default0]: )
[default0]: (23): ParallelTransformerLayer(
[default0]: 12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 62.85 ms, 13.75% latency, 2.19 TFLOPS,
[default0]: (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.59 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (self_attention): ParallelAttention(
[default0]: 4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 61.44 ms, 13.44% latency, 1.12 TFLOPS,
[default0]: (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 309.94 us, 0.07% latency, 83.14 TFLOPS, )
[default0]: (core_attention): CoreAttention(
[default0]: 0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.85 ms, 0.41% latency, 18.62 TFLOPS,
[default0]: (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 424.15 us, 0.09% latency, 0.0 FLOPS, )
[default0]: (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 642.3 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]: )
[default0]: (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 59.18 ms, 12.95% latency, 145.15 GFLOPS, )
[default0]: )
[default0]: (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 85.59 us, 0.02% latency, 0.0 FLOPS, )
[default0]: (mlp): ParallelMLP(
[default0]: 8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 916.48 us, 0.20% latency, 74.98 TFLOPS,
[default0]: (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 252.49 us, 0.06% latency, 136.09 TFLOPS, )
[default0]: (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 289.2 us, 0.06% latency, 118.81 TFLOPS, )
[default0]: )
[default0]: )
[default0]: )
[default0]: (final_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.54 us, 0.02% latency, 0.0 FLOPS, )
[default0]: )
[default0]: )
[default0]:)
[default0]:------------------------------------------------------------------------------
@anon-nlp-dev that's not surprising; look at the model layers breakdown. Your model is fully dense, as it only has `ParallelMLP` layers and no MoE layers.
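A minimal sketch of the check being discussed (not the DeepSpeed source): split the parameter count into dense vs. MoE by looking for a `group_name` attribute with `getattr`, so plain dense parameters never raise `AttributeError`. The `Param` class here is a hypothetical stand-in for `torch.nn.Parameter`; in a fully dense model like the one above, the MoE count would come out zero.

```python
class Param:
    """Stand-in for torch.nn.Parameter; MoE params carry a group_name attribute."""

    def __init__(self, numel, group_name=None):
        self._numel = numel
        if group_name is not None:
            self.group_name = group_name

    def numel(self):
        return self._numel


def count_params(params):
    """Return (dense_count, moe_count) using the getattr pattern from the PR."""
    dense = moe = 0
    for p in params:
        # Fall back to "" when the attribute is absent, so dense
        # parameters are counted without raising AttributeError.
        if getattr(p, "group_name", ""):
            moe += p.numel()
        else:
            dense += p.numel()
    return dense, moe


params = [Param(100), Param(50, group_name="ep_size_4"), Param(25)]
print(count_params(params))  # (125, 50)
```

If every parameter yields an empty `group_name` (as reported above), the model never registered MoE parameter groups, which matches the profiler output showing only `ParallelMLP` blocks.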
@clumsy Yes, that seems to be the case. I wonder what I got wrong that kept the model from configuring the MoE layers; I have the following argument values:
[default0]:------------------------ arguments ------------------------
[default0]: accumulate_allreduce_grads_in_fp32 .............. False
[default0]: adam_beta1 ...................................... 0.9
[default0]: adam_beta2 ...................................... 0.95
[default0]: adam_eps ........................................ 1e-08
[default0]: add_bias_linear ................................. True
[default0]: add_position_embedding .......................... True
[default0]: adlr_autoresume ................................. False
[default0]: adlr_autoresume_interval ........................ 1000
[default0]: aml_data_download_path .......................... None
[default0]: apply_layernorm_1p .............................. False
[default0]: apply_query_key_layer_scaling ................... True
[default0]: apply_residual_connection_post_layernorm ........ False
[default0]: async_tensor_model_parallel_allreduce ........... False
[default0]: attention_dropout ............................... 0.1
[default0]: attention_softmax_in_fp32 ....................... False
[default0]: barrier_with_L1_time ............................ True
[default0]: bert_binary_head ................................ True
[default0]: bert_embedder_type .............................. megatron
[default0]: bert_load ....................................... None
[default0]: bf16 ............................................ False
[default0]: bias_dropout_fusion ............................. True
[default0]: bias_gelu_fusion ................................ True
[default0]: biencoder_projection_dim ........................ 0
[default0]: biencoder_shared_query_context_model ............ False
[default0]: block_data_path ................................. None
[default0]: checkpoint_activations .......................... True
[default0]: checkpoint_in_cpu ............................... False
[default0]: checkpoint_num_layers ........................... 1
[default0]: classes_fraction ................................ 1.0
[default0]: clip_grad ....................................... 1.0
[default0]: compression_training ............................ False
[default0]: consumed_train_samples .......................... 0
[default0]: consumed_train_tokens ........................... 0
[default0]: consumed_valid_samples .......................... 0
[default0]: contigious_checkpointing ........................ False
[default0]: cpu_optimizer ................................... False
[default0]: cpu_torch_adam .................................. False
[default0]: create_moe_param_group .......................... True
[default0]: curriculum_learning_legacy ...................... False
[default0]: data_cache_path ................................. None
[default0]: data_efficiency_curriculum_learning ............. False
[default0]: data_impl ....................................... mmap
[default0]: data_parallel_random_init ....................... False
[default0]: data_parallel_size .............................. 4
[default0]: data_path ....................................... ['data/meg-gpt2-oscar-en-10k_text_document']
[default0]: data_per_class_fraction ......................... 1.0
[default0]: data_sharding ................................... True
[default0]: dataloader_type ................................. single
[default0]: DDP_impl ........................................ local
[default0]: decoder_num_layers .............................. None
[default0]: decoder_seq_length .............................. None
[default0]: deepscale ....................................... False
[default0]: deepscale_config ................................ None
[default0]: deepspeed ....................................... True
[default0]: deepspeed_activation_checkpointing .............. True
[default0]: deepspeed_config ................................ ds_config_gpt_gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true.json
[default0]: deepspeed_mpi ................................... False
[default0]: dino_bottleneck_size ............................ 256
[default0]: dino_freeze_last_layer .......................... 1
[default0]: dino_head_hidden_size ........................... 2048
[default0]: dino_local_crops_number ......................... 10
[default0]: dino_local_img_size ............................. 96
[default0]: dino_norm_last_layer ............................ False
[default0]: dino_teacher_temp ............................... 0.07
[default0]: dino_warmup_teacher_temp ........................ 0.04
[default0]: dino_warmup_teacher_temp_epochs ................. 30
[default0]: distribute_checkpointed_activations ............. False
[default0]: distribute_saved_activations .................... False
[default0]: distributed_backend ............................. nccl
[default0]: distributed_timeout_minutes ..................... 10
[default0]: ds_inference .................................... False
[default0]: ds_pipeline_enabled ............................. False
[default0]: embedding_path .................................. None
[default0]: embedding_weights_in_fp32 ....................... False
[default0]: empty_unused_memory_level ....................... 0
[default0]: enable_expert_tensor_parallelism ................ False
[default0]: encoder_num_layers .............................. 24
[default0]: encoder_seq_length .............................. 2048
[default0]: end_weight_decay ................................ 0.1
[default0]: eod_mask_loss ................................... False
[default0]: eval_interval ................................... 100
[default0]: eval_iters ...................................... 50
[default0]: evidence_data_path .............................. None
[default0]: exit_duration_in_mins ........................... 30000000
[default0]: exit_interval ................................... None
[default0]: exit_on_missing_checkpoint ...................... False
[default0]: exit_signal_handler ............................. False
[default0]: expert_interval ................................. 2
[default0]: ffn_hidden_size ................................. 4096
[default0]: finetune ........................................ False
[default0]: fp16 ............................................ True
[default0]: fp16_lm_cross_entropy ........................... False
[default0]: fp32_residual_connection ........................ False
[default0]: fp8_amax_compute_algo ........................... most_recent
[default0]: fp8_amax_history_len ............................ 1
[default0]: fp8_e4m3 ........................................ False
[default0]: fp8_hybrid ...................................... False
[default0]: fp8_interval .................................... 1
[default0]: fp8_margin ...................................... 0
[default0]: fp8_wgrad ....................................... True
[default0]: global_batch_size ............................... 8
[default0]: gradient_accumulation_fusion .................... True
[default0]: head_lr_mult .................................... 1.0
[default0]: hidden_dropout .................................. 0.1
[default0]: hidden_size ..................................... 1024
[default0]: hidden_size_teacher ............................. None
[default0]: hysteresis ...................................... 2
[default0]: ict_head_size ................................... None
[default0]: ict_load ........................................ None
[default0]: img_h ........................................... 224
[default0]: img_w ........................................... 224
[default0]: indexer_batch_size .............................. 128
[default0]: indexer_log_interval ............................ 1000
[default0]: inference ....................................... False
[default0]: inference_batch_times_seqlen_threshold .......... 512
[default0]: init_method_std ................................. 0.01
[default0]: init_method_xavier_uniform ...................... False
[default0]: initial_loss_scale .............................. 4294967296
[default0]: iter_per_epoch .................................. 1250
[default0]: kd .............................................. False
[default0]: kd_alpha_ce ..................................... 1
[default0]: kd_beta_ce ...................................... 1
[default0]: kd_temp ......................................... 1.0
[default0]: kv_channels ..................................... 64
[default0]: layernorm_epsilon ............................... 1e-05
[default0]: lazy_mpu_init ................................... None
[default0]: load ............................................ output/checkpoint/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true
[default0]: load_teacher .................................... None
[default0]: local_rank ...................................... None
[default0]: log_batch_size_to_tensorboard ................... True
[default0]: log_interval .................................... 5
[default0]: log_learning_rate_to_tensorboard ................ True
[default0]: log_loss_scale_to_tensorboard ................... True
[default0]: log_memory_to_tensorboard ....................... False
[default0]: log_num_zeros_in_grad ........................... False
[default0]: log_optimizer_states_to_tensorboard ............. False
[default0]: log_params_norm ................................. False
[default0]: log_timers_to_tensorboard ....................... True
[default0]: log_validation_ppl_to_tensorboard ............... True
[default0]: log_world_size_to_tensorboard ................... False
[default0]: loss_scale ...................................... None
[default0]: loss_scale_window ............................... 1000
[default0]: lr .............................................. 0.0002
[default0]: lr_decay_iters .................................. None
[default0]: lr_decay_samples ................................ None
[default0]: lr_decay_style .................................. cosine
[default0]: lr_decay_tokens ................................. 300000000000
[default0]: lr_warmup_fraction .............................. None
[default0]: lr_warmup_iters ................................. 0
[default0]: lr_warmup_samples ............................... 0
[default0]: lr_warmup_tokens ................................ 375000000
[default0]: make_vocab_size_divisible_by .................... 128
[default0]: mask_factor ..................................... 1.0
[default0]: mask_prob ....................................... 0.15
[default0]: mask_type ....................................... random
[default0]: masked_softmax_fusion ........................... True
[default0]: max_position_embeddings ......................... 2048
[default0]: max_tokens_to_oom ............................... 12000
[default0]: memory_centric_tiled_linear ..................... False
[default0]: merge_file ...................................... data/gpt2-merges.txt
[default0]: micro_batch_size ................................ 2
[default0]: min_loss_scale .................................. 1.0
[default0]: min_lr .......................................... 2e-06
[default0]: mlp_type ........................................ standard
[default0]: mmap_warmup ..................................... False
[default0]: moe_eval_capacity_factor ........................ 1.0
[default0]: moe_expert_parallel_size ........................ 4
[default0]: moe_loss_coeff .................................. 0.01
[default0]: moe_min_capacity ................................ 4
[default0]: moe_token_dropping .............................. True
[default0]: moe_train_capacity_factor ....................... 1.0
[default0]: mos ............................................. False
[default0]: no_load_lr_state ................................ False
[default0]: no_load_optim ................................... None
[default0]: no_load_rng ..................................... None
[default0]: no_persist_layer_norm ........................... False
[default0]: no_pipeline_parallel ............................ True
[default0]: no_save_optim ................................... None
[default0]: no_save_rng ..................................... None
[default0]: normalization ................................... layernorm
[default0]: num_attention_heads ............................. 16
[default0]: num_attention_heads_teacher ..................... None
[default0]: num_channels .................................... 3
[default0]: num_classes ..................................... 1000
[default0]: num_experts ..................................... [128]
[default0]: num_experts_switch .............................. None
[default0]: num_experts_teacher ............................. [1]
[default0]: num_layers ...................................... 24
[default0]: num_layers_per_virtual_pipeline_stage ........... None
[default0]: num_layers_teacher .............................. None
[default0]: num_workers ..................................... 0
[default0]: onnx_safe ....................................... None
[default0]: openai_gelu ..................................... False
[default0]: optimizer ....................................... adam
[default0]: output_bert_embeddings .......................... False
[default0]: overlap_p2p_comm ................................ False
[default0]: override_opt_param_scheduler .................... True
[default0]: params_dtype .................................... torch.float16
[default0]: partition_activations ........................... False
[default0]: patch_dim ....................................... 16
[default0]: perform_initialization .......................... True
[default0]: pipeline_model_parallel_size .................... 1
[default0]: pipeline_model_parallel_split_rank .............. None
[default0]: profile_backward ................................ False
[default0]: query_in_block_prob ............................. 0.1
[default0]: rampup_batch_size ............................... None
[default0]: random_ltd ...................................... False
[default0]: rank ............................................ 0
[default0]: recompute_granularity ........................... None
[default0]: recompute_method ................................ None
[default0]: recompute_num_layers ............................ 1
[default0]: remote_device ................................... none
[default0]: reset_attention_mask ............................ False
[default0]: reset_iteration ................................. False
[default0]: reset_position_ids .............................. False
[default0]: retriever_report_topk_accuracies ................ []
[default0]: retriever_score_scaling ......................... False
[default0]: retriever_seq_length ............................ 256
[default0]: retro_add_retriever ............................. False
[default0]: retro_cyclic_train_iters ........................ None
[default0]: retro_encoder_attention_dropout ................. 0.1
[default0]: retro_encoder_hidden_dropout .................... 0.1
[default0]: retro_encoder_layers ............................ 2
[default0]: retro_num_neighbors ............................. 2
[default0]: retro_num_retrieved_chunks ...................... 2
[default0]: retro_return_doc_ids ............................ False
[default0]: retro_workdir ................................... None
[default0]: return_data_index ............................... False
[default0]: rotary_percent .................................. 1.0
[default0]: sample_rate ..................................... 1.0
[default0]: save ............................................ output/checkpoint/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true
[default0]: save_interval ................................... 100
[default0]: scatter_gather_tensors_in_pipeline .............. True
[default0]: scattered_embeddings ............................ False
[default0]: seed ............................................ 1234
[default0]: seq_length ...................................... 2048
[default0]: sequence_parallel ............................... False
[default0]: sgd_momentum .................................... 0.9
[default0]: short_seq_prob .................................. 0.1
[default0]: skip_train ...................................... False
[default0]: split ........................................... 98,2,0
[default0]: split_transformers .............................. False
[default0]: squared_relu .................................... False
[default0]: standalone_embedding_stage ...................... False
[default0]: start_weight_decay .............................. 0.1
[default0]: swiglu .......................................... False
[default0]: swin_backbone_type .............................. tiny
[default0]: synchronize_each_layer .......................... False
[default0]: tensor_model_parallel_size ...................... 1
[default0]: tensorboard_dir ................................. output/tensorboard/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true_n2dgx01_2023.08.02-17.39.02
[default0]: tensorboard_log_interval ........................ 1
[default0]: tensorboard_queue_size .......................... 1
[default0]: test_data_path .................................. None
[default0]: tile_factor ..................................... 1
[default0]: timing_log_level ................................ 0
[default0]: timing_log_option ............................... minmax
[default0]: titles_data_path ................................ None
[default0]: tokenizer_model ................................. None
[default0]: tokenizer_type .................................. GPT2BPETokenizer
[default0]: topk ............................................ 1
[default0]: train_data_exact_num_epochs ..................... None
[default0]: train_data_path ................................. None
[default0]: train_desc_path ................................. None
[default0]: train_doc_idx_path .............................. None
[default0]: train_idx_path .................................. None
[default0]: train_iters ..................................... 10
[default0]: train_sample_idx_path ........................... None
[default0]: train_samples ................................... None
[default0]: train_shuffle_idx_path .......................... None
[default0]: train_tokens .................................... 300000000000
[default0]: transformer_impl ................................ local
[default0]: transformer_pipeline_model_parallel_size ........ 1
[default0]: untie_embeddings_and_output_weights ............. False
[default0]: use_checkpoint_args ............................. False
[default0]: use_checkpoint_opt_param_scheduler .............. False
[default0]: use_contiguous_buffers_in_local_ddp ............. True
[default0]: use_cpu_initialization .......................... None
[default0]: use_distributed_optimizer ....................... False
[default0]: use_flash_attn .................................. False
[default0]: use_one_sent_docs ............................... False
[default0]: use_pin_memory .................................. False
[default0]: use_ring_exchange_p2p ........................... False
[default0]: use_rotary_position_embeddings .................. False
[default0]: use_tutel ....................................... False
[default0]: valid_data_path ................................. None
[default0]: variable_seq_lengths ............................ False
[default0]: virtual_pipeline_model_parallel_size ............ None
[default0]: vision_backbone_type ............................ vit
[default0]: vision_pretraining .............................. False
[default0]: vision_pretraining_type ......................... classify
[default0]: vocab_extra_ids ................................. 0
[default0]: vocab_file ...................................... data/gpt2-vocab.json
[default0]: vocab_size ...................................... None
[default0]: weight_decay .................................... 0.1
[default0]: weight_decay_incr_style ......................... constant
[default0]: world_size ...................................... 4
[default0]: zero_allgather_bucket_size ...................... 0.0
[default0]: zero_contigious_gradients ....................... False
[default0]: zero_reduce_bucket_size ......................... 0.0
[default0]: zero_reduce_scatter ............................. False
[default0]: zero_stage ...................................... 1.0
[default0]:-------------------- end of arguments ---------------------
@clumsy I think I found the root cause of my issue. The language model implementation expects two boolean values, `add_encoder` and `add_decoder`:
The GPT implementation doesn't seem to pass those arguments, which means the language model will be built as encoder-only by default: https://github.com/microsoft/Megatron-DeepSpeed/blob/0e9b76563bcbae38ada65b47cdbc35ec0f8816e2/megatron/model/gpt_model.py#L75-L82
Also, while building the encoder layers, no expert information is passed: https://github.com/microsoft/Megatron-DeepSpeed/blob/0e9b76563bcbae38ada65b47cdbc35ec0f8816e2/megatron/model/language_model.py#L427-L435
Edit: Looking at https://github.com/NVIDIA/Megatron-LM/issues/439, a GPT implementation with only encoder layers seems to be fine, as long as a causal attention mask is used. However, since no expert information is passed to the encoder, `experts` keeps its default value.
This basically means that all the MoE examples here aren't really building any MoE layers: https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/MoE
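To make the failure mode concrete, here is a minimal, self-contained sketch (all names here are hypothetical, not the actual Megatron-DeepSpeed API): if a model builder never forwards `num_experts` down to the layer constructor, the constructor's default of 1 expert silently produces ordinary dense FFN layers.

```python
# Illustration of the bug pattern described above: expert info that is
# parsed from the CLI but never forwarded to the layer constructor.
# All function and field names are hypothetical stand-ins.

def build_layer(hidden, num_experts=1):
    """Stand-in for a transformer-layer constructor.

    One FFN weight block (two h x 4h matrices) per expert;
    num_experts == 1 means an ordinary dense layer.
    """
    return {"experts": num_experts,
            "ffn_params": num_experts * 8 * hidden * hidden}

def build_model(num_layers, hidden, num_experts=None, forward_experts=True):
    layers = []
    for i in range(num_layers):
        if forward_experts and num_experts is not None and i % 2 == 1:
            # Correct behavior: every other layer becomes an MoE layer.
            layers.append(build_layer(hidden, num_experts[0]))
        else:
            # Bug analog: expert info dropped -> default of 1 expert,
            # so the "MoE" model is really a dense model.
            layers.append(build_layer(hidden))
    return layers

dense = build_model(4, 1024, num_experts=[8], forward_experts=False)
moe = build_model(4, 1024, num_experts=[8])
print([l["experts"] for l in dense])  # every layer falls back to 1 expert
print([l["experts"] for l in moe])    # alternating dense/MoE layers
```

This also explains why the reported parameter count matches the underlying dense model: the extra expert weights are never instantiated in the first place.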
Looks like this got fixed here: https://github.com/microsoft/Megatron-DeepSpeed/pull/192
Is there a formula for estimating the size of a mixture-of-experts model?
I was looking at the model variants in the following blog post and wondered whether there is a formula to compute the model size for the GPT variants: https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale#our-moe-based-nlg-model-architecture
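A rough back-of-the-envelope estimate (a sketch, not the exact Megatron-DeepSpeed accounting; biases, layernorms, and router weights are omitted) is: embeddings plus, per layer, ~4h² attention parameters and ~8h² FFN parameters, where each MoE layer multiplies its FFN parameters by the number of experts. The vocab size of 50257 below is an assumption (GPT-2 BPE).

```python
# Rough parameter-count estimate for a GPT-style MoE model where every
# other transformer layer replaces its FFN with expert FFNs.
# Biases, layernorms, and gating/router parameters are ignored.

def moe_param_estimate(num_layers, hidden, vocab, num_experts,
                       moe_layer_freq=2, seq_len=2048):
    embed = vocab * hidden + seq_len * hidden  # token + position embeddings
    attn_per_layer = 4 * hidden * hidden       # Q, K, V, output projections
    ffn_per_layer = 8 * hidden * hidden        # two h x 4h matrices
    moe_layers = num_layers // moe_layer_freq
    dense_layers = num_layers - moe_layers
    dense = dense_layers * (attn_per_layer + ffn_per_layer)
    moe = moe_layers * (attn_per_layer + num_experts * ffn_per_layer)
    return embed + dense + moe

# The 0.35B config from the command above: 24 layers, hidden 1024, 128 experts.
print(moe_param_estimate(24, 1024, 50257, 1))    # dense baseline, ~0.35B
print(moe_param_estimate(24, 1024, 50257, 128))  # with 128 experts, ~13B
```

With `num_experts=1` this reproduces the ~0.35B dense baseline from the run above, and with 128 experts it lands near the ~13B figure the DeepSpeed MoE blog quotes for that variant, which suggests the estimate is in the right ballpark.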