microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

MoE Model size estimation #187

Closed anon-nlp-dev closed 1 year ago

anon-nlp-dev commented 1 year ago

Is there a formula to estimate the model size of a mixture-of-experts based model?

I was looking at the model variants in the following blog and wondered whether there's a formula to compute the model size for the GPT variants: https://www.microsoft.com/en-us/research/blog/deepspeed-advancing-moe-inference-and-training-to-power-next-generation-ai-scale#our-moe-based-nlg-model-architecture
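For reference, my own back-of-the-envelope attempt looks like the following (my assumptions, not an official formula: experts replace the dense FFN in every other layer, as the blog's NLG models do, and gate parameters are ignored):

```python
def estimate_moe_params(vocab_size, seq_len, hidden, num_layers,
                        num_experts, moe_layer_freq=2):
    """Back-of-the-envelope GPT + MoE parameter count.

    Assumes experts replace the dense FFN in every `moe_layer_freq`-th
    layer (every other layer for the blog's models), counting weights,
    biases and layernorms; router/gate parameters are ignored.
    """
    embeddings = vocab_size * hidden + seq_len * hidden   # word + position
    attention = 4 * hidden * hidden + 4 * hidden          # QKV + output projection
    ffn = 8 * hidden * hidden + 5 * hidden                # h->4h and 4h->h
    layernorms = 4 * hidden                               # 2 LayerNorms (weight + bias)
    dense = embeddings + num_layers * (attention + ffn + layernorms)
    moe_layers = num_layers // moe_layer_freq
    extra_experts = moe_layers * (num_experts - 1) * ffn  # extra FFN copies
    return dense, dense + extra_experts

# 350M base model: hidden=1024, 24 layers, padded vocab 50304, 128 experts
dense, total = estimate_moe_params(50304, 2048, 1024, 24, 128)
# dense ≈ 356M, total ≈ 13.1B
```

This gives roughly 356M dense and 13B total parameters, which matches the blog's 350M+MoE-128 = 13B entry, but I'd like to confirm it's the intended formula.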

anon-nlp-dev commented 1 year ago

For example, the following lines in training.py only show the total number of parameters of the underlying dense model, without counting the extra parameters in each MoE layer:

https://github.com/microsoft/Megatron-DeepSpeed/blob/69e3c6a0d6d5af7d38f3311a7e11e40ba8280b34/megatron/training.py#L360-L366
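What I'd have expected is a count that handles expert parameters separately, along these lines (an illustrative sketch, assuming expert parameters are tagged with `allreduce = False` the way DeepSpeed's MoE layers tag them):

```python
import torch
import torch.distributed as dist

def count_dense_and_expert_params(model, expert_parallel_group=None):
    """Split a model's parameter count into dense and expert parts.

    Assumes expert parameters carry `allreduce = False` (set by
    DeepSpeed's MoE layers); the full expert count is the local
    shard summed over the expert-parallel group.
    """
    dense = sum(p.numel() for p in model.parameters()
                if getattr(p, 'allreduce', True))
    local_expert = sum(p.numel() for p in model.parameters()
                       if not getattr(p, 'allreduce', True))
    expert = torch.tensor(local_expert)
    if dist.is_initialized():
        # each rank only holds num_experts / ep_size experts
        dist.all_reduce(expert, group=expert_parallel_group)
    return dense, expert.item()
```

The all-reduce over the expert-parallel group is needed because each rank only holds its local shard of experts.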

clumsy commented 1 year ago

Consider using the DeepSpeed FLOPS profiler, where MoE parameter counting was added recently: https://github.com/microsoft/DeepSpeed/pull/3720

anon-nlp-dev commented 1 year ago

> Consider using the DeepSpeed FLOPS profiler, where MoE parameter counting was added recently: microsoft/DeepSpeed#3720

@clumsy Thank you for responding to my question. However, the code changes made in the PR you mentioned no longer seem to work, I guess because of this particular logic:

https://github.com/microsoft/DeepSpeed/pull/3720/files#diff-7d85ad19fbb2e2dccdb3d35e66dab773d492f7d73ef52f51e00088cd8d140ca2R166-R174

When I tried to debug and see which group_name each parameter has, I got: `AttributeError: 'Parameter' object has no attribute 'group_name'`

clumsy commented 1 year ago

Which DeepSpeed version are you using, @anon-nlp-dev? `group_name` is accessed only after a `getattr` check, so there should be no AttributeError. Can you also please attach the traceback?

anon-nlp-dev commented 1 year ago

> Which DeepSpeed version are you using, @anon-nlp-dev? `group_name` is accessed only after a `getattr` check, so there should be no AttributeError. Can you also please attach the traceback?

@clumsy I am using DeepSpeed version 0.10.0. The AttributeError is not something the source code throws on its own; I got it when I manually tried to print param.group_name. You are right, I should have used getattr instead.

I reran the experiment with the getattr change and got an empty string for group_name on every param instance.
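The check I reran looks roughly like this (a hypothetical debug helper, not code from the repo; the `getattr` default avoids the AttributeError on dense parameters):

```python
def collect_expert_groups(model):
    """Map parameter names to their MoE group, skipping dense params.

    Dense parameters have no (or an empty) `group_name`, so a
    `getattr` with a default avoids the AttributeError.
    """
    groups = {}
    for name, param in model.named_parameters():
        group = getattr(param, 'group_name', '')
        if group:  # only expert parameters should carry a group name
            groups[name] = group
    return groups
```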

I used the following command:

```shell
python -u -m torch.distributed.run \
  --nproc_per_node 4 --nnodes 1 --rdzv_id=12345 --rdzv_endpoint 0.0.0.0:6000 \
  --rdzv_backend c10d --max_restarts 0 --tee 3 \
  pretrain_gpt.py \
  --override-opt_param-scheduler \
  --adam-beta1 0.9 --adam-beta2 0.95 \
  --tensor-model-parallel-size 1 \
  --moe-expert-parallel-size 4 --num-experts 128 --moe-loss-coeff 0.01 \
  --moe-train-capacity-factor 1.0 --moe-eval-capacity-factor 1.0 --moe-min-capacity 4 \
  --init-method-std 0.01 \
  --lr-decay-tokens 300000000000 --lr-warmup-tokens 375000000 \
  --micro-batch-size 2 --exit-duration-in-mins 30000000 --global-batch-size 8 \
  --num-layers 24 --hidden-size 1024 --num-attention-heads 16 \
  --seq-length 2048 --max-position-embeddings 2048 \
  --train-tokens 300000000000 --train-iters 10 \
  --lr 2.0e-4 --min-lr 2e-06 --lr-decay-style cosine \
  --split 98,2,0 \
  --log-interval 5 --eval-interval 100 --eval-iters 50 --save-interval 100 \
  --weight-decay 0.1 --clip-grad 1.0 --hysteresis 2 --num-workers 0 --fp16 \
  --load output/checkpoint/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true \
  --save output/checkpoint/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true \
  --tensorboard-queue-size 1 \
  --log-timers-to-tensorboard --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard \
  --tensorboard-dir output/tensorboard/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true_n2dgx01_2023.08.02-17.39.02 \
  --checkpoint-activations --create-moe-param-group \
  --vocab-file data/gpt2-vocab.json --merge-file data/gpt2-merges.txt \
  --data-path data/meg-gpt2-oscar-en-10k_text_document --data-impl mmap \
  --deepspeed \
  --deepspeed_config ds_config_gpt_gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true.json \
  --pipeline-model-parallel-size 1 --no-pipeline-parallel \
  --deepspeed-activation-checkpointing
```

with the following deepspeed config:

```json
{
  "train_batch_size" : 8,
  "train_micro_batch_size_per_gpu": 2,
  "steps_per_print": 5,

  "zero_optimization": {
    "stage": 2
  },

  "gradient_clipping": 1.0,
  "prescale_gradients": false,

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 11
  },

  "bf16": {
    "enabled": false
  },
  "curriculum_learning": {
    "enabled": false,
    "curriculum_type": "seqlen",
    "min_difficulty": 80,
    "max_difficulty": 2048,
    "schedule_type": "fixed_linear",
    "schedule_config": {
      "total_curriculum_step": 7048872,
      "difficulty_step": 8
    }
  },

  "wall_clock_breakdown" : true,
  "flops_profiler": {
    "enabled": true,
    "profile_step": 5,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
```
clumsy commented 1 year ago

@anon-nlp-dev that's probably because the parameters you checked are non-expert.

Expert parameters have group_name set in here.

Does the FLOPS profiler print the expert parameter details for you?
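The tagging happens when the expert copies are wrapped, roughly like this (a simplified sketch of what DeepSpeed's MoE experts module does, not the exact source):

```python
import copy
import torch.nn as nn

class Experts(nn.Module):
    """Simplified sketch of how DeepSpeed tags expert parameters."""
    def __init__(self, expert, num_local_experts, expert_group_name):
        super().__init__()
        # each rank keeps num_local_experts deep copies of the expert module
        self.experts = nn.ModuleList(
            copy.deepcopy(expert) for _ in range(num_local_experts))
        for e in self.experts:
            for param in e.parameters():
                param.allreduce = False               # excluded from the dense all-reduce
                param.group_name = expert_group_name  # what the profiler/optimizer check
```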

anon-nlp-dev commented 1 year ago

> @anon-nlp-dev that's probably because the parameters you checked are non-expert.
>
> Expert parameters have group_name set in here.
>
> Does the FLOPS profiler print the expert parameter details for you?

@clumsy The FLOPS profiler doesn't print the expert parameters. It prints this:

[default0]:-------------------------- DeepSpeed Flops Profiler --------------------------
[default0]:Profile Summary at step 6:
[default0]:Notations:
[default0]:data parallel size (dp_size), model parallel size(mp_size),
[default0]:number of parameters (params), number of multiply-accumulate operations(MACs),
[default0]:number of floating-point operations (flops), floating-point operations per second (FLOPS),
[default0]:fwd latency (forward propagation latency), bwd latency (backward propagation latency),
[default0]:step (weights update latency), iter latency (sum of fwd, bwd and step latency)
[default0]:
[default0]:world size:                                                   4
[default0]:data parallel size:                                           4
[default0]:model parallel size:                                          1
[default0]:batch size per GPU:                                           2
[default0]:params per gpu:                                               355.92 M
[default0]:params of model = params per GPU * mp_size:               355.92 M
[default0]:fwd MACs per GPU:                                             1860.26 GMACs
[default0]:fwd flops per GPU:                                            3723.74 G
[default0]:fwd flops of model = fwd flops per GPU * mp_size:             3723.74 G
[default0]:fwd latency:                                                  457.15 ms
[default0]:fwd FLOPS per GPU = fwd flops per GPU / fwd latency:          8.15 TFLOPS
[default0]:bwd latency:                                                  2.93 s
[default0]:bwd FLOPS per GPU = 2.0 * fwd flops per GPU / bwd latency:    2.54 TFLOPS
[default0]:fwd+bwd FLOPS per GPU = 3.0 * fwd flops per GPU / (fwd+bwd latency):   3.3 TFLOPS
[default0]:step latency:                                                 244.41 ms
[default0]:iter latency:                                                 3.63 s
[default0]:FLOPS per GPU = 3.0 * fwd flops per GPU / iter latency:       3.08 TFLOPS
[default0]:samples/second:                                               2.20
[default0]:
[default0]:----------------------------- Aggregated Profile per GPU -----------------------------
[default0]:Top 1 modules in terms of params, MACs or fwd latency at different model depths:
[default0]:depth 0:
[default0]:    params      - {'GPTModel': '355.92 M'}
[default0]:    MACs        - {'GPTModel': '1860.26 GMACs'}
[default0]:    fwd latency - {'GPTModel': '457.02 ms'}
[default0]:depth 1:
[default0]:    params      - {'TransformerLanguageModel': '355.92 M'}
[default0]:    MACs        - {'TransformerLanguageModel': '1649.27 GMACs'}
[default0]:    fwd latency - {'TransformerLanguageModel': '448.68 ms'}
[default0]:depth 2:
[default0]:    params      - {'ParallelTransformer': '302.31 M'}
[default0]:    MACs        - {'ParallelTransformer': '1649.27 GMACs'}
[default0]:    fwd latency - {'ParallelTransformer': '448.08 ms'}
[default0]:depth 3:
[default0]:    params      - {'ModuleList': '302.31 M'}
[default0]:    MACs        - {'ModuleList': '1649.27 GMACs'}
[default0]:    fwd latency - {'ModuleList': '446.24 ms'}
[default0]:depth 4:
[default0]:    params      - {'ParallelTransformerLayer': '302.31 M'}
[default0]:    MACs        - {'ParallelTransformerLayer': '1649.27 GMACs'}
[default0]:    fwd latency - {'ParallelTransformerLayer': '446.24 ms'}
[default0]:depth 5:
[default0]:    params      - {'ParallelMLP': '201.45 M'}
[default0]:    MACs        - {'ParallelAttention': '824.63 GMACs'}
[default0]:    fwd latency - {'ParallelAttention': '271.74 ms'}
[default0]:depth 6:
[default0]:    params      - {'ColumnParallelLinear': '176.33 M'}
[default0]:    MACs        - {'ColumnParallelLinear': '721.55 GMACs'}
[default0]:    fwd latency - {'CoreAttention': '148.9 ms'}
[default0]:
[default0]:------------------------------ Detailed Profile per GPU ------------------------------
[default0]:Each module profile is listed after its name in the following order:
[default0]:params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS
[default0]:
[default0]:Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs (or latency) and the sum of its submodules'.
[default0]:2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
[default0]:3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.
[default0]:
[default0]:GPTModel(
[default0]:  355.92 M, 100.00% Params, 1860.26 GMACs, 100.00% MACs, 457.02 ms, 100.00% latency, 8.15 TFLOPS,
[default0]:  (language_model): TransformerLanguageModel(
[default0]:    355.92 M, 100.00% Params, 1649.27 GMACs, 88.66% MACs, 448.68 ms, 98.18% latency, 7.36 TFLOPS,
[default0]:    (embedding): Embedding(
[default0]:      53.61 M, 15.06% Params, 0 MACs, 0.00% MACs, 532.15 us, 0.12% latency, 0.0 FLOPS,
[default0]:      (word_embeddings): VocabParallelEmbedding(51.51 M, 14.47% Params, 0 MACs, 0.00% MACs, 163.79 us, 0.04% latency, 0.0 FLOPS, )
[default0]:      (position_embeddings): Embedding(2.1 M, 0.59% Params, 0 MACs, 0.00% MACs, 123.74 us, 0.03% latency, 0.0 FLOPS, 2048, 1024)
[default0]:      (embedding_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 79.15 us, 0.02% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:    )
[default0]:    (encoder): ParallelTransformer(
[default0]:      302.31 M, 84.94% Params, 1649.27 GMACs, 88.66% MACs, 448.08 ms, 98.04% latency, 7.37 TFLOPS,
[default0]:      (layers): ModuleList(
[default0]:        302.31 M, 84.94% Params, 1649.27 GMACs, 88.66% MACs, 446.24 ms, 97.64% latency, 7.4 TFLOPS,
[default0]:        (0): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 61.73 ms, 13.51% latency, 2.23 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 117.3 us, 0.03% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 60.15 ms, 13.16% latency, 1.14 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 354.53 us, 0.08% latency, 72.69 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:          0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 59.5 ms, 13.02% latency, 579.76 GFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 58.0 ms, 12.69% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 634.91 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 190.5 us, 0.04% latency, 45.09 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 92.51 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 960.35 us, 0.21% latency, 71.56 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 254.39 us, 0.06% latency, 135.07 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 305.18 us, 0.07% latency, 112.59 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (1): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.8 ms, 0.83% latency, 36.19 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.92 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.38 ms, 0.52% latency, 28.92 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 302.79 us, 0.07% latency, 85.11 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.01 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 437.74 us, 0.10% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 594.14 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 156.88 us, 0.03% latency, 54.76 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 84.64 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 922.44 us, 0.20% latency, 74.5 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 250.1 us, 0.05% latency, 137.38 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.49 us, 0.06% latency, 117.07 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (2): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 4.03 ms, 0.88% latency, 34.17 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 84.4 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.6 ms, 0.57% latency, 26.53 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.74 us, 0.07% latency, 86.26 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 2.04 ms, 0.45% latency, 16.9 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.29 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 811.58 us, 0.18% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 156.16 us, 0.03% latency, 55.01 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 922.2 us, 0.20% latency, 74.52 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 250.34 us, 0.05% latency, 137.25 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 294.45 us, 0.06% latency, 116.69 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (3): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.78 ms, 0.83% latency, 36.35 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.3 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.37 ms, 0.52% latency, 29.08 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.26 us, 0.07% latency, 86.4 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.82 ms, 0.40% latency, 18.98 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.77 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 613.69 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 153.54 us, 0.03% latency, 55.95 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 91.55 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 918.39 us, 0.20% latency, 74.83 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.39 us, 0.05% latency, 137.78 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.02 us, 0.06% latency, 117.26 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (4): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.8 ms, 0.83% latency, 36.22 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.73 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.37 ms, 0.52% latency, 29.05 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 304.94 us, 0.07% latency, 84.51 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.06 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.53 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 600.58 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 156.4 us, 0.03% latency, 54.92 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.21 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 927.45 us, 0.20% latency, 74.1 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 251.77 us, 0.06% latency, 136.47 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.3 us, 0.06% latency, 117.55 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (5): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 50.69 ms, 11.09% latency, 2.71 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.25 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 49.22 ms, 10.77% latency, 1.4 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 297.07 us, 0.07% latency, 86.75 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 48.64 ms, 10.64% latency, 709.15 GFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 47.18 ms, 10.32% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 634.91 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 175.48 us, 0.04% latency, 48.95 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 90.36 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 943.42 us, 0.21% latency, 72.84 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 258.45 us, 0.06% latency, 132.95 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 298.26 us, 0.07% latency, 115.2 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (6): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.81 ms, 0.83% latency, 36.13 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.02 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.39 ms, 0.52% latency, 28.77 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 309.47 us, 0.07% latency, 83.27 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.82 ms, 0.40% latency, 18.93 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 432.01 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 604.87 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.21 us, 0.03% latency, 55.34 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 921.73 us, 0.20% latency, 74.56 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.86 us, 0.05% latency, 137.51 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.54 us, 0.06% latency, 117.45 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (7): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.83 ms, 0.84% latency, 35.96 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.82 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.42 ms, 0.53% latency, 28.49 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 302.55 us, 0.07% latency, 85.17 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.86 ms, 0.41% latency, 18.57 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 424.86 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 633.48 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 157.59 us, 0.03% latency, 54.51 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 916.72 us, 0.20% latency, 74.96 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 247.72 us, 0.05% latency, 138.71 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 291.82 us, 0.06% latency, 117.74 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (8): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 32.06 ms, 7.02% latency, 4.29 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.25 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 30.61 ms, 6.70% latency, 2.25 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 28.44 ms, 6.22% latency, 906.14 GFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.89 ms, 0.41% latency, 18.23 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 439.64 us, 0.10% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 644.45 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 168.8 us, 0.04% latency, 50.89 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 86.78 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 930.31 us, 0.20% latency, 73.87 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 250.1 us, 0.05% latency, 137.38 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 295.4 us, 0.06% latency, 116.32 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (9): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.85 ms, 0.84% latency, 35.72 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.21 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.44 ms, 0.53% latency, 28.27 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 304.46 us, 0.07% latency, 84.64 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.87 ms, 0.41% latency, 18.41 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 435.59 us, 0.10% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 659.94 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.93 us, 0.03% latency, 55.09 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.21 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 923.4 us, 0.20% latency, 74.42 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.86 us, 0.05% latency, 137.51 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.97 us, 0.06% latency, 116.88 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (10): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 53.96 ms, 11.81% latency, 2.55 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.82 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.41 ms, 0.53% latency, 28.53 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 300.65 us, 0.07% latency, 85.71 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.85 ms, 0.40% latency, 18.69 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 425.34 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 639.44 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.45 us, 0.03% latency, 55.26 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.54 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 51.03 ms, 11.17% latency, 1.35 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.15 us, 0.05% latency, 137.91 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 351.91 us, 0.08% latency, 97.64 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (11): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.93 ms, 0.86% latency, 35.04 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 88.93 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.48 ms, 0.54% latency, 27.78 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 310.42 us, 0.07% latency, 83.02 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.9 ms, 0.42% latency, 18.11 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 444.65 us, 0.10% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 643.49 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 156.64 us, 0.03% latency, 54.84 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.92 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 928.64 us, 0.20% latency, 74.0 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 252.49 us, 0.06% latency, 136.09 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.54 us, 0.06% latency, 117.45 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (12): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.81 ms, 0.83% latency, 36.15 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.3 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.4 ms, 0.52% latency, 28.72 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 301.36 us, 0.07% latency, 85.51 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.83 ms, 0.40% latency, 18.83 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.05 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 607.73 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.93 us, 0.03% latency, 55.09 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.49 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 918.15 us, 0.20% latency, 74.85 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 250.1 us, 0.05% latency, 137.38 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.02 us, 0.06% latency, 117.26 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (13): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 25.87 ms, 5.66% latency, 5.32 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 92.51 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 24.41 ms, 5.34% latency, 2.82 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 318.29 us, 0.07% latency, 80.96 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.8 ms, 0.39% latency, 19.12 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 423.67 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 608.21 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 22.19 ms, 4.85% latency, 387.19 GFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 85.35 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 933.89 us, 0.20% latency, 73.58 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 253.44 us, 0.06% latency, 135.57 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 293.49 us, 0.06% latency, 117.07 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (14): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.78 ms, 0.83% latency, 36.35 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.25 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.37 ms, 0.52% latency, 29.1 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 303.98 us, 0.07% latency, 84.77 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.1 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.29 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 594.38 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 153.78 us, 0.03% latency, 55.86 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.45 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 918.39 us, 0.20% latency, 74.83 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 251.29 us, 0.05% latency, 136.73 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.3 us, 0.06% latency, 117.55 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (15): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.91 ms, 0.86% latency, 35.14 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.97 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.5 ms, 0.55% latency, 27.53 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.02 us, 0.07% latency, 86.47 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.95 ms, 0.43% latency, 17.72 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 435.83 us, 0.10% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 618.7 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.93 us, 0.03% latency, 55.09 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 84.16 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 912.67 us, 0.20% latency, 75.3 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 248.91 us, 0.05% latency, 138.04 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 290.87 us, 0.06% latency, 118.13 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (16): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.79 ms, 0.83% latency, 36.27 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.78 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.38 ms, 0.52% latency, 28.98 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 297.78 us, 0.07% latency, 86.54 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.82 ms, 0.40% latency, 18.91 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 425.34 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 623.7 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 154.26 us, 0.03% latency, 55.69 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 84.88 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 923.63 us, 0.20% latency, 74.4 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 248.67 us, 0.05% latency, 138.17 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 302.79 us, 0.07% latency, 113.48 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (17): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 73.86 ms, 16.16% latency, 1.86 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.35 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.4 ms, 0.53% latency, 28.64 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.5 us, 0.07% latency, 86.33 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.85 ms, 0.40% latency, 18.67 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 424.62 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 638.72 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 158.07 us, 0.03% latency, 54.34 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.92 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 917.2 us, 0.20% latency, 74.92 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 251.05 us, 0.05% latency, 136.86 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 290.63 us, 0.06% latency, 118.22 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (18): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 4.02 ms, 0.88% latency, 34.26 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 105.86 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.5 ms, 0.55% latency, 27.49 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 340.22 us, 0.07% latency, 75.74 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.88 ms, 0.41% latency, 18.3 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 442.27 us, 0.10% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 638.25 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 169.28 us, 0.04% latency, 50.74 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 91.55 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 946.28 us, 0.21% latency, 72.62 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.39 us, 0.05% latency, 137.78 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 311.37 us, 0.07% latency, 110.35 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (19): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.78 ms, 0.83% latency, 36.44 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 82.49 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.36 ms, 0.52% latency, 29.14 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 299.22 us, 0.07% latency, 86.12 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.06 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.05 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 603.68 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.69 us, 0.03% latency, 55.17 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 914.57 us, 0.20% latency, 75.14 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.62 us, 0.05% latency, 137.65 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 290.16 us, 0.06% latency, 118.42 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (20): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 23.72 ms, 5.19% latency, 5.8 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.59 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.38 ms, 0.52% latency, 28.94 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 299.93 us, 0.07% latency, 85.92 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.83 ms, 0.40% latency, 18.87 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 425.58 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 630.14 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 153.54 us, 0.03% latency, 55.95 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 85.12 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 20.85 ms, 4.56% latency, 3.3 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 20.17 ms, 4.41% latency, 1.7 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 298.74 us, 0.07% latency, 115.02 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (21): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.77 ms, 0.83% latency, 36.45 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.78 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.36 ms, 0.52% latency, 29.15 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 299.69 us, 0.07% latency, 85.99 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.81 ms, 0.40% latency, 19.07 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 426.29 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 595.57 us, 0.13% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 155.21 us, 0.03% latency, 55.34 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.68 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 913.14 us, 0.20% latency, 75.26 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 248.91 us, 0.05% latency, 138.04 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 292.06 us, 0.06% latency, 117.65 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (22): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 3.8 ms, 0.83% latency, 36.22 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.45 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 2.39 ms, 0.52% latency, 28.76 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 298.98 us, 0.07% latency, 86.19 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.84 ms, 0.40% latency, 18.76 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 432.49 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 617.27 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 159.03 us, 0.03% latency, 54.02 TFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 83.21 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 914.1 us, 0.20% latency, 75.18 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 249.62 us, 0.05% latency, 137.65 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 290.39 us, 0.06% latency, 118.32 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:        (23): ParallelTransformerLayer(
[default0]:          12.6 M, 3.54% Params, 68.72 GMACs, 3.69% MACs, 62.85 ms, 13.75% latency, 2.19 TFLOPS,
[default0]:          (input_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 80.59 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (self_attention): ParallelAttention(
[default0]:            4.2 M, 1.18% Params, 34.36 GMACs, 1.85% MACs, 61.44 ms, 13.44% latency, 1.12 TFLOPS,
[default0]:            (query_key_value): ColumnParallelLinear(3.15 M, 0.88% Params, 12.88 GMACs, 0.69% MACs, 309.94 us, 0.07% latency, 83.14 TFLOPS, )
[default0]:            (core_attention): CoreAttention(
[default0]:              0, 0.00% Params, 17.18 GMACs, 0.92% MACs, 1.85 ms, 0.41% latency, 18.62 TFLOPS,
[default0]:              (scale_mask_softmax): FusedScaleMaskSoftmax(0, 0.00% Params, 0 MACs, 0.00% MACs, 424.15 us, 0.09% latency, 0.0 FLOPS, )
[default0]:              (attention_dropout): Dropout(0, 0.00% Params, 0 MACs, 0.00% MACs, 642.3 us, 0.14% latency, 0.0 FLOPS, p=0.1, inplace=False)
[default0]:            )
[default0]:            (dense): RowParallelLinear(1.05 M, 0.29% Params, 4.29 GMACs, 0.23% MACs, 59.18 ms, 12.95% latency, 145.15 GFLOPS, )
[default0]:          )
[default0]:          (post_attention_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 85.59 us, 0.02% latency, 0.0 FLOPS, )
[default0]:          (mlp): ParallelMLP(
[default0]:            8.39 M, 2.36% Params, 34.36 GMACs, 1.85% MACs, 916.48 us, 0.20% latency, 74.98 TFLOPS,
[default0]:            (dense_h_to_4h): ColumnParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 252.49 us, 0.06% latency, 136.09 TFLOPS, )
[default0]:            (dense_4h_to_h): RowParallelLinear(4.2 M, 1.18% Params, 17.18 GMACs, 0.92% MACs, 289.2 us, 0.06% latency, 118.81 TFLOPS, )
[default0]:          )
[default0]:        )
[default0]:      )
[default0]:      (final_layernorm): MixedFusedLayerNorm(2.05 k, 0.00% Params, 0 MACs, 0.00% MACs, 81.54 us, 0.02% latency, 0.0 FLOPS, )
[default0]:    )
[default0]:  )
[default0]:)
[default0]:------------------------------------------------------------------------------
clumsy commented 1 year ago

@anon-nlp-dev that's not surprising — look at the model layer breakdown. Your model is fully dense: every block contains only a ParallelMLP, and there are no MoE layers.
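Coming back to the original question of a size formula: once the MoE layers do show up, a rough count can be sketched by hand. The following is a back-of-the-envelope estimate (not code from this repo) assuming a GPT-style model where every `moe_layer_interval`-th MLP is replaced by `num_experts` copies of the dense MLP plus a router; the function name and defaults are illustrative:

```python
def moe_param_count(hidden_size, num_layers, vocab_size, seq_len,
                    num_experts=1, moe_layer_interval=2):
    """Rough GPT-style parameter count where every `moe_layer_interval`-th
    MLP is replaced by `num_experts` expert copies plus a router (gate)."""
    # Token embeddings + learned position embeddings
    embed = vocab_size * hidden_size + seq_len * hidden_size
    # Per-layer attention: fused QKV projection + output projection, with biases
    attn = 4 * hidden_size * hidden_size + 4 * hidden_size
    # Per-layer dense MLP: h->4h and 4h->h projections, with biases
    mlp = 8 * hidden_size * hidden_size + 5 * hidden_size
    # Two LayerNorms per layer (weight + bias each)
    ln = 4 * hidden_size
    num_moe_layers = num_layers // moe_layer_interval
    num_dense_layers = num_layers - num_moe_layers
    dense_part = num_layers * (attn + ln) + num_dense_layers * mlp
    # Each MoE layer: num_experts expert MLPs + a hidden_size x num_experts gate
    moe_part = num_moe_layers * (num_experts * mlp + hidden_size * num_experts)
    final_ln = 2 * hidden_size
    return embed + dense_part + moe_part + final_ln
```

With `hidden_size=1024` the per-layer dense term (`attn + mlp + ln`) comes out to roughly 12.6 M, which matches the per-layer figure in the profiler output above; with `num_experts=1` the formula collapses to (approximately) the dense model size that `training.py` reports.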

anon-nlp-dev commented 1 year ago

@clumsy Yes, that seems to be the case. I wonder what I got wrong for the model not to configure the MoE layers. I have the following argument values:

[default0]:------------------------ arguments ------------------------
[default0]:  accumulate_allreduce_grads_in_fp32 .............. False
[default0]:  adam_beta1 ...................................... 0.9
[default0]:  adam_beta2 ...................................... 0.95
[default0]:  adam_eps ........................................ 1e-08
[default0]:  add_bias_linear ................................. True
[default0]:  add_position_embedding .......................... True
[default0]:  adlr_autoresume ................................. False
[default0]:  adlr_autoresume_interval ........................ 1000
[default0]:  aml_data_download_path .......................... None
[default0]:  apply_layernorm_1p .............................. False
[default0]:  apply_query_key_layer_scaling ................... True
[default0]:  apply_residual_connection_post_layernorm ........ False
[default0]:  async_tensor_model_parallel_allreduce ........... False
[default0]:  attention_dropout ............................... 0.1
[default0]:  attention_softmax_in_fp32 ....................... False
[default0]:  barrier_with_L1_time ............................ True
[default0]:  bert_binary_head ................................ True
[default0]:  bert_embedder_type .............................. megatron
[default0]:  bert_load ....................................... None
[default0]:  bf16 ............................................ False
[default0]:  bias_dropout_fusion ............................. True
[default0]:  bias_gelu_fusion ................................ True
[default0]:  biencoder_projection_dim ........................ 0
[default0]:  biencoder_shared_query_context_model ............ False
[default0]:  block_data_path ................................. None
[default0]:  checkpoint_activations .......................... True
[default0]:  checkpoint_in_cpu ............................... False
[default0]:  checkpoint_num_layers ........................... 1
[default0]:  classes_fraction ................................ 1.0
[default0]:  clip_grad ....................................... 1.0
[default0]:  compression_training ............................ False
[default0]:  consumed_train_samples .......................... 0
[default0]:  consumed_train_tokens ........................... 0
[default0]:  consumed_valid_samples .......................... 0
[default0]:  contigious_checkpointing ........................ False
[default0]:  cpu_optimizer ................................... False
[default0]:  cpu_torch_adam .................................. False
[default0]:  create_moe_param_group .......................... True
[default0]:  curriculum_learning_legacy ...................... False
[default0]:  data_cache_path ................................. None
[default0]:  data_efficiency_curriculum_learning ............. False
[default0]:  data_impl ....................................... mmap
[default0]:  data_parallel_random_init ....................... False
[default0]:  data_parallel_size .............................. 4
[default0]:  data_path ....................................... ['data/meg-gpt2-oscar-en-10k_text_document']
[default0]:  data_per_class_fraction ......................... 1.0
[default0]:  data_sharding ................................... True
[default0]:  dataloader_type ................................. single
[default0]:  DDP_impl ........................................ local
[default0]:  decoder_num_layers .............................. None
[default0]:  decoder_seq_length .............................. None
[default0]:  deepscale ....................................... False
[default0]:  deepscale_config ................................ None
[default0]:  deepspeed ....................................... True
[default0]:  deepspeed_activation_checkpointing .............. True
[default0]:  deepspeed_config ................................ ds_config_gpt_gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true.json
[default0]:  deepspeed_mpi ................................... False
[default0]:  dino_bottleneck_size ............................ 256
[default0]:  dino_freeze_last_layer .......................... 1
[default0]:  dino_head_hidden_size ........................... 2048
[default0]:  dino_local_crops_number ......................... 10
[default0]:  dino_local_img_size ............................. 96
[default0]:  dino_norm_last_layer ............................ False
[default0]:  dino_teacher_temp ............................... 0.07
[default0]:  dino_warmup_teacher_temp ........................ 0.04
[default0]:  dino_warmup_teacher_temp_epochs ................. 30
[default0]:  distribute_checkpointed_activations ............. False
[default0]:  distribute_saved_activations .................... False
[default0]:  distributed_backend ............................. nccl
[default0]:  distributed_timeout_minutes ..................... 10
[default0]:  ds_inference .................................... False
[default0]:  ds_pipeline_enabled ............................. False
[default0]:  embedding_path .................................. None
[default0]:  embedding_weights_in_fp32 ....................... False
[default0]:  empty_unused_memory_level ....................... 0
[default0]:  enable_expert_tensor_parallelism ................ False
[default0]:  encoder_num_layers .............................. 24
[default0]:  encoder_seq_length .............................. 2048
[default0]:  end_weight_decay ................................ 0.1
[default0]:  eod_mask_loss ................................... False
[default0]:  eval_interval ................................... 100
[default0]:  eval_iters ...................................... 50
[default0]:  evidence_data_path .............................. None
[default0]:  exit_duration_in_mins ........................... 30000000
[default0]:  exit_interval ................................... None
[default0]:  exit_on_missing_checkpoint ...................... False
[default0]:  exit_signal_handler ............................. False
[default0]:  expert_interval ................................. 2
[default0]:  ffn_hidden_size ................................. 4096
[default0]:  finetune ........................................ False
[default0]:  fp16 ............................................ True
[default0]:  fp16_lm_cross_entropy ........................... False
[default0]:  fp32_residual_connection ........................ False
[default0]:  fp8_amax_compute_algo ........................... most_recent
[default0]:  fp8_amax_history_len ............................ 1
[default0]:  fp8_e4m3 ........................................ False
[default0]:  fp8_hybrid ...................................... False
[default0]:  fp8_interval .................................... 1
[default0]:  fp8_margin ...................................... 0
[default0]:  fp8_wgrad ....................................... True
[default0]:  global_batch_size ............................... 8
[default0]:  gradient_accumulation_fusion .................... True
[default0]:  head_lr_mult .................................... 1.0
[default0]:  hidden_dropout .................................. 0.1
[default0]:  hidden_size ..................................... 1024
[default0]:  hidden_size_teacher ............................. None
[default0]:  hysteresis ...................................... 2
[default0]:  ict_head_size ................................... None
[default0]:  ict_load ........................................ None
[default0]:  img_h ........................................... 224
[default0]:  img_w ........................................... 224
[default0]:  indexer_batch_size .............................. 128
[default0]:  indexer_log_interval ............................ 1000
[default0]:  inference ....................................... False
[default0]:  inference_batch_times_seqlen_threshold .......... 512
[default0]:  init_method_std ................................. 0.01
[default0]:  init_method_xavier_uniform ...................... False
[default0]:  initial_loss_scale .............................. 4294967296
[default0]:  iter_per_epoch .................................. 1250
[default0]:  kd .............................................. False
[default0]:  kd_alpha_ce ..................................... 1
[default0]:  kd_beta_ce ...................................... 1
[default0]:  kd_temp ......................................... 1.0
[default0]:  kv_channels ..................................... 64
[default0]:  layernorm_epsilon ............................... 1e-05
[default0]:  lazy_mpu_init ................................... None
[default0]:  load ............................................ output/checkpoint/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true
[default0]:  load_teacher .................................... None
[default0]:  local_rank ...................................... None
[default0]:  log_batch_size_to_tensorboard ................... True
[default0]:  log_interval .................................... 5
[default0]:  log_learning_rate_to_tensorboard ................ True
[default0]:  log_loss_scale_to_tensorboard ................... True
[default0]:  log_memory_to_tensorboard ....................... False
[default0]:  log_num_zeros_in_grad ........................... False
[default0]:  log_optimizer_states_to_tensorboard ............. False
[default0]:  log_params_norm ................................. False
[default0]:  log_timers_to_tensorboard ....................... True
[default0]:  log_validation_ppl_to_tensorboard ............... True
[default0]:  log_world_size_to_tensorboard ................... False
[default0]:  loss_scale ...................................... None
[default0]:  loss_scale_window ............................... 1000
[default0]:  lr .............................................. 0.0002
[default0]:  lr_decay_iters .................................. None
[default0]:  lr_decay_samples ................................ None
[default0]:  lr_decay_style .................................. cosine
[default0]:  lr_decay_tokens ................................. 300000000000
[default0]:  lr_warmup_fraction .............................. None
[default0]:  lr_warmup_iters ................................. 0
[default0]:  lr_warmup_samples ............................... 0
[default0]:  lr_warmup_tokens ................................ 375000000
[default0]:  make_vocab_size_divisible_by .................... 128
[default0]:  mask_factor ..................................... 1.0
[default0]:  mask_prob ....................................... 0.15
[default0]:  mask_type ....................................... random
[default0]:  masked_softmax_fusion ........................... True
[default0]:  max_position_embeddings ......................... 2048
[default0]:  max_tokens_to_oom ............................... 12000
[default0]:  memory_centric_tiled_linear ..................... False
[default0]:  merge_file ...................................... data/gpt2-merges.txt
[default0]:  micro_batch_size ................................ 2
[default0]:  min_loss_scale .................................. 1.0
[default0]:  min_lr .......................................... 2e-06
[default0]:  mlp_type ........................................ standard
[default0]:  mmap_warmup ..................................... False
[default0]:  moe_eval_capacity_factor ........................ 1.0
[default0]:  moe_expert_parallel_size ........................ 4
[default0]:  moe_loss_coeff .................................. 0.01
[default0]:  moe_min_capacity ................................ 4
[default0]:  moe_token_dropping .............................. True
[default0]:  moe_train_capacity_factor ....................... 1.0
[default0]:  mos ............................................. False
[default0]:  no_load_lr_state ................................ False
[default0]:  no_load_optim ................................... None
[default0]:  no_load_rng ..................................... None
[default0]:  no_persist_layer_norm ........................... False
[default0]:  no_pipeline_parallel ............................ True
[default0]:  no_save_optim ................................... None
[default0]:  no_save_rng ..................................... None
[default0]:  normalization ................................... layernorm
[default0]:  num_attention_heads ............................. 16
[default0]:  num_attention_heads_teacher ..................... None
[default0]:  num_channels .................................... 3
[default0]:  num_classes ..................................... 1000
[default0]:  num_experts ..................................... [128]
[default0]:  num_experts_switch .............................. None
[default0]:  num_experts_teacher ............................. [1]
[default0]:  num_layers ...................................... 24
[default0]:  num_layers_per_virtual_pipeline_stage ........... None
[default0]:  num_layers_teacher .............................. None
[default0]:  num_workers ..................................... 0
[default0]:  onnx_safe ....................................... None
[default0]:  openai_gelu ..................................... False
[default0]:  optimizer ....................................... adam
[default0]:  output_bert_embeddings .......................... False
[default0]:  overlap_p2p_comm ................................ False
[default0]:  override_opt_param_scheduler .................... True
[default0]:  params_dtype .................................... torch.float16
[default0]:  partition_activations ........................... False
[default0]:  patch_dim ....................................... 16
[default0]:  perform_initialization .......................... True
[default0]:  pipeline_model_parallel_size .................... 1
[default0]:  pipeline_model_parallel_split_rank .............. None
[default0]:  profile_backward ................................ False
[default0]:  query_in_block_prob ............................. 0.1
[default0]:  rampup_batch_size ............................... None
[default0]:  random_ltd ...................................... False
[default0]:  rank ............................................ 0
[default0]:  recompute_granularity ........................... None
[default0]:  recompute_method ................................ None
[default0]:  recompute_num_layers ............................ 1
[default0]:  remote_device ................................... none
[default0]:  reset_attention_mask ............................ False
[default0]:  reset_iteration ................................. False
[default0]:  reset_position_ids .............................. False
[default0]:  retriever_report_topk_accuracies ................ []
[default0]:  retriever_score_scaling ......................... False
[default0]:  retriever_seq_length ............................ 256
[default0]:  retro_add_retriever ............................. False
[default0]:  retro_cyclic_train_iters ........................ None
[default0]:  retro_encoder_attention_dropout ................. 0.1
[default0]:  retro_encoder_hidden_dropout .................... 0.1
[default0]:  retro_encoder_layers ............................ 2
[default0]:  retro_num_neighbors ............................. 2
[default0]:  retro_num_retrieved_chunks ...................... 2
[default0]:  retro_return_doc_ids ............................ False
[default0]:  retro_workdir ................................... None
[default0]:  return_data_index ............................... False
[default0]:  rotary_percent .................................. 1.0
[default0]:  sample_rate ..................................... 1.0
[default0]:  save ............................................ output/checkpoint/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true
[default0]:  save_interval ................................... 100
[default0]:  scatter_gather_tensors_in_pipeline .............. True
[default0]:  scattered_embeddings ............................ False
[default0]:  seed ............................................ 1234
[default0]:  seq_length ...................................... 2048
[default0]:  sequence_parallel ............................... False
[default0]:  sgd_momentum .................................... 0.9
[default0]:  short_seq_prob .................................. 0.1
[default0]:  skip_train ...................................... False
[default0]:  split ........................................... 98,2,0
[default0]:  split_transformers .............................. False
[default0]:  squared_relu .................................... False
[default0]:  standalone_embedding_stage ...................... False
[default0]:  start_weight_decay .............................. 0.1
[default0]:  swiglu .......................................... False
[default0]:  swin_backbone_type .............................. tiny
[default0]:  synchronize_each_layer .......................... False
[default0]:  tensor_model_parallel_size ...................... 1
[default0]:  tensorboard_dir ................................. output/tensorboard/gpt-0.35B-lr-2.0e-4-minlr-2e-06-bs-8-gpus-4-mp-1-pp-1-ep-128-mlc-0.01-cap-1.0-drop-true_n2dgx01_2023.08.02-17.39.02
[default0]:  tensorboard_log_interval ........................ 1
[default0]:  tensorboard_queue_size .......................... 1
[default0]:  test_data_path .................................. None
[default0]:  tile_factor ..................................... 1
[default0]:  timing_log_level ................................ 0
[default0]:  timing_log_option ............................... minmax
[default0]:  titles_data_path ................................ None
[default0]:  tokenizer_model ................................. None
[default0]:  tokenizer_type .................................. GPT2BPETokenizer
[default0]:  topk ............................................ 1
[default0]:  train_data_exact_num_epochs ..................... None
[default0]:  train_data_path ................................. None
[default0]:  train_desc_path ................................. None
[default0]:  train_doc_idx_path .............................. None
[default0]:  train_idx_path .................................. None
[default0]:  train_iters ..................................... 10
[default0]:  train_sample_idx_path ........................... None
[default0]:  train_samples ................................... None
[default0]:  train_shuffle_idx_path .......................... None
[default0]:  train_tokens .................................... 300000000000
[default0]:  transformer_impl ................................ local
[default0]:  transformer_pipeline_model_parallel_size ........ 1
[default0]:  untie_embeddings_and_output_weights ............. False
[default0]:  use_checkpoint_args ............................. False
[default0]:  use_checkpoint_opt_param_scheduler .............. False
[default0]:  use_contiguous_buffers_in_local_ddp ............. True
[default0]:  use_cpu_initialization .......................... None
[default0]:  use_distributed_optimizer ....................... False
[default0]:  use_flash_attn .................................. False
[default0]:  use_one_sent_docs ............................... False
[default0]:  use_pin_memory .................................. False
[default0]:  use_ring_exchange_p2p ........................... False
[default0]:  use_rotary_position_embeddings .................. False
[default0]:  use_tutel ....................................... False
[default0]:  valid_data_path ................................. None
[default0]:  variable_seq_lengths ............................ False
[default0]:  virtual_pipeline_model_parallel_size ............ None
[default0]:  vision_backbone_type ............................ vit
[default0]:  vision_pretraining .............................. False
[default0]:  vision_pretraining_type ......................... classify
[default0]:  vocab_extra_ids ................................. 0
[default0]:  vocab_file ...................................... data/gpt2-vocab.json
[default0]:  vocab_size ...................................... None
[default0]:  weight_decay .................................... 0.1
[default0]:  weight_decay_incr_style ......................... constant
[default0]:  world_size ...................................... 4
[default0]:  zero_allgather_bucket_size ...................... 0.0
[default0]:  zero_contigious_gradients ....................... False
[default0]:  zero_reduce_bucket_size ......................... 0.0
[default0]:  zero_reduce_scatter ............................. False
[default0]:  zero_stage ...................................... 1.0
[default0]:-------------------- end of arguments ---------------------
anon-nlp-dev commented 1 year ago

@clumsy I think I found the root cause of my issue. The language model implementation expects two boolean arguments, `add_encoder` and `add_decoder`:

https://github.com/microsoft/Megatron-DeepSpeed/blob/0e9b76563bcbae38ada65b47cdbc35ec0f8816e2/megatron/model/language_model.py#L51-L56

The GPT implementation doesn't seem to pass those arguments, so the language model is built as encoder-only by default: https://github.com/microsoft/Megatron-DeepSpeed/blob/0e9b76563bcbae38ada65b47cdbc35ec0f8816e2/megatron/model/gpt_model.py#L75-L82

Also, when the encoder layers are built, no expert information is passed along: https://github.com/microsoft/Megatron-DeepSpeed/blob/0e9b76563bcbae38ada65b47cdbc35ec0f8816e2/megatron/model/language_model.py#L427-L435
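
The failure mode can be sketched in a few lines (hypothetical names, not the actual Megatron code): a builder accepts `num_experts` among its keyword arguments but never forwards it to the layer constructor, so every layer silently falls back to the dense default:

```python
# Minimal sketch of the bug described above (all names are illustrative).

def build_mlp_layer(hidden_size, ffn_hidden_size, num_experts=1):
    """Return ("moe", params) when num_experts > 1, else ("dense", params)."""
    if num_experts > 1:
        return ("moe", num_experts * 2 * hidden_size * ffn_hidden_size)
    return ("dense", 2 * hidden_size * ffn_hidden_size)

def build_encoder(hidden_size, ffn_hidden_size, num_layers, **kwargs):
    # Bug analogue: kwargs such as num_experts are accepted here but never
    # passed on, so build_mlp_layer always sees its dense default.
    return [build_mlp_layer(hidden_size, ffn_hidden_size)
            for _ in range(num_layers)]

layers = build_encoder(1024, 4096, num_layers=24, num_experts=128)
print(layers[0][0])  # prints "dense" — the expert count was silently dropped
```

The caller believes it requested 128 experts, but nothing in the call chain ever sees that value.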

Edit: Looking at https://github.com/NVIDIA/Megatron-LM/issues/439, an encoder-only GPT implementation seems to be fine, as long as a causal attention mask is used. However, no expert information is passed to the encoder, so the number of experts keeps its default value:

https://github.com/microsoft/Megatron-DeepSpeed/blob/0e9b76563bcbae38ada65b47cdbc35ec0f8816e2/megatron/model/transformer.py#L1364-L1376

This basically means that none of the MoE examples here actually build any MoE layers: https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/MoE
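
For the size-estimation question in the original post, here's a back-of-the-envelope sketch (a simplification: biases, layernorms, and the small gating network are ignored, and `estimate_params` is a hypothetical helper, not a DeepSpeed API). It assumes every `expert_interval`-th FFN is replaced by an MoE layer, which adds `(num_experts - 1)` extra copies of that FFN's weights. With the 0.35B config from the log above (hidden 1024, 24 layers, 128 experts every other layer), it lands close to the 13B total quoted for the 350M+MoE-128 variant in the blog:

```python
# Rough MoE parameter-count estimate (not DeepSpeed's exact accounting).

def estimate_params(hidden, ffn_hidden, num_layers, vocab, seq_len,
                    num_experts=1, expert_interval=2):
    embed = vocab * hidden + seq_len * hidden       # token + position embeddings
    attn_per_layer = 4 * hidden * hidden            # QKV + output projections
    ffn_per_layer = 2 * hidden * ffn_hidden         # two FFN weight matrices
    dense = embed + num_layers * (attn_per_layer + ffn_per_layer)
    moe_layers = num_layers // expert_interval      # layers converted to MoE
    extra = moe_layers * (num_experts - 1) * ffn_per_layer
    return dense, dense + extra

dense, total = estimate_params(hidden=1024, ffn_hidden=4096, num_layers=24,
                               vocab=50257, seq_len=2048, num_experts=128)
print(f"dense ~{dense/1e6:.0f}M, MoE total ~{total/1e9:.1f}B")
# → dense ~356M, MoE total ~13.1B
```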

anon-nlp-dev commented 1 year ago

Looks like this got fixed here: https://github.com/microsoft/Megatron-DeepSpeed/pull/192