kojimano / Megatron-DeepSpeed-ABCI

ABCI GPU Benchmark #16

kojimano closed this issue 1 year ago

kojimano commented 1 year ago

Model benchmarking results

Overview

Model hyperparameters

Notations

Preliminary Experiments

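| #GPUs | #Layers | DP | MP | PP | MBS | GBS | SL | AC | Max Mem (allocated) | Max Mem (reserved) | Sec/it | TFLOPs | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 4 | 1 | 2 | 2 | 1 | 8 | 1024 | Yes | 8584 MiB | 9936 MiB | 0.5 | 45.63 | 4/28 |
| 4 | 4 | 1 | 2 | 2 | 1 | 8 | 1024 | No | 8585 MiB | 10278 MiB | 0.44 | 45.09 | 4/28 |
| 4 | 2 | 1 | 2 | 2 | 1 | 8 | 2048 | Yes | 4458 MiB | 5336 MiB | 0.6 | 47.8 | 4/28 |
| 4 | 4 | 1 | 2 | 2 | 1 | 8 | 2048 | Yes | 8525 MiB | 10142 MiB | 0.97 | 51.64 | 4/28 |
| 4 | 4 | 1 | 1 | 4 | 1 | 8 | 2048 | Yes | 6057 / 10970 MiB (OOM) | 7980 / 13278 MiB (OOM) | - | 43.7 | 4/28 |
| 4 | 2 | 1 | 4 | 1 | 1 | 8 | 2048 | Yes | 4458 MiB | 4458 MiB | 0.6 | 47.5 | 4/28 |
| 4 | 4 | 1 | 4 | 1 | 1 | 8 | 2048 | No | 7462 MiB | 9236 MiB | 0.8 | 44.6 | 4/28 |
| 4 | 4 | 1 | 4 | 1 | 1 | 8 | 2048 | Yes | 7463 MiB | 8134 MiB | 1.0 | 47.0 | 4/28 |
| 4 | 4 | 1 | 4 | 1 | 2 | 8 | 2048 | Yes | 7462 MiB | 8528 MiB | 0.8 | 60.9 | 4/28 |
| 4 | 4 | 1 | 4 | 1 | 4 | 8 | 2048 | Yes | 7479 MiB | 8890 MiB | 0.8 | 60.9 | 4/28 |
| 4 | 4 | 1 | 4 | 1 | 4 | 8 | 2048 | No | 11793 MiB | 13516 MiB | 0.6 | 57.9 | 4/28 |
| 4 | 6 | 4 | 1 | 1 | 1 | 8 | 2048 | Yes | 10467 MiB (OOM) | 11272 MiB (OOM) | - | - | 4/28 |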
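
The Sec/it column can be converted into token throughput, which makes rows with different sequence lengths easier to compare. The following is a minimal sketch, assuming one iteration consumes GBS × SL tokens; the function name `tokens_per_second` is illustrative and not part of the benchmark scripts.

```python
def tokens_per_second(gbs: int, seq_len: int, sec_per_it: float) -> float:
    """Tokens processed per second, assuming one iteration consumes gbs * seq_len tokens."""
    if sec_per_it <= 0:
        raise ValueError("sec_per_it must be positive")
    return gbs * seq_len / sec_per_it


# (GBS, SL, Sec/it) values copied from three rows of the table above.
rows = [
    (8, 1024, 0.5),
    (8, 2048, 0.97),
    (8, 2048, 0.8),
]

for gbs, sl, sec in rows:
    rate = tokens_per_second(gbs, sl, sec)
    print(f"GBS={gbs} SL={sl} Sec/it={sec}: {rate:,.0f} tokens/s")
```

Because Sec/it is reported to one or two significant digits, the derived throughput carries the same rounding error.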

Memory usage seems to increase after logging?

Experiments-1

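| #GPUs | Size | DP | MP | PP | MBS | GBS | SL | AC | Zero | Max Mem (allocated) | Max Mem (reserved) | TFLOPs | Sec/it | Est. Aggr. PetaFLOPs | B tokens | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 10B | 1 | 4 | 8 | 1 | 90 | 1024 | No | 1 | OOM | OOM | - | - | - | - | 4/28 |
| 32 | 10B | 1 | 4 | 8 | 1 | 90 | 2048 | Yes | 1 | - | - | 39.3 | 12.4 | - | 152 | 4/28 |
| 32 | 10B | 1 | 4 | 8 | 2 | 90 | 2048 | Yes | 1 | 7875 MiB | 8892 MiB | 40.1 | 12.2 | - | 155 | 4/28 |
| 32 | 10B | 1 | 4 | 8 | 4 | 90 | 2048 | Yes | 1 | - | - | - | - | - | - | 4/28 |
| 32 | 13B | 1 | 4 | 8 | 1 | 8 | 2048 | Yes | 1 | 7568 MiB | 8586 MiB | 23.5 | 2.3 | - | - | 4/28 |
| 32 | 13B | 1 | 4 | 8 | 1 | 512 | 2048 | Yes | 1 | 8966 MiB | 10100 MiB | 42.7 | 83.5 | - | - | 4/28 |
| 32 | 13B | 1 | 4 | 8 | 1 | 90 | 1024 | No | 1 | OOM | OOM | - | - | - | - | 4/28 |
| 32 | 13B | 1 | 4 | 8 | 1 | 90 | 2048 | Yes | 1 | 8964 MiB | 10124 MiB | 40.0 | 15.4 | - | 123 | 4/28 |
| 32 | 13B | 1 | 4 | 8 | 2 | 90 | 2048 | Yes | 1 | 9303 MiB | 10648 MiB | 48.7 | 12.8 | - | 148 | 4/28 |
| 32 | 13B | 1 | 4 | 8 | 4 | 88 | 2048 | Yes | 1 | 12243 MiB (OOM) | 14108 MiB (OOM) | 44.2 | 13.8 | - | - | 4/28 |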
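
With pipeline parallelism, each global batch is split into micro-batches, so the GBS and MBS columns together determine how many micro-batches flow through the pipeline per iteration. The sketch below assumes the usual Megatron-style relationship, micro-batches = GBS / (MBS × DP), with GBS required to be divisible by MBS × DP; the helper `num_microbatches` is hypothetical and only illustrates the arithmetic.

```python
def num_microbatches(gbs: int, mbs: int, dp: int) -> int:
    """Micro-batches per iteration, assuming GBS must divide evenly by MBS * DP."""
    per_step = mbs * dp
    if per_step <= 0:
        raise ValueError("mbs and dp must be positive")
    if gbs % per_step != 0:
        raise ValueError(f"GBS={gbs} is not divisible by MBS*DP={per_step}")
    return gbs // per_step


# (GBS, MBS) combinations that appear in the table above, all with DP=1.
for gbs, mbs in [(90, 1), (90, 2), (90, 4), (88, 4)]:
    try:
        count = num_microbatches(gbs, mbs, dp=1)
        print(f"GBS={gbs} MBS={mbs}: {count} micro-batches")
    except ValueError as err:
        print(f"GBS={gbs} MBS={mbs}: {err}")
```

Under this assumption GBS=90 does not divide evenly at MBS=4, which may be why the 10B row with MBS=4 has no results and the 13B row with MBS=4 uses GBS=88; the thread itself does not confirm the reason.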

DeepSpeed (Reduce PP bubble / disable activation checkpoints)

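| #GPUs | #Layers | DP | MP | PP | MBS | GBS | AC | Zero | Max Mem (allocated) | Max Mem (reserved) | TFLOPs | Sec/it | B tokens | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 10 | 4 | 1 | 1 | 1 | 88 | Yes | None | 7540 MiB | 9116 MiB | 43.2 | 1.2 | - | 5/2 |
| 32 | 10 | 4 | 1 | 1 | 1 | 88 | Yes | 1 | 5050 MiB | - | 43.1 | - | 1.2 | 5/2 |
| 32 | 10 | 4 | 1 | 1 | 1 | 88 | Yes | 2 | 5490 MiB | - | 42.9 | 1.2 | - | 5/2 |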
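
The three rows above compare ZeRO disabled, stage 1, and stage 2. As a rough guide to what each stage is expected to save, the sketch below follows the model-state accounting from the ZeRO paper for mixed-precision Adam: 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter, with stage 1 partitioning optimizer state and stage 2 also partitioning gradients across the DP group. The parameter count used here is purely illustrative because the thread does not give one, and the function name is hypothetical. Activations, communication buffers, and allocator fragmentation are not modeled, so the output is not expected to match the measured Max Mem values.

```python
MIB = 2 ** 20


def model_state_bytes_per_gpu(num_params: int, dp: int, zero_stage: int = 0) -> float:
    """Per-GPU bytes for params, grads and Adam state under ZeRO stage 0, 1 or 2."""
    if zero_stage not in (0, 1, 2):
        raise ValueError("this sketch only covers ZeRO stages 0, 1 and 2")
    if dp < 1:
        raise ValueError("dp must be >= 1")
    params = 2.0 * num_params   # fp16 parameters
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 params, momentum and variance
    if zero_stage >= 1:
        optim /= dp             # stage 1 partitions optimizer state
    if zero_stage >= 2:
        grads /= dp             # stage 2 also partitions gradients
    return params + grads + optim


# Illustrative only: 1e9 parameters (an assumption), DP=4 as in the table above.
for stage in (0, 1, 2):
    mib = model_state_bytes_per_gpu(1_000_000_000, dp=4, zero_stage=stage) / MIB
    print(f"ZeRO stage {stage}: {mib:,.0f} MiB of model state per GPU")
```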

Activation Partitioning and Activation Checkpointing Chunks

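| #GPUs | Size | DP | MP | PP | MBS | GBS | SL | AC | AC chunk | DAC | Max Mem (allocated) | Max Mem (reserved) | TFLOPs | Sec/it | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 10B (6 layers) | 1 | 4 | 1 | 1 | 88 | 2048 | No | - | No | 7758 MiB | 8732 MiB | 46.25 | 8.5 | 4/28 |
| 4 | 10B (6 layers) | 1 | 4 | 1 | 2 | 88 | 2048 | No | - | No | 10614 MiB (OOM) | 11858 MiB (OOM) | 49.75 | 7.9 | 4/28 |
| 4 | 10B (6 layers) | 1 | 4 | 1 | 1 | 88 | 2048 | Yes | 1 | No | 6931 MiB | 7162 MiB | 46.36 | 11.3 | 4/28 |
| 4 | 10B (6 layers) | 1 | 4 | 1 | 2 | 88 | 2048 | Yes | 1 | No | 6931 MiB | 7538 MiB | 50.33 | 10.4 | 4/28 |
| 4 | 10B (6 layers) | 1 | 4 | 1 | 2 | 88 | 2048 | Yes | 1 | Yes | 6979 MiB | 7242 MiB | 49.9 | 10.5 | 4/28 |
| 4 | 10B (6 layers) | 1 | 4 | 1 | 4 | 88 | 2048 | Yes | 1 | Yes | 7027 MiB | 8808 MiB | 53.05 | 9.9 | 4/28 |
| 4 | 10B (6 layers) | 1 | 4 | 1 | 8 | 88 | 2048 | Yes | 1 | Yes | 7124 MiB | 10592 MiB | 53.26 | 9.9 | 4/28 |
| 4 | 10B (6 layers) | 1 | 4 | 1 | 2 | 88 | 2048 | Yes | 2 | Yes | - | - | - | - | bug did not work ... |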

Notes

kojimano commented 1 year ago

Interleaved Pipeline + Scatter Gather Ops

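| #GPUs | Size | DP | MP | PP | MBS | GBS | SL | Scattered | Interleaved | AC/DAC | Max Mem (allocated) | Max Mem (reserved) | TFLOPs | Sec/it | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 10.1B | 1 | 4 | 8 | 1 | 88 | 2048 | Yes | No | Yes | 7122 MiB | 7370 MiB | 39.4 | 12.1 | 4/28 |
| 32 | 10.1B | 1 | 4 | 8 | 2 | 88 | 2048 | Yes | No | Yes | 7251 MiB | 7620 MiB | 40.2 | 11.8 | 4/28 |
| 32 | 10.1B | 1 | 4 | 8 | 4 | 88 | 2048 | Yes | No | Yes | 7731 MiB | 9296 MiB | 37.2 | 12.8 | 4/28 |
| 32 | 10.1B | 1 | 4 | 8 | 8 | 88 | 2048 | Yes | No | Yes | 7698 MiB | 10056 MiB | 37.3 | 12.8 | 4/28 |
| 32 | 10.1B | 1 | 4 | 8 | 2 | 80 | 2048 | Yes | 3 | Yes | 7347 MiB | 7938 MiB | 41.3 | 10.5 | 4/28 |
| 32 | 10.1B | 1 | 4 | 8 | 1 | 96 | 2048 | Yes | 2 | Yes | 8532 MiB | 9020 MiB | 41.5 | 12.5 | 4/28 |
| 32 | 10.1B | 1 | 4 | 8 | 1 | 96 | 2048 | No | 2 | Yes | 8532 MiB | 9020 MiB | 35.8 | 14.5 | 4/28 |
| 32 | 10.1B | 1 | 4 | 8 | 2 | 96 | 2048 | Yes | 2 | Yes | 7107 MiB | 7782 MiB | 38.7 | 13.4 | 4/28 |
| 32 | 13B | 1 | 4 | 8 | 2 | 96 | 2048 | Yes | 2 | Yes | - | - | - | - | 4/28 |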
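
For context on what the interleaved schedule is meant to buy: in the idealized analysis from the Megatron-LM pipeline work (equal time per micro-batch on every stage, no communication cost), the pipeline bubble relative to ideal compute time is (p − 1) / m for p pipeline stages and m micro-batches, and it shrinks by a factor of v when each device holds v model chunks. The sketch below only evaluates that expression. It assumes the "Interleaved" column gives v, which the thread does not state, and the function name is hypothetical.

```python
def bubble_overhead(pp: int, microbatches: int, chunks: int = 1) -> float:
    """Idealized bubble time divided by ideal compute time: (pp - 1) / (chunks * microbatches)."""
    if pp < 1 or microbatches < 1 or chunks < 1:
        raise ValueError("pp, microbatches and chunks must all be >= 1")
    return (pp - 1) / (chunks * microbatches)


# PP=8 with GBS=96, MBS=1, DP=1 gives 96 micro-batches per iteration.
pp, microbatches = 8, 96
for chunks in (1, 2):
    overhead = bubble_overhead(pp, microbatches, chunks)
    print(f"chunks={chunks}: idealized bubble overhead = {overhead:.4f}")
```

The measured TFLOPs also depend on the extra communication that interleaving introduces, so this model gives an upper bound on the benefit, not a prediction of the numbers in the table.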