stas00 / ml-engineering


MAMF - GH200 #58

Closed: frankschae closed this issue 2 months ago

frankschae commented 2 months ago

Thanks for introducing the new performance metric! I'd like to contribute results for the GH200 chip. I ran the quick run in a Docker container. If this looks good to you, I'm happy to run an exhaustive search.

sh runfile_docker.sh 

=============
== PyTorch ==
=============

NVIDIA Release 24.07 (build 100464920)
PyTorch Version 2.4.0a0+3bcc3cd

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.5 driver version 555.42.06 with kernel driver version 550.54.15.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
      detected.  Multi-node communication performance may be reduced.

Benchmark started on 2024-08-08 06:44:03

** Command line:
/usr/bin/python /workspace/mamf-finder.py --m_range 0 20480 256 --n 4096 --k 4096 --output_file=2024-08-08-06:44:01.txt

** Dtype: torch.bfloat16

** Platform/Device info:
Linux d509c6dead99 6.5.0-1019-nvidia-64k #19-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 12:54:40 UTC 2024 aarch64 aarch64
_CudaDeviceProperties(name='NVIDIA GH200 480GB', major=9, minor=0, total_memory=96768MB, multi_processor_count=132)

** Critical software versions:
torch=2.4.0a0+3bcc3cddb5.nv24.07
cuda=12.5

** Additional notes:

--------------------------------------------------------------------------------

The best outcome was 772.0TFLOPS @ 20224x4096x4096 (MxNxK) (tried 79 shapes)
Elapsed time: 0:00:07
Script executed successfully.

2024-08-08-06:44:01.txt
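
For context: the m_range/n_range/k_range arguments appear to follow Python's range(start, stop, step) convention, so this quick run sweeps M in steps of 256 while N and K stay fixed at 4096. A minimal sketch of the resulting shape list (assuming the script skips the zero-sized value, which matches the 79 shapes and the winning M=20224 reported above):

    # hypothetical reconstruction of the quick run's shape sweep
    ms = [m for m in range(0, 20480, 256) if m > 0]   # 256, 512, ..., 20224
    shapes = [(m, 4096, 4096) for m in ms]            # (M, N, K) triples
    print(len(shapes), max(shapes))                   # 79 shapes; largest M is 20224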

frankschae commented 2 months ago

This is what I get for the slightly more exhaustive search:

Benchmark started on 2024-08-08 07:11:18

** Command line:
/usr/bin/python /workspace/mamf-finder.py --m_range 0 5376 256 --n_range 0 5376 256 --k_range 0 5376 256 --output_file=2024-08-08-07:11:16.txt

** Dtype: torch.bfloat16

** Platform/Device info:
Linux cc2be45bb63b 6.5.0-1019-nvidia-64k #19-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 12:54:40 UTC 2024 aarch64 aarch64
_CudaDeviceProperties(name='NVIDIA GH200 480GB', major=9, minor=0, total_memory=96768MB, multi_processor_count=132)

** Critical software versions:
torch=2.4.0a0+3bcc3cddb5.nv24.07
cuda=12.5

** Additional notes:

--------------------------------------------------------------------------------

The best outcome was 727.6TFLOPS @ 4864x4096x4352 (MxNxK) (tried 8000 shapes)
Elapsed time: 0:01:52
Script executed successfully.

2024-08-08-07:11:16.txt
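
The shape count here is consistent with the same range convention: range(0, 5376, 256) yields 20 nonzero values per dimension, and 20**3 = 8000. A tiny sketch of that check (the zero-skipping is an assumption inferred from the reported counts):

    def num_shapes(start, stop, step):
        # count the nonzero values range(start, stop, step) yields
        return len([v for v in range(start, stop, step) if v > 0])

    print(num_shapes(0, 5376, 256) ** 3)    # 8000, matching "tried 8000 shapes"
    print(num_shapes(0, 20480, 256) ** 3)   # 493039, the size of a full 0..20480 sweep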

stas00 commented 2 months ago

Thank you, @frankschae!

But the 2nd result is worse than the first one. H* accelerators perform best at bigger dimensions; please see the H100 entry at https://github.com/stas00/ml-engineering/tree/master/compute/accelerator/benchmarks#examples-of-usage. Surely the GH200 should be at least as fast as the H100, which I clocked at 792.1 TFLOPS @ 6144x17920x2816.

frankschae commented 2 months ago

Yup! I guess the second scan was also just not very exhaustive. I started a scan with the third setting, ./mamf-finder.py --m_range 0 20480 256 --n_range 0 20480 256 --k_range 0 20480 256 --output_file=$(date +"%Y-%m-%d-%H:%M:%S").txt, which is still ongoing but already shows a larger number. Currently the last line of the output file is:

192791 |  784.8 TFLOPS @ 7168x19456x16896     | best:  814.6 TFLOPS @ 7168x19456x16896 (MxNxK)
stas00 commented 2 months ago

Much better - the H200 has faster HBM, so we should expect higher matmul TFLOPS.

When it finishes, if it resonates, please make a PR to add a new entry at https://github.com/stas00/ml-engineering/tree/master/compute/accelerator#maximum-achievable-matmul-flops-comparison-table

The theory column is 989; the efficiency column is your number / 989 * 100.
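
As a quick illustration of that calculation, using the in-progress best reported above (814.6 TFLOPS) as a placeholder (the final number may differ):

    peak_theory = 989.0   # BF16 theoretical peak TFLOPS from the comparison table
    measured = 814.6      # in-progress best from the scan above; a stand-in value
    print(f"efficiency: {measured / peak_theory * 100:.1f}%")   # ~82.4%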

yaolu commented 2 months ago

Here's the full-search result for the GH200 on my side:

best:  821.0 TFLOPS @ 11264x19712x1536 (MxNxK)
frankschae commented 2 months ago

I just checked -- my scan is also done! Here are the numbers from my run:

493039 |  807.5 TFLOPS @ 12288x14336x15872    | best:  831.7 TFLOPS @ 12288x14336x15872 (MxNxK)

The best outcome was 831.7TFLOPS @ 12288x14336x15872 (MxNxK) (tried 493039 shapes)
Elapsed time: 2 days, 13:44:14

I'm not opening another PR as there is already @yaolu's PR https://github.com/stas00/ml-engineering/pull/60.

Btw, @stas00, do you want to move https://github.com/stas00/ml-engineering/blob/master/compute/accelerator/benchmarks/mamf-finder.py#L276C21-L276C48 outside the if condition, so that the currently tried configuration is printed on every iteration?

 {tflops:6.1f} TFLOPS @ {cur_config:<20}
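
For clarity, a minimal sketch of the suggested change; the loop below is a hypothetical reconstruction of the code around mamf-finder.py line 276, not the actual script:

    import random

    def benchmark(config):
        # stand-in for the real matmul timing; returns TFLOPS
        return random.uniform(700.0, 835.0)

    configs = ["4096x4096x4096", "12288x14336x15872"]
    best_tflops, best_config = 0.0, ""
    for i, cur_config in enumerate(configs, start=1):
        tflops = benchmark(cur_config)
        if tflops > best_tflops:
            best_tflops, best_config = tflops, cur_config
            # before: the progress line lived inside this `if`, so it was
            # printed only when a new best was found
        # after: print on every iteration, so the shape currently being
        # tried is always visible
        print(f"{i} | {tflops:6.1f} TFLOPS @ {cur_config:<20} | "
              f"best: {best_tflops:6.1f} TFLOPS @ {best_config} (MxNxK)")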
yaolu commented 2 months ago

> I just checked -- my scan is also done! Here are the numbers from my run:
>
> The best outcome was 831.7TFLOPS @ 12288x14336x15872 (MxNxK) (tried 493039 shapes)
> Elapsed time: 2 days, 13:44:14

This result is even better. Likely due to the cooling system efficiency? I noticed that my highest GPU temperature was around 76C.

frankschae commented 2 months ago

That could be it! Scrolling back through my system logs, it seems like my temperature was at 56 C the entire time.

stas00 commented 2 months ago

> I'm not opening another PR as there is already @yaolu's PR #60.

I added you to @yaolu's PR :)

> Btw, @stas00, do you want to move https://github.com/stas00/ml-engineering/blob/master/compute/accelerator/benchmarks/mamf-finder.py#L276C21-L276C48 outside the if condition, so that the currently tried configuration is printed on every iteration?
>
>  {tflops:6.1f} TFLOPS @ {cur_config:<20}

You're absolutely correct - fixed in that same PR - thank you for noticing, @frankschae!

stas00 commented 2 months ago

resolved in https://github.com/stas00/ml-engineering/pull/60