Closed: frankschae closed this issue 2 months ago.
This is what I get for the slightly more exhaustive search:
Benchmark started on 2024-08-08 07:11:18
** Command line:
/usr/bin/python /workspace/mamf-finder.py --m_range 0 5376 256 --n_range 0 5376 256 --k_range 0 5376 256 --output_file=2024-08-08-07:11:16.txt
** Dtype: torch.bfloat16
** Platform/Device info:
Linux cc2be45bb63b 6.5.0-1019-nvidia-64k #19-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 12:54:40 UTC 2024 aarch64 aarch64
_CudaDeviceProperties(name='NVIDIA GH200 480GB', major=9, minor=0, total_memory=96768MB, multi_processor_count=132)
** Critical software versions:
torch=2.4.0a0+3bcc3cddb5.nv24.07
cuda=12.5
** Additional notes:
--------------------------------------------------------------------------------
The best outcome was 727.6TFLOPS @ 4864x4096x4352 (MxNxK) (tried 8000 shapes)
Elapsed time: 0:01:52
Script executed successfully.
Thank you, @frankschae!
But the 2nd result is worse than the first one. H* performs best at bigger dimensions - please see the H100 entry at https://github.com/stas00/ml-engineering/tree/master/compute/accelerator/benchmarks#examples-of-usage - surely the GH200 should be at least as fast as the H100, where I clocked 792.1 TFLOPS @ 6144x17920x2816.
Yup! I guess the second scan is also just not exhaustive enough(?). I started a scan with the third setting:
./mamf-finder.py --m_range 0 20480 256 --n_range 0 20480 256 --k_range 0 20480 256 --output_file=$(date +"%Y-%m-%d-%H:%M:%S").txt
It is still ongoing, but I already see a larger number. Currently the last line of the output file is:
192791 | 784.8 TFLOPS @ 7168x19456x16896 | best: 814.6 TFLOPS @ 7168x19456x16896 (MxNxK)
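For context, the shape counts reported by the script are consistent with a plain Cartesian sweep over the three ranges. Here's a minimal sketch of that enumeration - an assumption, not code copied from mamf-finder.py - where each `--X_range start stop step` expands to `range(step, stop, step)` when `start` is 0. This matches both counts in this thread: 20 values per dimension for `0 5376 256` gives 20**3 = 8000 shapes, and 79 values per dimension for `0 20480 256` gives 79**3 = 493039 shapes.

```python
from itertools import product

def expand(start, stop, step):
    # A 0-sized matrix dimension makes no sense, so start at `step` instead.
    return range(start if start > 0 else step, stop, step)

dims = expand(0, 20480, 256)            # 256, 512, ..., 20224 -> 79 values
shapes = list(product(dims, repeat=3))  # every (M, N, K) combination
print(len(shapes))                      # 493039, the scan's final shape count
```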
Much better - the GH200 has faster HBM, so we should expect higher matmul TFLOPS.
When it's finished, if it resonates, please make a PR to add a new entry at https://github.com/stas00/ml-engineering/tree/master/compute/accelerator#maximum-achievable-matmul-flops-comparison-table
The theory column is 989 (TFLOPS); the efficiency column is your number / 989 * 100.
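For example, a quick worked instance of that formula, using the best-so-far 814.6 TFLOPS from the ongoing scan above against the 989 TFLOPS theoretical peak:

```python
# Worked example of the efficiency column: measured / theory * 100.
measured_tflops = 814.6  # best-so-far from the scan above
theory_tflops = 989.0    # theoretical peak from the comparison table
print(f"{measured_tflops / theory_tflops * 100:.1f}%")  # -> 82.4%
```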
Here's the full-search GH200 result on my side:
best: 821.0 TFLOPS @ 11264x19712x1536 (MxNxK)
I just checked -- My scan is also done! Here are the numbers from my run:
493039 | 807.5 TFLOPS @ 12288x14336x15872 | best: 831.7 TFLOPS @ 12288x14336x15872 (MxNxK)
The best outcome was 831.7TFLOPS @ 12288x14336x15872 (MxNxK) (tried 493039 shapes)
Elapsed time: 2 days, 13:44:14
I'm not opening another PR as there is already @yaolu's PR https://github.com/stas00/ml-engineering/pull/60.
Btw, @stas00 do you want to change https://github.com/stas00/ml-engineering/blob/master/compute/accelerator/benchmarks/mamf-finder.py#L276C21-L276C48 so that it is outside the `if` condition, to print the current configuration `{tflops:6.1f} TFLOPS @ {cur_config:<20}`?
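For reference, here's a minimal sketch of what that change could look like (variable and function names are illustrative assumptions, not copied from mamf-finder.py): the per-shape line is printed unconditionally, while the `if` only updates the running best.

```python
# Hypothetical sketch of the suggested change; names are illustrative,
# not taken verbatim from mamf-finder.py.
best_tflops, best_config = 0.0, None

def report(tflops: float, m: int, n: int, k: int) -> None:
    global best_tflops, best_config
    cur_config = f"{m}x{n}x{k}"
    # Moved outside the `if`: print every tried configuration ...
    print(f"{tflops:6.1f} TFLOPS @ {cur_config:<20}")
    # ... while the condition only tracks the running best.
    if tflops > best_tflops:
        best_tflops, best_config = tflops, cur_config

report(784.8, 7168, 19456, 16896)
```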
This result is even better. Likely due to the cooling system efficiency? I noticed that my highest GPU temperature was around 76C.
That could be it! Scrolling back through my system logging info, it seems like my temperature was at 56C the entire time.
I added you to @yaolu's PR :)
You're absolutely correct - fixed in that same PR - thank you for noticing, @frankschae!
Resolved in https://github.com/stas00/ml-engineering/pull/60
Thanks for introducing the new performance metric! I'd like to contribute results for the GH200 chip. I ran the quick benchmark in a Docker container. If this looks good to you, I'm happy to run an exhaustive search.
2024-08-08-06:44:01.txt