Add gemm-flops support of Ada Lovelace (L4, L40, L40s), compute capability: 8.9

What's the issue, what's expected?: I ran superbenchmark on a server with an NVIDIA L40, and the gemm-flops benchmark failed with the error message "Unsupported architecture". The L40 and L4 are CUDA-capable NVIDIA GPUs with compute capability 8.9, as listed at https://developer.nvidia.com/cuda-gpus

How to reproduce it?: sb run -f local.ini -c gemm-flops.yaml where gemm-flops.yaml is default.yaml with enable: ['gemm-flops'] and proc_num: 1

Log message or snapshot?:

[2024-05-23 16:39:42,832 l40-server:365][executor.py:248][INFO] Executor is going to execute gemm-flops.
[2024-05-23 16:39:43,450 l40-server:365][cuda_gemm_flops_performance.py:77][ERROR] Unsupported architecture - benchmark: gemm-flops, compute capability: 8.9, supports 7.0 7.5 8.0 8.6 9.0
[2024-05-23 16:39:43,450 l40-server:365][executor.py:133][INFO] benchmark: gemm-flops, return code: 34, result: {'return_code': [34]}.
[2024-05-23 16:39:43,450 l40-server:365][executor.py:140][ERROR] Executor failed in gemm-flops.

Additional information:

$ nvidia-smi --query-gpu=compute_cap --format=csv
compute_cap
8.9
$ nvidia-smi --query-gpu=gpu_name --format=csv
name
NVIDIA L40
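
The check that fails can be reproduced outside the benchmark with a small script. This is only a sketch: the helper names `parse_compute_caps` and `find_unsupported` are made up for illustration, and the supported list is copied from the error message in the log above, not from the superbenchmark source.

```python
import csv
import io

# Supported compute capabilities, copied from the "supports ..." part of the
# error message in the log above (not read from the superbenchmark source).
SUPPORTED_CAPS = ["7.0", "7.5", "8.0", "8.6", "9.0"]


def parse_compute_caps(csv_text):
    """Return one compute capability string per GPU from the CSV output of
    `nvidia-smi --query-gpu=compute_cap --format=csv`."""
    reader = csv.reader(io.StringIO(csv_text))
    values = [row[0].strip() for row in reader if row and row[0].strip()]
    # The first non-empty row is the header ("compute_cap"); skip it.
    return values[1:]


def find_unsupported(caps, supported=SUPPORTED_CAPS):
    """Return the capabilities that are missing from the supported list."""
    return [cap for cap in caps if cap not in supported]


if __name__ == "__main__":
    # Sample text matching the nvidia-smi output shown above.
    sample_output = "compute_cap\n8.9\n"
    caps = parse_compute_caps(sample_output)
    print("detected:", caps)
    print("unsupported:", find_unsupported(caps))
```

On a real machine the sample text could be replaced by the captured output of the `nvidia-smi` command shown above.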

I think compute capability 8.9 should be added to __kernel_map of CudaGemmFlopsBenchmark in superbench/benchmarks/micro_benchmarks/cuda_gemm_flops_performance.py, with the same entry as 8.6 (AD10x GPUs are similar to that group in having a limited FP64 TFLOPS rate). In addition, third_party/Makefile has two ARCHS lists for the CUDA Toolkit >= 11.8 case that contain 86 and 90; both should be expanded by adding 89.
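
To illustrate the idea, here is a rough sketch of a capability-to-kernel lookup in which 8.9 reuses the 8.6 entry. I have not copied the real contents of __kernel_map: the dictionary layout, the kernel names (`kernel_a`, `kernel_b`, `kernel_c`), and the `select_kernels` helper are all placeholders, so the actual change in cuda_gemm_flops_performance.py may look different.

```python
# Hypothetical stand-in for __kernel_map. Keys are compute capabilities,
# values are placeholder kernel names (not the real superbenchmark entries).
KERNEL_MAP = {
    "8.0": ["kernel_a", "kernel_b", "kernel_c"],
    "8.6": ["kernel_b", "kernel_c"],
    "9.0": ["kernel_a", "kernel_b", "kernel_c"],
}

# Proposed change: give 8.9 (Ada Lovelace) its own copy of the 8.6 entry.
KERNEL_MAP["8.9"] = list(KERNEL_MAP["8.6"])


def select_kernels(capability):
    """Return the kernels for a compute capability, or raise if it is unknown."""
    if capability not in KERNEL_MAP:
        supported = " ".join(sorted(KERNEL_MAP))
        raise ValueError(
            f"Unsupported architecture - compute capability: {capability}, "
            f"supports {supported}"
        )
    return KERNEL_MAP[capability]


if __name__ == "__main__":
    # With the added entry, 8.9 resolves to the same kernels as 8.6.
    print(select_kernels("8.9"))
```

Copying the list instead of aliasing it keeps the two entries independent, in case 8.9 later needs kernels that 8.6 does not have.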
