krrishnarraj / clpeak

A tool which profiles OpenCL devices to find their peak capacities
Apache License 2.0

Half Precision not detected for RTX 3090 #104

Open BA8F0D39 opened 1 year ago

BA8F0D39 commented 1 year ago

clpeak version: 1.1.2

Platform: NVIDIA CUDA
  Device: NVIDIA GeForce RTX 3090
    Driver version  : 525.89.02 (Linux x64)
    Compute units   : 82
    Clock frequency : 1725 MHz

    Global memory bandwidth (GBPS)
      float   : 816.91
      float2  : 841.68
      float4  : 856.31
      float8  : 785.62
      float16 : 844.80

    Single-precision compute (GFLOPS)
      float   : 35976.15
      float2  : 35279.88
      float4  : 35448.44
      float8  : 35229.30
      float16 : 34781.18

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 635.40
      double2  : 634.58
      double4  : 633.12
      double8  : 630.11
      double16 : 624.10

    Integer compute (GIOPS)
      int   : 19650.09
      int2  : 19531.53
      int4  : 19486.43
      int8  : 19548.59
      int16 : 19539.19

    Integer compute Fast 24bit (GIOPS)
      int   : 19452.70
      int2  : 18920.43
      int4  : 19145.33
      int8  : 19143.94
      int16 : 19075.51

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 9.96
      enqueueReadBuffer               : 10.48
      enqueueWriteBuffer non-blocking : 5.47
      enqueueReadBuffer non-blocking  : 5.55
      enqueueMapBuffer(for read)      : 10.76
        memcpy from mapped ptr        : 15.20
      enqueueUnmap(after write)       : 13.04
        memcpy to mapped ptr          : 15.20

    Kernel launch latency : 3.56 us

moyang commented 1 year ago

There is no native half-precision support on NVIDIA Ampere (except for A100) or Ada GPUs. Their half-precision performance is the same as their single-precision performance.

BA8F0D39 commented 1 year ago

@moyang The RTX 3090 has native FP16 support in its tensor cores: https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf

512 FP16 FMA per SM per clock
128 FP16 FMA per Tensor core per clock

The RTX 3090 has 82 SMs and 328 Tensor cores.
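
To put those figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch under stated assumptions: the FMA counts are taken as per-clock figures, one FMA is counted as 2 FLOPs, and the clock is the 1725 MHz that clpeak reported above. The function name `peak_fp16_tensor_gflops` is made up for illustration.

```python
# Figures quoted in this thread (not measured by this snippet).
SM_COUNT = 82                    # SMs on the RTX 3090
FP16_FMA_PER_SM = 512            # assumed to be a per-clock figure
FP16_FMA_PER_TENSOR_CORE = 128   # assumed to be a per-clock figure
CLOCK_MHZ = 1725                 # clock frequency reported by clpeak above

# 512 / 128 implies 4 tensor cores per SM, hence 82 * 4 in total.
tensor_cores_per_sm = FP16_FMA_PER_SM // FP16_FMA_PER_TENSOR_CORE
tensor_cores = SM_COUNT * tensor_cores_per_sm


def peak_fp16_tensor_gflops(sm_count, fma_per_sm, clock_mhz):
    """Theoretical peak in GFLOPS, counting each FMA as 2 FLOPs (assumption)."""
    flops_per_clock = sm_count * fma_per_sm * 2
    return flops_per_clock * clock_mhz / 1000.0


peak = peak_fp16_tensor_gflops(SM_COUNT, FP16_FMA_PER_SM, CLOCK_MHZ)
print(f"tensor cores: {tensor_cores}")
print(f"theoretical FP16 tensor peak: {peak:.1f} GFLOPS")
```

Under these assumptions the tensor-core peak comes out at roughly four times the ~36,000 GFLOPS single-precision result measured above. It is a theoretical ceiling for dense tensor-core FMA work, not something a generic OpenCL kernel would necessarily reach.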

moyang commented 1 year ago

@BA8F0D39 This seems to be a problem with NVIDIA's OpenCL implementation. When apps (like clpeak) query the device capabilities, it reports "no half-precision support". I observed the same issue with other benchmarks, like SiSoftware Sandra.
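
For context, OpenCL applications usually decide whether half precision is available by looking for the `cl_khr_fp16` extension in the device's extension string (`CL_DEVICE_EXTENSIONS`). The sketch below shows that kind of check in isolation. It is not clpeak's actual code: the helper name `has_extension` and both sample strings are made up for illustration and are not real driver output.

```python
def has_extension(extension_string, name):
    """Return True if `name` is one of the space-separated extension names.

    Splitting into tokens avoids false positives from substring matches.
    """
    return name in extension_string.split()


# Hypothetical extension strings, for illustration only (not real driver output).
without_fp16 = "cl_khr_global_int32_base_atomics cl_khr_fp64 cl_khr_byte_addressable_store"
with_fp16 = "cl_khr_fp16 cl_khr_fp64"

for label, exts in (("device A", without_fp16), ("device B", with_fp16)):
    if has_extension(exts, "cl_khr_fp16"):
        print(f"{label}: half precision supported")
    else:
        print(f"{label}: no half precision support, skipping")
```

If the driver leaves `cl_khr_fp16` out of that string, an application relying on a check like this would skip its half-precision tests regardless of what the hardware can do. On a real system the string could be obtained through `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`, or from the `extensions` attribute of a device in pyopencl.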