krrishnarraj / clpeak

A tool which profiles OpenCL devices to find their peak capacities
Apache License 2.0

New hardware results: RDNA3 7900 XT/XTX, GeForce 4090/4080/4070 and Intel Arc A770 results? #99

Open oscarbg opened 1 year ago

oscarbg commented 1 year ago

Hi,

The title says it all.

I'd like to see results for the new NVIDIA 40x0 series, AMD RDNA3, and Intel DG2.

I hope people with the needed hardware can submit them.

Thanks.

al42and commented 1 year ago

My results for the Arc A770 on Linux kernel 6.2.1:

Platform: Intel(R) OpenCL HD Graphics
  Device: Intel(R) Graphics [0x56a0]
    Driver version  : 22.49.25018.24 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz

    Global memory bandwidth (GBPS)
      float   : 397.92
      float2  : 403.43
      float4  : 407.01
      float8  : 417.52
      float16 : 421.01

    Single-precision compute (GFLOPS)
      float   : 13018.01
      float2  : 11137.58
      float4  : 10403.04
      float8  : 10026.99
      float16 : 9701.60

    Half-precision compute (GFLOPS)
      half   : 19552.90
      half2  : 19493.52
      half4  : 19526.21
      half8  : 19459.81
      half16 : 19340.77

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 4765.67
      int2  : 4773.43
      int4  : 4789.65
      int8  : 4644.51
      int16 : 5455.67

    Integer compute Fast 24bit (GIOPS)
      int   : 4755.75
      int2  : 4768.87
      int4  : 4786.68
      int8  : 4642.19
      int16 : 5455.34

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 2.64
      enqueueReadBuffer               : 2.43
      enqueueWriteBuffer non-blocking : 2.85
      enqueueReadBuffer non-blocking  : 2.63
      enqueueMapBuffer(for read)      : 2.83
        memcpy from mapped ptr        : 14.38
      enqueueUnmap(after write)       : 2.91
        memcpy to mapped ptr          : 14.01

    Kernel launch latency : 36.30 us

oscarbg commented 1 year ago

@al42and Nice, thanks for sharing. It would be good to have Windows results as well, if you have Windows installed, to check that they don't diverge much.

al42and commented 1 year ago

Don't have Windows :(

retoXD commented 1 year ago

Kernel launch latency seems worse on Windows (78.90 us here vs. 36.30 us in the Linux run above).

Platform: Intel(R) OpenCL HD Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 31.0.101.4255 (Win64)
    Compute units   : 512
    Clock frequency : 2400 MHz

Global memory bandwidth (GBPS)
  float   : 396.30
  float2  : 403.57
  float4  : 409.15
  float8  : 419.49
  float16 : 423.01

Single-precision compute (GFLOPS)
  float   : 13346.34
  float2  : 11416.61
  float4  : 10663.24
  float8  : 10299.98
  float16 : 9975.71

Half-precision compute (GFLOPS)
  half   : 20033.96
  half2  : 19979.07
  half4  : 19969.53
  half8  : 19922.98
  half16 : 19841.67

No double precision support! Skipped

Integer compute (GIOPS)
  int   : 4830.21
  int2  : 4857.29
  int4  : 4846.14
  int8  : 4724.30
  int16 : 5532.68

Integer compute Fast 24bit (GIOPS)
  int   : 4824.44
  int2  : 4850.69
  int4  : 4829.88
  int8  : 4694.66
  int16 : 5510.71

Transfer bandwidth (GBPS)
  enqueueWriteBuffer              : 11.21
  enqueueReadBuffer               : 5.33
  enqueueWriteBuffer non-blocking : 15.99
  enqueueReadBuffer non-blocking  : 6.21
  enqueueMapBuffer(for read)      : 19.14
    memcpy from mapped ptr        : 19.38
  enqueueUnmap(after write)       : 17.15
    memcpy to mapped ptr          : 19.76

Kernel launch latency : 78.90 us
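
To make the Linux vs. Windows comparison concrete, here is a minimal Python sketch that computes the percentage difference for a few representative rows. The numbers are copied by hand from al42and's Linux run and the Windows run above; the dictionary keys and the helper names `percent_change` and `compare` are made up for this illustration and are not part of clpeak.

```python
# Values copied from the two A770 reports in this thread
# (al42and on Linux, retoXD on Windows). Keys are illustrative labels.
linux = {
    "global memory bandwidth, float16 (GBPS)": 421.01,
    "single-precision compute, float (GFLOPS)": 13018.01,
    "half-precision compute, half (GFLOPS)": 19552.90,
    "integer compute, int16 (GIOPS)": 5455.67,
    "enqueueWriteBuffer (GBPS)": 2.64,
    "kernel launch latency (us)": 36.30,
}
windows = {
    "global memory bandwidth, float16 (GBPS)": 423.01,
    "single-precision compute, float (GFLOPS)": 13346.34,
    "half-precision compute, half (GFLOPS)": 20033.96,
    "integer compute, int16 (GIOPS)": 5532.68,
    "enqueueWriteBuffer (GBPS)": 11.21,
    "kernel launch latency (us)": 78.90,
}


def percent_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def compare(base, other):
    """Percent change for every metric present in both result sets."""
    return {name: percent_change(base[name], other[name])
            for name in base if name in other}


if __name__ == "__main__":
    for name, pct in compare(linux, windows).items():
        print(f"{name:45s} {pct:+8.1f}%")
```

On these rows the compute and global memory figures differ by less than 3%, while the host transfer and launch latency rows differ by large factors, so those are the ones worth a closer look.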

leuc commented 1 year ago

Kernel 5.17.0-1020-oem and intel-i915-dkms 1.23.3.19.230122.18.5.17.0.1020+i38-1, but transfer bandwidth is capped by PCIe 3.0:

Platform: Intel(R) OpenCL HD Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 23.05.25593.18 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz

    Global memory bandwidth (GBPS)
      float   : 399.42
      float2  : 403.78
      float4  : 408.53
      float8  : 418.51
      float16 : 422.97

    Single-precision compute (GFLOPS)
      float   : 13000.09
      float2  : 11134.71
      float4  : 10402.13
      float8  : 10024.48
      float16 : 9706.12

    Half-precision compute (GFLOPS)
      half   : 19552.26
      half2  : 19500.15
      half4  : 19505.83
      half8  : 19463.29
      half16 : 19341.72

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 4311.91
      int2  : 4322.29
      int4  : 4339.57
      int8  : 4212.78
      int16 : 4920.77

    Integer compute Fast 24bit (GIOPS)
      int   : 4307.33
      int2  : 4327.73
      int4  : 4341.63
      int8  : 4203.23
      int16 : 4906.83

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 9.47
      enqueueReadBuffer               : 4.50
      enqueueWriteBuffer non-blocking : 11.07
      enqueueReadBuffer non-blocking  : 4.86
      enqueueMapBuffer(for read)      : 10.10
        memcpy from mapped ptr        : 4.80
      enqueueUnmap(after write)       : 11.38
        memcpy to mapped ptr          : 15.45

    Kernel launch latency : 9.05 us
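
As a rough check on the PCIe 3.0 cap mentioned above, the sketch below computes the theoretical one-direction bandwidth of a PCIe 3.0 link (8 GT/s per lane with 128b/130b encoding) and expresses the measured transfer rates as a fraction of it. Two assumptions are not stated in the thread: that the card sits in an x16 slot, and that clpeak's GBPS means 10^9 bytes per second. The function name `pcie_bandwidth_gbps` is made up for this example.

```python
def pcie_bandwidth_gbps(gt_per_s, lanes, payload_bits=128, total_bits=130):
    """Theoretical one-direction link bandwidth in GB/s (10^9 bytes/s).

    Accounts only for line encoding (128b/130b for PCIe 3.0); packet and
    protocol overhead are ignored, so real transfers land below this.
    """
    bytes_per_lane = gt_per_s * payload_bits / total_bits / 8
    return bytes_per_lane * lanes


# Assumption: x16 link. The thread only says "PCIe 3.0".
PCIE3_X16 = pcie_bandwidth_gbps(8.0, 16)

# Transfer rows copied from the results above.
measured = {
    "enqueueWriteBuffer": 9.47,
    "enqueueReadBuffer": 4.50,
    "enqueueWriteBuffer non-blocking": 11.07,
    "enqueueReadBuffer non-blocking": 4.86,
}

if __name__ == "__main__":
    print(f"PCIe 3.0 x16 theoretical: {PCIE3_X16:.2f} GB/s")
    for name, gbps in measured.items():
        print(f"{name:32s} {gbps:6.2f} GBPS  {gbps / PCIE3_X16:.0%} of link")
```

Under these assumptions the link tops out at about 15.75 GB/s, so the write figures sit reasonably close to the ceiling while the read figures are well below it.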