krrishnarraj / clpeak

A tool which profiles OpenCL devices to find their peak capacities
Apache License 2.0

enqueueWriteBuffer: Initialize host buffer to obtain accurate measurement #62

Closed ssanchez11 closed 4 years ago

ssanchez11 commented 4 years ago

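When a host buffer is passed as the source to enqueueWriteBuffer(), OpenCL copies it with memcpy(). Newly allocated memory that has never been written points to zero pages; physical memory is allocated only when the memory is first written to. A memcpy() from such a buffer therefore keeps reading the same zero page, which is much faster than reading distinct physical pages and inflates the measured bandwidth.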

Therefore, initialize the host buffer before passing it to enqueueWriteBuffer() to obtain accurate measurements.
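
For illustration, the idea can be sketched as below. This is a minimal sketch, not the code from this merge request: the helper name makeInitializedHostBuffer and the fill value 0xA5 are hypothetical, and the OpenCL call appears only as a comment (assuming the OpenCL C++ bindings and an existing queue and device buffer) so that the snippet builds without an OpenCL SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical helper (not from the clpeak source): allocate a host buffer
// and write to every byte, so each page is backed by its own physical
// memory before the buffer is used as a transfer source.
std::unique_ptr<unsigned char[]> makeInitializedHostBuffer(std::size_t bytes)
{
    assert(bytes > 0);
    // new[] without () leaves the bytes uninitialized, so the pages may
    // never have been touched at this point.
    std::unique_ptr<unsigned char[]> buf(new unsigned char[bytes]);
    // Writing a pattern to the whole buffer forces a write to every page.
    std::memset(buf.get(), 0xA5, bytes);
    return buf;
}

// Intended use (assumed names: queue, devBuf):
//   auto host = makeInitializedHostBuffer(bytes);
//   queue.enqueueWriteBuffer(devBuf, CL_TRUE, 0, bytes, host.get());
```

Any fill pattern should work, since the point is only that every page of the source has been written at least once before the timed transfer.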

Results on Intel hardware:

Before:
    Platform: Intel(R) OpenCL HD Graphics
      Device: Intel(R) Gen9 HD Graphics NEO
        Driver version  : 19.03.0 (Linux x64)
        Compute units   : 48
        Clock frequency : 1200 MHz

        Transfer bandwidth (GBPS)
          enqueueWriteBuffer         : 34.18
          enqueueReadBuffer          : 13.02
          enqueueMapBuffer(for read) : 14316530.00
            memcpy from mapped ptr   : 13.01
          enqueueUnmap(after write)  : inf
            memcpy to mapped ptr     : 13.37

After:
    Platform: Intel(R) OpenCL HD Graphics
      Device: Intel(R) Gen9 HD Graphics NEO
        Driver version  : 19.03.0 (Linux x64)
        Compute units   : 48
        Clock frequency : 1200 MHz

        Transfer bandwidth (GBPS)
          enqueueWriteBuffer         : 13.44
          enqueueReadBuffer          : 12.91
          enqueueMapBuffer(for read) : 21474796.00
            memcpy from mapped ptr   : 12.91
          enqueueUnmap(after write)  : inf
            memcpy to mapped ptr     : 13.44

rwmcguir commented 4 years ago

This change is likely necessary for other hardware platforms as well, and even for host-only implementations. Linux kernels in general have moved to using zero pages for uninitialized memory. Older kernels (back in the 2.6 world) would page in REAL, unique pages when uninitialized memory was first read. I don't know exactly when this change took place, but now any copy from uninitialized memory will likely not bring in physical memory at all, and instead results in a highly CPU-optimized read from a single page that stays fully cached in L1. This produces the very unrealistic results shown above.

This is not just an OpenCL issue: it applies to any copy from any uninitialized region on Linux. As such, ALL bandwidth benchmarks need to ensure that every READ operation is from fully initialized memory pages.

See this page for background: https://lwn.net/Articles/340370/
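
The effect described above can be probed on the host alone, without OpenCL. The sketch below is a rough illustration under stated assumptions: that a large calloc() returns pages that have not been touched yet (typical for glibc on Linux, but not guaranteed), and that a single timed memcpy() after one warm-up copy is representative. The names copyBandwidthGBps, ZeroPageDemo and runZeroPageDemo are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: one warm-up copy, then time a second memcpy of
// `bytes` from src to dst and return the result in GB/s.
double copyBandwidthGBps(unsigned char* dst, const unsigned char* src, std::size_t bytes)
{
    std::memcpy(dst, src, bytes);  // warm-up, intended to absorb first-touch page faults
    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst, src, bytes);
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    if (seconds <= 0.0) seconds = 1e-9;  // guard against a zero reading from a coarse clock
    return (static_cast<double>(bytes) / 1e9) / seconds;
}

struct ZeroPageDemo {
    double untouchedGBps;    // source never written (assumed to map to zero pages)
    double initializedGBps;  // source fully written before the copy
    bool copiesMatch;        // both copies delivered the expected bytes
};

ZeroPageDemo runZeroPageDemo(std::size_t bytes)
{
    assert(bytes > 0);
    // Assumption: for a large request, calloc hands back untouched pages.
    unsigned char* src = static_cast<unsigned char*>(std::calloc(bytes, 1));
    unsigned char* dst = static_cast<unsigned char*>(std::malloc(bytes));
    if (src == nullptr || dst == nullptr) {
        std::free(src);
        std::free(dst);
        return {0.0, 0.0, false};
    }

    double untouched = copyBandwidthGBps(dst, src, bytes);
    bool zerosOk = (dst[0] == 0) && (dst[bytes - 1] == 0);

    // Write every byte of the source so each page gets its own physical page.
    std::memset(src, 0xA5, bytes);
    double initialized = copyBandwidthGBps(dst, src, bytes);
    bool patternOk = (dst[0] == 0xA5) && (dst[bytes - 1] == 0xA5);

    std::free(src);
    std::free(dst);
    return {untouched, initialized, zerosOk && patternOk};
}
```

Calling runZeroPageDemo(256u << 20) and comparing untouchedGBps with initializedGBps gives a rough indication of whether a given system shows the effect. The numbers depend on the kernel, the allocator, transparent huge page settings and compiler optimizations, so no particular ratio should be expected.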

ssanchez11 commented 4 years ago

Krishnaraj,

Could you please provide feedback or approve this merge request?

Thank you,

Sebastian

krrishnarraj commented 4 years ago

Yea. That makes sense. Thanks