intel / pti-gpu

Profiling Tools Interfaces for GPU (PTI for GPU) is a set of Getting Started Documentation and Tools Library to start performance analysis on Intel(R) Processor Graphics easily
MIT License
202 stars 57 forks source link

[zetracer] `zeCommandListAppendEventReset` bugs #14

Open TApplencourt opened 2 years ago

TApplencourt commented 2 years ago

Hi Anton,

@Kerilk and I are also developing a L0 tracer (https://github.com/argonne-lcf/THAPI). Recently we found that we don't handle the use case when a user resets an event with zeCommandListAppendEventReset. It looks like your zetracer has the same limitation (see the reproducer below).

In our tool supporting such use case will be expensive with the current L0 spec. We asked many times for L0 to add native callbacks (also on event change). This should greatly reduce the implementation complexity and overhead of tracing.

For now, our feedback didn't get a lot of traction. Maybe if two independent teams implementing tracing in two different source codes need callbacks, L0 will be more inclined to add callbacks...

So the question is, do you think having callbacks will help onetrace?

Reproducer

ze.cpp

#include <fstream>
#include <iostream>
#include <level_zero/ze_api.h>
#include <limits>
#include <memory>

#define zeCall(myZeCall)                                                                                                                                                                               \
  do {                                                                                                                                                                                                 \
    if (myZeCall != ZE_RESULT_SUCCESS) {                                                                                                                                                               \
      std::cout << "Error at " << #myZeCall << ": " << __FUNCTION__ << ": " << std::dec << __LINE__ << "\n";                                                                                           \
      std::terminate();                                                                                                                                                                                \
    }                                                                                                                                                                                                  \
  } while (0);

void foo(ze_context_handle_t context, ze_device_handle_t device, ze_kernel_handle_t kernel1,ze_kernel_handle_t kernel2) {
  // Some magic number

  const int computeOrdinal = 0;

  ze_command_queue_desc_t cmdQueueDesc = {};
  cmdQueueDesc.mode = ZE_COMMAND_QUEUE_MODE_ASYNCHRONOUS;
  cmdQueueDesc.ordinal = computeOrdinal;
  cmdQueueDesc.index = 0;
  ze_command_queue_handle_t queue;
  zeCall(zeCommandQueueCreate(context, device, &cmdQueueDesc, &queue));

  ze_command_list_desc_t listDesc = {};
  listDesc.commandQueueGroupOrdinal = computeOrdinal;
  ze_command_list_handle_t list;
  zeCall(zeCommandListCreate(context, device, &listDesc, &list));

  ze_group_count_t threadGroupCount = {};
  threadGroupCount.groupCountX = 1u;
  threadGroupCount.groupCountY = 1u;
  threadGroupCount.groupCountZ = 1u;

  // Create event pool
  ze_event_pool_desc_t eventPoolDesc = {
      ZE_STRUCTURE_TYPE_EVENT_POOL_DESC, NULL, ZE_EVENT_POOL_FLAG_KERNEL_TIMESTAMP,
      1 // One event on the pool
  };

  ze_event_pool_handle_t hEventPool;
  zeCall(zeEventPoolCreate(context, &eventPoolDesc, 1, &device, &hEventPool));

  ze_event_desc_t eventDesc = {
      ZE_STRUCTURE_TYPE_EVENT_DESC, NULL,
      0, // index
      0, // no memory/cache coherency required on signal
      0  // No need for memory/cache coherency on wait
  };
  ze_event_handle_t hEvent;
  zeCall(zeEventCreate(hEventPool, &eventDesc, &hEvent));

#ifdef K1
  std::cout<<"Sumiting K1" << std::endl;
  zeCall(zeCommandListAppendLaunchKernel(list, kernel1, &threadGroupCount, hEvent, 0, nullptr));
#endif
  zeCall(zeCommandListAppendBarrier(list, nullptr, 0, nullptr));
  zeCall(zeCommandListAppendEventReset(list, hEvent));
  zeCall(zeCommandListAppendBarrier(list, nullptr, 0, nullptr));
#ifdef K2
  std::cout<<"Sumiting K2" << std::endl;
  zeCall(zeCommandListAppendLaunchKernel(list, kernel2, &threadGroupCount, hEvent, 0, nullptr));
#endif
  zeCall(zeCommandListClose(list));

  zeCall(zeCommandQueueExecuteCommandLists(queue, 1, &list, nullptr));
  zeCall(zeCommandQueueSynchronize(queue, std::numeric_limits<uint64_t>::max()));

}

int main(int argc, char *argv[]) {
  zeCall(zeInit(ZE_INIT_FLAG_GPU_ONLY));

  uint32_t driverCount = 0;
  zeCall(zeDriverGet(&driverCount, nullptr));
  ze_driver_handle_t driverHandle;

  zeCall(zeDriverGet(&driverCount, &driverHandle));

  ze_context_handle_t context;
  ze_context_desc_t contextDesc = {};
  zeCall(zeContextCreate(driverHandle, &contextDesc, &context));

  // Get the root devices
  uint32_t deviceCount = 0;
  zeCall(zeDeviceGet(driverHandle, &deviceCount, nullptr));
  if (deviceCount == 0) {
    std::cout << "No devices found \n";
    std::terminate();
  }

  ze_device_handle_t device;
  deviceCount = 1;
  zeCall(zeDeviceGet(driverHandle, &deviceCount, &device));

  // Create kernel
  std::string kernelFile = "kernel_XE_HP_COREcore.spv";
  ze_module_format_t kernelFormat = ZE_MODULE_FORMAT_IL_SPIRV;

  std::ifstream file(kernelFile, std::ios_base::in | std::ios_base::binary);
  if (false == file.good()) {
    std::cout << kernelFile << " file not found\n";
    std::terminate();
  }

  uint32_t spirvSize = 0;
  file.seekg(0, file.end);
  spirvSize = static_cast<size_t>(file.tellg());
  file.seekg(0, file.beg);

  auto spirvModule = std::make_unique<char[]>(spirvSize);
  file.read(spirvModule.get(), spirvSize);

  ze_module_handle_t module;
  ze_module_desc_t moduleDesc = {};
  moduleDesc.format = kernelFormat;
  moduleDesc.pInputModule = reinterpret_cast<const uint8_t *>(spirvModule.get());
  moduleDesc.inputSize = spirvSize;
  zeCall(zeModuleCreate(context, device, &moduleDesc, &module, nullptr));

  ze_kernel_handle_t kernel1;
  ze_kernel_desc_t kernelDesc1 = {};
  kernelDesc1.pKernelName = "k1_noop";
  zeCall(zeKernelCreate(module, &kernelDesc1, &kernel1));
  zeCall(zeKernelSetGroupSize(kernel1, 256, 1, 1));

  ze_kernel_handle_t kernel2;
  ze_kernel_desc_t kernelDesc2 = {};
  kernelDesc2.pKernelName = "k2_sleep";
  zeCall(zeKernelCreate(module, &kernelDesc2, &kernel2));
  zeCall(zeKernelSetGroupSize(kernel2, 256, 1, 1));

  void *ptr1 = nullptr;
  ze_device_mem_alloc_desc_t deviceDesc1 = {};
  ze_host_mem_alloc_desc_t hostDesc1 = {};
  zeCall(zeMemAllocShared(context, &deviceDesc1, &hostDesc1, 64, 0, device, &ptr1));

  void *ptr2 = nullptr;
  ze_device_mem_alloc_desc_t deviceDesc2 = {};
  ze_host_mem_alloc_desc_t hostDesc2 = {};
  zeCall(zeMemAllocShared(context, &deviceDesc2, &hostDesc2, 64, 0, device, &ptr2));

  zeCall(zeKernelSetArgumentValue(kernel1, 0, 8, &ptr1));
  zeCall(zeKernelSetArgumentValue(kernel2, 0, 8, &ptr2));

  foo(context, device, kernel1, kernel2);
  return 0;
}

kernel.cl

#define MAD_4(x, y)     x = mad(y, x, y);   y = mad(x, y, x);   x = mad(y, x, y);   y = mad(x, y, x);
#define MAD_16(x, y)    MAD_4(x, y);        MAD_4(x, y);        MAD_4(x, y);        MAD_4(x, y);
#define MAD_64(x, y)    MAD_16(x, y);       MAD_16(x, y);       MAD_16(x, y);       MAD_16(x, y);

__kernel void k1_noop(__global double *ptr) {
    ptr[0] = 9;
}

__kernel void k2_sleep(__global double *ptr) {
    double x = (double)get_local_id(1);
    double y = (double)get_local_id(0);
    for(int i=0; i<1024*64; i++)
    {
        MAD_64(x, y);
    }
    ptr[0] = y;
}

Compile

ocloc compile -file kernel.cl -device $FOO
icpx -lze_loader ze.cpp -Wall -DK1-o k1
icpx -lze_loader ze.cpp -Wall -DK2 -o k2
icpx -lze_loader ze.cpp -Wall -DK1 -DK2 -o k1k2

What we should expect?

We should expect k1 to show the kernel execution. But we don't see it

onetrace ./k1

=== API Timing Results: ===

             Total Execution Time (ns):            186368143
    Total API Time for L0 backend (ns):            185654838

== L0 Backend: ==

                         Function,       Calls,           Time (ns),  Time (%),        Average (ns),            Min (ns),            Max (ns)
                   zeModuleCreate,           1,           181219427,     97.61,           181219427,           181219427,           181219427
              zeCommandListCreate,           1,             1629179,      0.88,             1629179,             1629179,             1629179
             zeCommandQueueCreate,           1,              845735,      0.46,              845735,              845735,              845735
                 zeMemAllocShared,           2,              830426,      0.45,              415213,              178284,              652142
zeCommandQueueExecuteCommandLists,           1,              558334,      0.30,              558334,              558334,              558334
        zeCommandQueueSynchronize,           1,              309534,      0.17,              309534,              309534,              309534
                zeEventPoolCreate,           1,              177962,      0.10,              177962,              177962,              177962
                    zeEventCreate,           1,               53095,      0.03,               53095,               53095,               53095
    zeCommandListAppendEventReset,           1,                7936,      0.00,                7936,                7936,                7936
                   zeKernelCreate,           2,                6768,      0.00,                3384,                 946,                5822
       zeCommandListAppendBarrier,           2,                5811,      0.00,                2905,                1573,                4238
         zeKernelSetArgumentValue,           2,                5096,      0.00,                2548,                1103,                3993
             zeKernelSetGroupSize,           2,                2299,      0.00,                1149,                 224,                2075
                  zeContextCreate,           1,                1710,      0.00,                1710,                1710,                1710
               zeCommandListClose,           1,                 675,      0.00,                 675,                 675,                 675
                      zeDeviceGet,           2,                 374,      0.00,                 187,                 132,                 242
                      zeDriverGet,           2,                 275,      0.00,                 137,                  49,                 226
                           zeInit,           1,                 202,      0.00,                 202,                 202,                 202

=== Device Timing Results: ===

                Total Execution Time (ns):            186368143
    Total Device Time for L0 backend (ns):                 3680

== L0 Backend: ==

                    Kernel,       Calls,           Time (ns),    Time (%),        Average (ns),            Min (ns),            Max (ns)
zeCommandListAppendBarrier,           2,                3680,      100.00,                1840,                1280,                2400

And if we run k1 and k2, we have timing for each kernel but they correspond only to k2

onetrace ./k1k2

=== API Timing Results: ===

             Total Execution Time (ns):            253710707
    Total API Time for L0 backend (ns):            252516062

== L0 Backend: ==

                         Function,       Calls,           Time (ns),  Time (%),        Average (ns),            Min (ns),            Max (ns)
                   zeModuleCreate,           1,           183044952,     72.49,           183044952,           183044952,           183044952
        zeCommandQueueSynchronize,           1,            59487507,     23.56,            59487507,            59487507,            59487507
zeCommandQueueExecuteCommandLists,           1,             6742682,      2.67,             6742682,             6742682,             6742682
              zeCommandListCreate,           1,             1637588,      0.65,             1637588,             1637588,             1637588
                 zeMemAllocShared,           2,              821432,      0.33,              410716,              291064,              530368
             zeCommandQueueCreate,           1,              670053,      0.27,              670053,              670053,              670053
                    zeEventCreate,           1,               51961,      0.02,               51961,               51961,               51961
  zeCommandListAppendLaunchKernel,           2,               16393,      0.01,                8196,                3280,               13113
                zeEventPoolCreate,           1,               13748,      0.01,               13748,               13748,               13748
    zeCommandListAppendEventReset,           1,                7173,      0.00,                7173,                7173,                7173
                   zeKernelCreate,           2,                6697,      0.00,                3348,                 948,                5749
         zeKernelSetArgumentValue,           2,                5205,      0.00,                2602,                1089,                4116
       zeCommandListAppendBarrier,           2,                4852,      0.00,                2426,                1297,                3555
             zeKernelSetGroupSize,           2,                2327,      0.00,                1163,                 252,                2075
                  zeContextCreate,           1,                2051,      0.00,                2051,                2051,                2051
               zeCommandListClose,           1,                 617,      0.00,                 617,                 617,                 617
                      zeDeviceGet,           2,                 325,      0.00,                 162,                 113,                 212
                      zeDriverGet,           2,                 294,      0.00,                 147,                  44,                 250
                           zeInit,           1,                 205,      0.00,                 205,                 205,                 205

=== Device Timing Results: ===

                Total Execution Time (ns):            253710707
    Total Device Time for L0 backend (ns):                 8640

== L0 Backend: ==

                    Kernel,       Calls,           Time (ns),    Time (%),        Average (ns),            Min (ns),            Max (ns)
                     sleep,           1,                3200,       37.04,                3200,                3200,                3200
                      noop,           1,                3200,       37.04,                3200,                3200,                3200
zeCommandListAppendBarrier,           2,                2240,       25.93,                1120,                1120,                1120

Hope this help, Don't hesitate if you have any feedback.

anton-v-gorshkov commented 2 years ago

Hi @TApplencourt, thanks for your feedback! Please give me some time to dive deeply into this - first of all I'd like to understand how general is this problem. But even right now I agree that current kernel tracing for Level Zero looks over-complicated, and we probably need to simplify things somehow.

anton-v-gorshkov commented 2 years ago

Hi @TApplencourt,

The problem you've reported is known for us, and indeed we don't support such a case in our tools for now. The reason - we are not aware of any customer application that uses such an approach. Do you face with this case in real life, or it's just a reproducer?

Of cause the lack of support right now doesn't mean we don't plan to add it. To deal with such a case zeCommandListAppendQueryKernelTimestamps function should be used. Note also, that having callback in Level Zero does not resolve this issue by itself since it's more about current Level Zero design. But yes, it can make customers lives easier by moving all the problems inside Level Zero rather than having them outside.

Currently we are thinking about an approach similar to CUPTI Activity, where one can be subscribed to some event (e.g. kernel invocation) to be notified asynchronously (with a callback) if this event happened. Do you believe this is something you would prefer to use?

TApplencourt commented 2 years ago

Do you face with this case in real life, or it's just a reproducer?

Just a reproducer (for now :D)

To deal with such a case zeCommandListAppendQueryKernelTimestamps function should be used.

Oh yes, this required a little infrastructure (allocating device memory, handling offset, ...) but totally feasible indeed!

Currently we are thinking about an approach similar to CUPTI Activity, where one can be subscribed to some event (e.g. kernel invocation) to be notified asynchronously (with a callback) if this event happened. Do you believe this is something you would prefer to use?

Something around those lines sounds good! But I'm not by any means an expert, I will let @Kerilk write a more insightful reply.