intel / compute-runtime

Intel® Graphics Compute Runtime for oneAPI Level Zero and OpenCL™ Driver
MIT License

Clarification on `zetMetricStreamerReadData` Behavior for Non-Overlapping Kernel Profiling #767

Open jssonx opened 1 month ago

jssonx commented 1 month ago

Environment

Context

I'm developing a profiler for SYCL offload programs. My approach involves serializing kernel launches using zeEventHostSynchronize to ensure only one kernel is offloaded to the Intel GPU device at a time. For each kernel, I use a profiling thread to read stall sampling data using zetMetricStreamerReadData.

Current Implementation

Currently, after each kernel execution, I collect and process the data. To ensure non-overlapping stall samples between kernels, I've implemented a manual buffer flushing function zeroFlushStreamerBuffer(streamer, desc). This function closes the current streamer and opens a new one.

void zeroFlushStreamerBuffer(zet_metric_streamer_handle_t& streamer, ZeDeviceDescriptor* desc)
{
    ze_result_t status = ZE_RESULT_SUCCESS;
    // Close the old streamer
    status = zetMetricStreamerClose(streamer);
    level0_check_result(status, __LINE__);
    // Open a new streamer
    uint32_t interval = 500000; // ns
    zet_metric_streamer_desc_t streamer_desc = {ZET_STRUCTURE_TYPE_METRIC_STREAMER_DESC, nullptr, max_metric_samples, interval};
    status = zetMetricStreamerOpen(desc->context_, desc->device_, desc->metric_group_, &streamer_desc, nullptr, &streamer);
    if (status != ZE_RESULT_SUCCESS) {
        std::cerr << "[ERROR] Failed to open metric streamer (" << status << "). The sampling interval might be too small." << std::endl;
        streamer = nullptr;
        return;
    }
    if (streamer_desc.notifyEveryNReports > max_metric_samples) {
        max_metric_samples = streamer_desc.notifyEveryNReports;
    }
}

Current Implementation Details

To provide more context, here's the main profiling loop where zeroFlushStreamerBuffer is used:

void 
ZeMetricProfiler::RunProfilingLoop
(
  ZeDeviceDescriptor* desc,
  zet_metric_streamer_handle_t& streamer
)
{
  std::vector<uint8_t> raw_metrics(MAX_METRIC_BUFFER + 512);
  desc->profiling_state_.store(PROFILER_ENABLED, std::memory_order_release);
  ze_result_t status;

  while (desc->profiling_state_.load(std::memory_order_acquire) != PROFILER_DISABLED) {
    // Wait for the kernel to start running
    while (true) {
      status = zeEventHostSynchronize(desc->serial_kernel_start_, 50000000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      // Handle case where kernel execution is extremely short:
      // In such cases, the kernel might finish before zeEventHostSynchronize can detect the start event.
      // Without this check, a deadlock could occur:
      // - The Profiling thread would keep waiting for the start event (which has already been reset).
      // - The App thread would be waiting for the Profiling thread to complete data processing.
      // kernel_started_ allows Profiling thread to proceed, avoiding deadlock.
      if (desc->kernel_started_.load(std::memory_order_acquire)) {
        break;
      }
      if (desc->profiling_state_.load(std::memory_order_acquire) == PROFILER_DISABLED) {
        return;
      }
    }
    // Kernel is running, enter sampling loop
    while (true) {
      // Update correlation ID
      gpu_correlation_channel_receive(1, UpdateCorrelationID, desc);
      // Wait for the next interval
      status = zeEventHostSynchronize(desc->serial_kernel_end_, 5000);
      if (status == ZE_RESULT_SUCCESS) {
        break;
      }
      CollectAndProcessMetrics(desc, streamer, raw_metrics);
    }
    // Kernel has finished, perform final sampling and cleanup
    CollectAndProcessMetrics(desc, streamer, raw_metrics);
    // FIXME(Yuning): may need a better way to flush the streamer buffer without repeatedly closing and reopening the streamer
    zeroFlushStreamerBuffer(streamer, desc);
    desc->running_kernel_ = nullptr;
    desc->kernel_started_.store(false, std::memory_order_release);

    // Notify the app thread that data processing is complete
    status = zeEventHostSignal(desc->serial_data_ready_);
    level0_check_result(status, __LINE__);
  }
}

This code demonstrates how we currently handle metric collection for each kernel execution, including the use of zeroFlushStreamerBuffer to attempt non-overlapping data collection between kernels.

Questions

  1. Data Overlap: When collecting data for a kernel after its execution, is there a possibility that the data from zetMetricStreamerReadData includes stall samples from the previous kernel? My goal is to obtain non-overlapping stall samples for each kernel to enable fine-grained performance analysis.

  2. API Enhancement: If my understanding is correct, would it be possible to provide a Level Zero API for flushing the metric streamer, such as zetMetricStreamerFlushData? This could be more efficient than the current zeroFlushStreamerBuffer implementation, which has to close and reopen the streamer after every kernel.

  3. Clarification: If my understanding is incorrect, could you please confirm that each call to zetMetricStreamerReadData always returns non-overlapping data? This would allow me to remove the zeroFlushStreamerBuffer function, potentially improving performance.

Request

I would greatly appreciate clarification on the behavior of zetMetricStreamerReadData in this context and any guidance on the best practices for ensuring non-overlapping metric collection between kernel executions.

jssonx commented 1 month ago

@jmellorcrummey FYI

joshuaranjan commented 1 month ago

Hi,

> Data Overlap: When collecting data for a kernel after its execution, is there a possibility that the data from zetMetricStreamerReadData includes stall samples from the previous kernel? My goal is to obtain non-overlapping stall samples for each kernel to enable fine-grained performance analysis.

From the API specification point of view, the only way to guarantee this currently is to close and reopen the streamer. However, the actual behaviour can be platform-specific. For example, on Aurora, if the previous kernel execution has completed (ensured using a HostSynchronize call) and all of its stall data has been read out before the next kernel starts, then there should be no overlap in the stall data.
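The read-out condition described above (kernel completed, then all pending stall data read before the next launch) can be sketched as a drain loop. This is a minimal illustration and not code from the thread or the driver: `ReadFn` and `DrainStreamer` are hypothetical names, and the reader is passed in as a callback so the sketch compiles without the Level Zero headers. It also assumes that a read on an empty streamer reports zero bytes, which may not hold on every platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical reader callback: copies up to `capacity` bytes of pending raw
// metric data into `dst` and returns the number of bytes written (0 = empty).
// In a real profiler this would presumably wrap
//   zetMetricStreamerReadData(streamer, UINT32_MAX, &size, dst)
// with `size` set to `capacity` before the call and returned afterwards.
using ReadFn = std::function<std::size_t(std::uint8_t* dst, std::size_t capacity)>;

// Reads until the reader reports no pending data, appending every chunk to
// `out`. Returns the total number of bytes drained. Intended to be called only
// after the kernel's completion event has been synchronised, so that no new
// samples from that kernel can still arrive.
std::size_t DrainStreamer(const ReadFn& read_chunk,
                          std::vector<std::uint8_t>& scratch,
                          std::vector<std::uint8_t>& out) {
  assert(!scratch.empty() && "scratch buffer must have non-zero capacity");
  std::size_t total = 0;
  while (true) {
    const std::size_t n = read_chunk(scratch.data(), scratch.size());
    if (n == 0) {
      break;  // nothing left: the next kernel starts from an empty buffer
    }
    assert(n <= scratch.size());
    out.insert(out.end(), scratch.data(), scratch.data() + n);
    total += n;
  }
  return total;
}
```

In the profiling loop from the original post, a helper like this might take the place of the final CollectAndProcessMetrics call plus zeroFlushStreamerBuffer once zeEventHostSynchronize on serial_kernel_end_ has succeeded. Whether that is enough to avoid overlap remains platform-specific, as noted above, so closing and reopening the streamer is still the only approach backed by the specification.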

> API Enhancement: If my understanding is correct, would it be possible to provide a Level Zero API for flushing the metric streamer, such as zetMetricStreamerFlushData? This could potentially be more efficient than the current zeroFlushStreamerBuffer implementation.

Yes. We are discussing the usefulness of such an API internally, and a concrete use case like the one you describe would help us finalize it.

> Clarification: If my understanding is incorrect, could you please confirm that each call to zetMetricStreamerReadData always returns non-overlapping data? This would allow me to remove the zeroFlushStreamerBuffer function, potentially improving performance.

I think the answer to the first question covers this. Please let me know if you need any further clarification.