intel / pti-gpu

Profiling Tools Interfaces for GPU (PTI for GPU) is a set of Getting Started Documentation and Tools Library to start performance analysis on Intel(R) Processor Graphics easily
MIT License

Request to support SYCL graph tracing #60

Closed: ytzhang1 closed this issue 4 months ago

ytzhang1 commented 8 months ago

SYCL Graph is an experimental feature released in oneAPI 2024.0, and it seems that neither unitrace nor onetrace supports tracing SYCL graph kernels. In the following image, the device activities show only three zeCommandListAppendBarrier calls.
sycl-graph-tracing

I used the following command to trace the binary:

unitrace --demangle --chrome-device-activities --chrom-kernel-activities ./sycl-graph-app

It would be great if unitrace could trace SYCL graphs at either graph granularity or kernel granularity.

Sarbojit2019 commented 8 months ago

@ytzhang1, could you please share your sycl-graph-app code? I need the source to build it locally and test it with unitrace. I am working on enabling SYCL graph support in unitrace.

ytzhang1 commented 8 months ago

Hi @Sarbojit2019, the following is the test code I have been using. Thanks!

#include <iostream>
#include <sycl/sycl.hpp>

void run1(sycl::queue &q, float* dst, float* src, float* tmp1, float* tmp2, int count)
{
    sycl::event ek1 = q.submit([&](sycl::handler &h) {
        h.parallel_for(count, [=](sycl::item<1> item) {
            int idx = item.get_id(0);
            tmp1[idx] = src[idx] * 2;
        });
    });    
    sycl::event ek2 = q.submit([&](sycl::handler &h) {
        h.parallel_for(count, [=](sycl::item<1> item) {
            int idx = item.get_id(0);
            tmp2[idx] = tmp1[idx] * 3;
        });
    }); 
    sycl::event ek3 = q.submit([&](sycl::handler &h) {
        h.parallel_for(count, [=](sycl::item<1> item) {
            int idx = item.get_id(0);
            dst[idx] = tmp2[idx] + 11;
        });
    });
}

int test0()
{
    sycl::queue q{sycl::gpu_selector_v, {sycl::property::queue::in_order(),
                                       sycl::ext::intel::property::queue::no_immediate_command_list()}};

    int count = 1024 * 1024;
    float *inp = sycl::malloc_device<float>(count, q);
    float *outp = sycl::malloc_device<float>(count, q);
    float *tmp1 = sycl::malloc_device<float>(count, q);
    float *tmp2 = sycl::malloc_device<float>(count, q);    
    float *inp_h = new float[count];
    float *outp_h = new float[count];
    for (size_t i = 0; i < count; ++i) {
      inp_h[i] = i/4;
      outp_h[i] = -1;
    }

    q.memcpy(inp, inp_h, count * sizeof(float)).wait();

    // record graph
    sycl::ext::oneapi::experimental::command_graph g {q.get_context(), q.get_device()};
    g.begin_recording(q);
    run1(q, outp, inp, tmp1, tmp2, count);
    g.end_recording();
    auto execGraph = g.finalize();
    q.ext_oneapi_graph(execGraph).wait();

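    q.memcpy(outp_h, outp, count * sizeof(float)).wait();

    // release device and host allocations
    sycl::free(inp, q);
    sycl::free(outp, q);
    sycl::free(tmp1, q);
    sycl::free(tmp2, q);
    delete[] inp_h;
    delete[] outp_h;

    std::cout << "test finished." << std::endl;
    return 0;
}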

int main(int argc, char** argv)
{
  test0();
}
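
The test code above copies the result back to outp_h but never inspects it, so a trace that shows the kernels does not by itself confirm that the graph produced the right values. A minimal host-side check could look like the sketch below. It assumes only what the three kernels in run1() compute (tmp1 = src * 2, tmp2 = tmp1 * 3, dst = tmp2 + 11); the helper names expected_value and count_mismatches and the tolerance are hypothetical and not part of the original reproducer or of unitrace.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Host-side reference for the three chained kernels in run1():
// tmp1 = src * 2, tmp2 = tmp1 * 3, dst = tmp2 + 11.
inline float expected_value(float src) {
    return src * 2.0f * 3.0f + 11.0f;
}

// Hypothetical helper: returns how many elements of `result` differ from
// the host-side reference computed from `input`. The relative tolerance
// is an assumption made for this sketch.
inline std::size_t count_mismatches(const float* input, const float* result,
                                    std::size_t count, float rel_tol = 1e-5f) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const float expected = expected_value(input[i]);
        const float scale = std::max(1.0f, std::fabs(expected));
        if (std::fabs(result[i] - expected) > rel_tol * scale) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

In test0(), this could be called as count_mismatches(inp_h, outp_h, count) right after the final memcpy back to the host; a nonzero return value would indicate that the graph execution did not produce the expected output.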

zma2 commented 4 months ago

@ytzhang1 @Sarbojit2019 This is what I got with the latest unitrace cloned from the repo (@Sarbojit2019, this does not include what you have been working on; it is just the unmodified mainline in the repo):

image

All the kernels are there.

@ytzhang1 Please give it a try.

ytzhang1 commented 4 months ago

Hi @zma2, thanks for pointing that out! I verified that the latest pti-gpu shows the kernel executions. Thanks! I'll close this issue.