intel / pti-gpu

Profiling Tools Interfaces for GPU (PTI for GPU) is a set of Getting Started Documentation and Tools Library to start performance analysis on Intel(R) Processor Graphics easily
MIT License

[PTI-SDK] Device / context-based buffers instead of thread-based buffers #54

Open Thyre opened 8 months ago

Thyre commented 8 months ago

Device / context-based buffers instead of thread-based buffers

While continuing to evaluate how we may be able to use PTI-SDK to support Level Zero as an adapter in Score-P, I've run into the following issue:

Right now, PTI-SDK collects events for different kinds of activities on accelerators, which can be enabled through ptiViewSetCallbacks. At some point during program execution, the implemented buffer_request function will be called. When a flush is requested or when a buffer is full, the SDK dispatches a callback for buffer evaluation. This is all fine. However, I noticed a detail that significantly complicates handling programs that use multiple threads to dispatch events.

To illustrate the issue, we can look at the following (very simple) OpenMP offload program:

int main(void)
{
    #pragma omp parallel num_threads( 2 )
    {
        unsigned long long int x = 0;
        for(int i = 0; i < 10; ++i) {
            #pragma omp target map(tofrom: x)
            {
                ++x;
            }
        }
    }
}

We have two threads working in parallel on a single accelerator. This works, and events are correctly captured by PTI-SDK. Now, let's look at how they are captured.

How PTI-SDK PoC currently captures events

The event handling can generally be found in view_handler.h. For simplicity, we focus on MemCopyEvent, but the others follow the same principle.

At the end of the event method, a call to Instance().InsertRecord(...) is made. This is a templated method with the following code:

template <typename T>
inline void InsertRecord(const T& view_record) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "One can only insert trivially copyable types into the "
                  "ViewBuffer (view records)");
    auto& buffer = view_buffers_[std::this_thread::get_id()];

    if (buffer.IsNull()) {
        RequestNewBuffer(buffer);
    }

    buffer.Insert(view_record);
    static_assert(SizeOfLargestViewRecord() != 0, "Largest record not available on compile time");
    if (buffer.FreeBytes() >= SizeOfLargestViewRecord()) {
        // There's space to insert more records. No need for swap.
        return;
    }

    buffer_queue_.Push(std::move(buffer));
}

Note how the buffer is determined: through the unique id of the thread writing the event. In the parallel OpenMP region, this is the executing thread. Looking further at how the buffers are implemented, we end up here: using ViewBufferTable = ThreadSafeHashTable<KeyT, ViewBuffer>;. This means that events are stored in buffers accessed through a hash table keyed by thread id.

What the current implementation does

Regardless of the devices, contexts, and command queues used by a thread, events are stored on a per-thread basis. This can cause issues if tools require events to be written in a certain order. In Score-P, for example, we require our locations (where we store our events) to write events in timestamp order. With PTI-SDK, however, this is quite difficult. Let's look at the output of the example above:

Click to open

```
Kernel  id=15  start=1704727757061487291 ns  end=1704727757061490207 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Kernel  id=16  start=1704727757061632913 ns  end=1704727757061635829 ns  queue=0x30e63c0       device=0x2324710  thread=670096
Kernel  id=18  start=1704727757061857128 ns  end=1704727757061859523 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Kernel  id=20  start=1704727757061907886 ns  end=1704727757061910281 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Kernel  id=22  start=1704727757061985554 ns  end=1704727757061987949 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Kernel  id=24  start=1704727757062035503 ns  end=1704727757062038003 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Kernel  id=26  start=1704727757062084320 ns  end=1704727757062086715 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Kernel  id=28  start=1704727757062133252 ns  end=1704727757062135647 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Kernel  id=30  start=1704727757062183094 ns  end=1704727757062185489 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Kernel  id=32  start=1704727757062232119 ns  end=1704727757062234514 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Kernel  id=33  start=1704727757062280055 ns  end=1704727757062282555 ns  queue=0x7f5438017ae0  device=0x2324710  thread=670104
Reached End of buffer
Memory  id=1   op=zeCommandListAppendMemoryCopy(D2M)  start=1704727757057692365 ns  end=1704727757057696219 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=2   op=zeCommandListAppendMemoryCopy(D2M)  start=1704727757059441071 ns  end=1704727757059443883 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=3   op=zeCommandListAppendMemoryCopy(D2M)  start=1704727757059629202 ns  end=1704727757059632952 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=4   op=zeCommandListAppendMemoryCopy(D2M)  start=1704727757059699457 ns  end=1704727757059702790 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=5   op=zeCommandListAppendMemoryCopy(D2M)  start=1704727757059772795 ns  end=1704727757059776232 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=6   op=zeCommandListAppendMemoryCopy(M2D)  start=1704727757059906260 ns  end=1704727757059910114 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=7   op=zeCommandListAppendMemoryCopy(M2D)  start=1704727757060523755 ns  end=1704727757060527088 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=8   op=zeCommandListAppendMemoryCopy(M2D)  start=1704727757060623218 ns  end=1704727757060626447 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=9   op=zeCommandListAppendMemoryCopy(M2D)  start=1704727757060690014 ns  end=1704727757060693347 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=10  op=zeCommandListAppendMemoryCopy(M2D)  start=1704727757060761374 ns  end=1704727757060765124 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=11  op=zeCommandListAppendMemoryCopy(M2D)  start=1704727757060830073 ns  end=1704727757060833406 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=12  op=zeCommandListAppendMemoryCopy(M2D)  start=1704727757060907619 ns  end=1704727757060911264 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Memory  id=13  op=zeCommandListAppendMemoryCopy(M2D)  start=1704727757061004800 ns  end=1704727757061008445 ns  queue=0x30e63c0  device=0x2324710  context=0x2342b50  thread=670096
Kernel  id=14  start=1704727757061145847 ns  end=1704727757061148763 ns  queue=0x30e63c0  device=0x2324710  thread=670096
Kernel  id=17  start=1704727757061840664 ns  end=1704727757061842955 ns  queue=0x30e63c0  device=0x2324710  thread=670096
Kernel  id=19  start=1704727757061897810 ns  end=1704727757061900205 ns  queue=0x30e63c0  device=0x2324710  thread=670096
Kernel  id=21  start=1704727757061948646 ns  end=1704727757061951041 ns  queue=0x30e63c0  device=0x2324710  thread=670096
Kernel  id=23  start=1704727757061996369 ns  end=1704727757061998660 ns  queue=0x30e63c0  device=0x2324710  thread=670096
Kernel  id=25  start=1704727757062046902 ns  end=1704727757062049297 ns  queue=0x30e63c0  device=0x2324710  thread=670096
Kernel  id=27  start=1704727757062095791 ns  end=1704727757062098186 ns  queue=0x30e63c0  device=0x2324710  thread=670096
Kernel  id=29  start=1704727757062144432 ns  end=1704727757062146723 ns  queue=0x30e63c0  device=0x2324710  thread=670096
Kernel  id=31  start=1704727757062196928 ns  end=1704727757062199323 ns  queue=0x30e63c0  device=0x2324710  thread=670096
Reached End of buffer
```

The output is pretty large, but it reveals an anomaly: the following entry appears in the buffer for Kernel Thread Id = 670104, even though the record belongs to a different thread.

Found Kernel Record
Ze Kernel Start Time: 1704727757061632913 ns
Ze Kernel End Time: 1704727757061635829 ns
Kernel Queue Handle: 0x30e63c0
Kernel Device Handle: 0x2324710
Kernel Id : 16
Kernel Thread Id : 670096

If we evaluate the first buffer first and then the second one, we will end up with timestamp errors coming from Score-P, since 1704727757057692365 (first event of second buffer) < 1704727757061632913 (wrong event in first buffer).
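To make the consequence concrete: with per-thread buffers, a tool with a strict timestamp-order requirement (like Score-P's locations) effectively has to collect all buffers and merge-sort the records before replaying them. A minimal sketch, with an illustrative record layout rather than the actual PTI-SDK view record types:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative, simplified record type; field names are NOT the
// actual PTI-SDK view record layout.
struct KernelRecord {
  uint64_t start_ns;
  uint64_t end_ns;
  uint32_t thread_id;
};

// Flatten all per-thread buffers and sort by start timestamp so a
// consumer that requires monotonic timestamps can replay the records.
// Note this forces the tool to keep every record in memory until the
// last buffer has been flushed.
std::vector<KernelRecord> MergeBuffers(
    const std::vector<std::vector<KernelRecord>>& buffers) {
  std::vector<KernelRecord> merged;
  for (const auto& buf : buffers) {
    merged.insert(merged.end(), buf.begin(), buf.end());
  }
  std::sort(merged.begin(), merged.end(),
            [](const KernelRecord& a, const KernelRecord& b) {
              return a.start_ns < b.start_ns;
            });
  return merged;
}
```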

The issue

From my understanding, each thread will execute events on a separate command queue if possible. My question here is: can a command queue be used by multiple threads at the same time? In general, I am a bit skeptical about using thread ids as the key. If a buffer is not completely filled but contains events for a context, device, or command queue, and is only flushed at the end of the program, performance tools have to retain all events from the entire program execution, because otherwise an event might be missed or cause other problems.

For the behavior shown above, records appear to be stored in the wrong buffer, as I wouldn't expect to see another thread's id in that buffer.

Side note

It seems like this isn't the only issue with multiple threads. When running the program multiple times, I've also run into the following error:

a.out: /opt/apps/sources/PTI-SDK/9ee0e46cafa145856eaeeefe5f26ec046462300f/sdk/src/levelzero/ze_collector.h:1446: void ZeCollector::GetHostTime(const ZeKernelCommand *, const ze_kernel_timestamp_result_t &, uint64_t &, uint64_t &): Assertion `host_start > command->submit_time' failed.
[1]    669066 IOT instruction  ./a.out

Reproducer

You can use the following code to reproduce the issue: pti_sdk_openmp_world.zip

To run the example, use the following command:

$ source ~/Env/oneAPI.sh 
$ icpx main.cpp -fiopenmp -fopenmp-targets=spir64 -lpti -lpti_view
$ ./a.out

Environment

Thyre commented 8 months ago

Even with one OpenMP thread, we can run into the issue where the event order is somehow mixed up or timestamps make no sense.

[...]
[Score-P - 1] src/adapters/level0/scorep_level0_event_device.c:157: [428180] PTI-SDK Kernel 56 __omp_offloading_10303_1be4c1a__Z4main_l9 @ :0
Start = 1704810304039967747 -> End = 1704810304040177642 | Append = 1704810304039961263 | Submit = 1704810304039967747
[Score-P - 1] src/adapters/level0/scorep_level0_event_device.c:157: [428180] PTI-SDK Kernel 57 __omp_offloading_10303_1be4c1a__Z4main_l9 @ :0
Start = 1704810304040164963 -> End = 1704810304040168296 | Append = 1704810304040157397 | Submit = 1704810304040164963
[Score-P] src/measurement/scorep_location_management.c:455: Fatal: Bug 'timestamp < location->last_timestamp': Wrong timestamp order on location 2: 1704810304040177642 (last recorded) > 1704810304040164963 (current). This might be an indication of thread migration. Please pin your threads. Using a SCOREP_TIMER different from tsc might also help.

Since those two kernels run on the same context, device, and command queue, should this result even be possible?

mschilling0 commented 8 months ago

RE: Per-thread buffer.

So, for applications with multiple threads, wouldn't operating on the buffer on a per-thread basis reduce the amount of synchronization required to insert records? It also simplifies the code and potentially improves the performance of the SDK (less time spent holding a lock).

I do understand that the records can and will be out of order in a lot of cases. Are you suggesting we should guarantee the order in which records are returned to the user?

If we could guarantee the device / context are not shared across threads, I could see a similar scheme working. However, we don't have a way of determining that from a queue alone. Maybe a per-command-list buffer could work?

From my understanding, each thread will execute events on a separate command queue, if possible. My question here is: Is it possible that command queues are used by multiple threads at the same time?

The Level Zero spec says it's valid for multiple host threads to share the same command queue: https://spec.oneapi.io/level-zero/latest/core/PROG.html#command-queues.

Command queue submission is free-threaded, allowing multiple Host threads to share the same command queue.

Maybe command lists could work?

There is no implicit binding of command lists to Host threads. Therefore, an application may share a command list handle across multiple Host threads. However, the application is responsible for ensuring that multiple Host threads do not access the same command list simultaneously.

But we would be restricted from accessing the buffer from outside of an operation on the command list.

Maybe you could flush buffers before and after the operation? Or maybe we could introduce our own "thread id" that increases monotonically and can be used as the key of a container that guarantees an order when the buffers are flushed (like std::map)?
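The monotonically increasing "thread id" idea can be sketched in a few lines. The function name and placement are hypothetical, not part of PTI-SDK:

```cpp
#include <atomic>
#include <cstdint>

// Hand each thread a small, monotonically increasing ordinal on first
// use. Unlike std::this_thread::get_id(), these ordinals have a total
// order that matches thread arrival, so an ordered container (e.g.
// std::map<uint32_t, ViewBuffer>) keyed by them iterates buffers in a
// stable, predictable order at flush time.
inline uint32_t ThreadOrdinal() {
  static std::atomic<uint32_t> next{0};
  // Initialized exactly once per thread, on that thread's first call.
  thread_local const uint32_t ordinal = next.fetch_add(1);
  return ordinal;
}
```

Repeated calls from the same thread return the same ordinal, so it can replace std::this_thread::get_id() as a hash-table or map key without further synchronization.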

RE: timestamp issue. Will look into it more, thanks!

Thyre commented 8 months ago

Thanks for the feedback.

You're absolutely right that recording into per-thread buffers increases the performance of the SDK and reduces overhead significantly. My biggest concern with per-thread buffers, though, is that one thread might hold a single event for one device and only dispatch it at the very end of program execution. Since tool developers (ideally) do not want to drop that event, they need to think about how to handle all the other events in the meantime. The logical solution is to store them somewhere, but that increases memory demands and may force evaluation to wait until the end of program execution before the events can be processed.

Switching to buffers based on contexts, devices, or command queues / command lists increases the overhead but improves the situation. At the command queue level I would expect the lowest additional overhead; it also matches the level at which we write events to our internal locations. We could still run into ordering issues when multiple threads write to the same buffer, but this is the case for CUPTI as well, where the documentation mentions:

But CUPTI doesn't guarantee any ordering of the activities in the activity buffer as activity records for few activity kinds are added lazily.

Tool developers would have to take this into account and temporarily store the events per CUPTI stream until all gaps in the event ids have been filled by outstanding buffers. But since we're working at the stream (CUPTI) / command queue (Level Zero) level, those events should be closer together, and they do not block other command queues from being processed, since each uses a distinct buffer.

I guess that command lists would work as well. If we receive a flush after a few command lists have finished, we could be sure that all events are there and only need to sort them. This could be done either by timestamps or by the _kernel_id and _mem_op_id fields (sycl_runtime and overhead records are missing this kind of id, though).
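The id-based gap-filling described above could look roughly like this: a reorder stage that holds records back until the id sequence is contiguous, then releases them in order. The class name and the record representation (just the id here) are illustrative only:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Records arrive with monotonically increasing ids (in the spirit of
// _kernel_id / _mem_op_id) but possibly out of order across buffer
// flushes. Hold them back until the id sequence is contiguous.
class Reorderer {
 public:
  // Accepts one record and returns every record that can now be
  // released in id order (possibly none).
  std::vector<uint64_t> Push(uint64_t id) {
    pending_[id] = id;
    std::vector<uint64_t> released;
    // std::map iterates in key order, so the smallest pending id is
    // always at begin().
    while (!pending_.empty() && pending_.begin()->first == next_id_) {
      released.push_back(pending_.begin()->second);
      pending_.erase(pending_.begin());
      ++next_id_;
    }
    return released;
  }

 private:
  uint64_t next_id_ = 1;                  // ids start at 1 in the output above
  std::map<uint64_t, uint64_t> pending_;  // id -> record (here just the id)
};
```

Pushing id 2 first releases nothing; pushing id 1 afterwards releases both, in order.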

I hope that this is understandable 😄

mschilling0 commented 6 months ago

Understood! I think this is a valid configuration option for PTI and your views on floating records are helpful.

And maybe we can even have fewer floating records by using the new queue_id provided by the compiler?

Putting it in our backlog; we'll keep this issue open until we decide if/when to implement this.

jfedorov commented 2 weeks ago

Hi @Thyre. Thank you for submitting this issue. We implemented an (at least partial) fix in 948248583a143dff2c416be539a4abb62f3746fe. Records are now populated into the buffer corresponding to the thread that created the operation reported by the record. For example, all SYCL runtime operations and GPU device operations initiated by the same CPU thread will be found in one and the same buffer. The fix, however, introduced a global lock that diminishes the benefit of having different buffers, which suggests we will return to this issue at some point. Indeed, you suggested buffers per context or queue. We are still evaluating alternatives; things are not simple in the presence of SYCL ops, Level Zero ops, immediate command lists, etc.

But at least for now it fixes the issue you reported above (where device operation records submitted by different threads were found in one buffer). If this helps you to some extent to use PTI-SDK in Score-P, that would be great. If you have any other comments, please share them.

Thyre commented 2 weeks ago

Thanks a lot for the update @jfedorov. I'll check how these changes can improve using PTI-SDK for Score-P. It might take some time until I can give feedback due to other topics that need to be addressed first.