intel / pti-gpu

Profiling Tools Interfaces for GPU (PTI for GPU) is a set of Getting Started Documentation and Tools Library to start performance analysis on Intel(R) Processor Graphics easily
MIT License
202 stars 57 forks source link

[PTI-SDK] Memory copy record does not contain copied size #47

Closed Thyre closed 11 months ago

Thyre commented 1 year ago

Description

Recently, a proof of concept version for the PTI-SDK was added to this repository. I've had a look at it and tried to implement some basic handling of the interface to check if it is usable for us right now.

Right now, the field pti_view_record_memory_copy is used to describe copy actions between different types of memory (memory to device, shared memory to device etc.). The struct itself has many descriptive fields. However, I'm noticing that one crucial one is missing: The amount of memory we're actually copying!

Lets look at the struct:

typedef struct pti_view_record_memory_copy {
  pti_view_record_base _view_kind;          //!< Base record
  pti_view_memcpy_type _memcpy_type;        //!< Memory copy type
  pti_view_memory_type _mem_src;            //!< Memory type
  pti_view_memory_type _mem_dst;            //!< Memory type
  ze_command_queue_handle_t _queue_handle;  //!< Device back-end queue handle
  ze_device_handle_t  _device_handle;       //!< Device handle
  ze_context_handle_t _context_handle;      //!< Context handle
  const char* _name;                        //!< Back-end API name making a memory copy
  char _pci_address[16];                    //!< Device pci_address
  uint64_t _mem_op_id;                      //!< Memory operation ID, unique among
                                            //!< all memory operations instances
  uint32_t _correlation_id;                 //!< ID that correlates this record with records
                                            //!< of other Views
  uint32_t _thread_id;                      //!< Thread ID from which operation submitted
  uint64_t _append_timestamp;               //!< Timestamp of memory copy appending to
                                            //!< back-end command list, ns
  uint64_t _start_timestamp;                //!< Timestamp of memory copy start on device, ns
  uint64_t _end_timestamp;                  //!< Timestamp of memory copy completion on device, ns
  uint64_t _submit_timestamp;               //!< Timestamp of memory copy command list submission
                                            //!< to device, ns
} pti_view_record_memory_copy;

There's certainly some stuff not needed here (like PCI address and so on, which, if needed, could be returned by a separate function), but this is noted in the TODO already. The amount of memory is certainly crucial for some applications to potentially show bottlenecks. Just showing the time passed is not sufficient and having a separate record / callback for that would also be inconvenient.

Just as an addition, memory allocations and deletions are also not shown, but I guess this is because the Level0 part of the PTI-SDK is still missing.

maaswani commented 1 year ago

Hello @Thyre --- good catch on the missing bytes_copied.

On Level0 part (mem allocs/deletes) can you expand a bit on what you mean on the missing portion ?

Thyre commented 1 year ago

Sure! Looking at pti_view.h, we can see that there's no record type for allocations / deletions. This is also the case for OMPT for example, where no explicit callback / record for memory handling exists. However, the ompt_callback_target_data_op is reused for this, having optype values for allocations and deletions.

Since pti_view_kind also has a value for Level Zero with PTI_VIEW_LEVEL_ZERO_CALLS, my guess would be that allocations / deletions might get included there, since functions like zeMemAllocShared represent memory handling. However, PTI_VIEW_LEVEL_ZERO_CALLS is not implemented yet, resulting in PTI_ERROR_NOT_IMPLEMENTED if one tries to register this group. https://github.com/intel/pti-gpu/blob/90b9230c8bd9c00211934ec5e36183edc3aa8c1d/sdk/src/view_handler.h#L244

So my plan would be to wait until a first proof of concept of that part is implemented as well :smile:
If it's still missing then, or the implementation is not easy to work with, I can give additional feedback. I also asked colleagues to maybe have a look at the PTI-SDK PoC.

jfedorov commented 11 months ago

Hi @Thyre , number of bytes copied added https://github.com/intel/pti-gpu/blob/f27829883defd69c18e7dc2e4cdd727a2483c40c/sdk/include/pti_view.h#L188

Although PTI_VIEW_LEVEL_ZERO_CALLS is still to be done. It is on our list. thank you!

Thyre commented 11 months ago

This looks good, thanks :smile: Since this issue focuses on the _bytes field only, I would close the issue if you're fine with it.

For PTI_VIEW_LEVEL_ZERO_CALLS, a separate issue can be opened if needed. I'm also trying to build an extremely early prototype during spare cycles, so that I can give additional feedback what we might need. However, I expect that it will take quite some time to get a first prototype.

jfedorov commented 11 months ago

Thank you! yes - it would be good to close this issue. And open on PTI_VIEW_LEVEL_ZERO_CALLS. last month was a lot about infrastructure work and publishing new tool - unitrace, we will invest more into the SDK features going forward.