oneapi-src / level-zero

oneAPI Level Zero Specification Headers and Loader
https://spec.oneapi.com/versions/latest/elements/l0/source/index.html
MIT License

[Question] does a copy op create a command queue implicitly #85

Closed abagusetty closed 2 years ago

abagusetty commented 2 years ago

Would invoking a zeCommandListAppendMemoryCopy create an additional command queue (i.e., a call to zeCommandQueueCreate) implicitly?

jandres742 commented 2 years ago

hi @abagusetty

No. It is expected that the user creates a command queue to which the list is later sent for execution. The only time a queue is created implicitly is when using an immediate list. In that case, the immediate list acts as both list and queue, so just creating the immediate list and then appending the copy operation is enough to offload the copy to the accelerator.
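
To make the distinction concrete, here is a minimal toy model of the two submission paths. It is only a sketch: `ToyCommandList`, `ToyQueue`, `ToyImmediateList` and the helper functions are hypothetical names invented for illustration, not the Level Zero API, and the immediate-list copy is modeled as synchronous for simplicity. It only counts how many queues exist at each step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Toy model only: every type here is hypothetical and is NOT the Level Zero API.
int g_queuesCreated = 0;  // counts explicit and implicit queue creations

struct ToyCopy { void* dst; const void* src; std::size_t bytes; };

// Regular list: an append only records the copy; it creates no queue.
struct ToyCommandList {
    std::vector<ToyCopy> pending;
    void appendMemoryCopy(void* dst, const void* src, std::size_t bytes) {
        pending.push_back(ToyCopy{dst, src, bytes});
    }
};

// Queue that the user creates explicitly and submits a regular list to.
struct ToyQueue {
    ToyQueue() { ++g_queuesCreated; }
    void execute(ToyCommandList& list) {
        for (const ToyCopy& c : list.pending) std::memcpy(c.dst, c.src, c.bytes);
        list.pending.clear();
    }
};

// Immediate list: creating it also creates its queue, so an append is
// submitted directly. The copy is modeled as synchronous purely for simplicity.
struct ToyImmediateList {
    ToyQueue implicitQueue;
    void appendMemoryCopy(void* dst, const void* src, std::size_t bytes) {
        std::memcpy(dst, src, bytes);
    }
};

struct Outcome { int queuesAfterAppend; int queuesAtEnd; int dstAfterAppend; int dstAtEnd; };

Outcome regularListPath() {
    g_queuesCreated = 0;
    int src = 42, dst = 0;
    ToyCommandList list;
    list.appendMemoryCopy(&dst, &src, sizeof src);
    const int queuesAfterAppend = g_queuesCreated;  // the append created no queue
    const int dstAfterAppend = dst;                 // nothing has executed yet
    ToyQueue queue;                                 // the user creates the queue
    queue.execute(list);
    return Outcome{queuesAfterAppend, g_queuesCreated, dstAfterAppend, dst};
}

Outcome immediateListPath() {
    g_queuesCreated = 0;
    int src = 42, dst = 0;
    ToyImmediateList list;                          // implicit queue created here
    list.appendMemoryCopy(&dst, &src, sizeof src);
    const int queuesAfterAppend = g_queuesCreated;
    return Outcome{queuesAfterAppend, g_queuesCreated, dst, dst};
}
```

In this model the queue count changes only when a `ToyQueue` or a `ToyImmediateList` is constructed, never inside `appendMemoryCopy`, which mirrors the point that the append itself does not create a queue.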

abagusetty commented 2 years ago

Hey @jandres742

Thanks for your response. I was profiling an app using onetrace and have created the reproducer shown below. In short, the reproducer creates two queues and issues a sycl::memcpy.

Adding a memcpy op creates a command queue, and I was curious whether this is expected in all cases. Based on your response above, does the code below create an immediate list and hence implicitly create a command queue?

Some of the profiling output is truncated.

#include <sycl/sycl.hpp>

#include <cstddef>
#include <iostream>
#include <vector>

int main(int argc, char **argv)
{
  // Take the first GPU and partition it into sub-devices by NUMA affinity domain.
  std::vector<sycl::device> allDevices = sycl::device::get_devices(sycl::info::device_type::gpu);
  auto subDevices = allDevices[0].create_sub_devices<sycl::info::partition_property::partition_by_affinity_domain>(sycl::info::partition_affinity_domain::numa);
  sycl::device *targetDev = new sycl::device(subDevices[0]);
  sycl::context *targetContext = new sycl::context(*targetDev);

  const int nQueues = 2;
  // max_mem_alloc_size is a 64-bit byte count, so keep the element count in size_t to avoid int overflow.
  const std::size_t n = targetDev->get_info<sycl::info::device::max_mem_alloc_size>() / (2*sizeof(float));
  std::cout << "n : " << n << ", (GBs) : " << n*sizeof(float) * 1.0e-09 << std::endl;

  // create the in-order queues
  sycl::queue *queue[nQueues];
  for (int i = 0; i < nQueues; ++i) {
    queue[i] = new sycl::queue(*targetContext, *targetDev, sycl::property_list{sycl::property::queue::in_order{}});
  }

  // allocate host memory and device memory
  float *h_a = new float[n];                                              // host
  float *d_a = sycl::malloc_device<float>(n, *targetDev, *targetContext); // device

  // host-to-device copy on the first queue; this is the call being profiled
  queue[0]->memcpy(d_a, h_a, n*sizeof(float));

  return 0;
}

Before (without `sycl::memcpy`)

Function, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
zeMemAllocDevice, 1, 20804520, 94.12, 20804520, 20804520, 20804520
zeCommandListCreateImmediate, 1, 964433, 4.36, 964433, 964433, 964433
zeCommandQueueCreate, 2, 314511, 1.42, 157255, 154330, 160181

After (with `sycl::memcpy`)

Function, Calls, Time (ns), Time (%), Average (ns), Min (ns), Max (ns)
zeCommandListAppendMemoryCopy, 1, 247365887, 90.94, 247365887, 247365887, 247365887
zeMemAllocDevice, 1, 20889025, 7.68, 20889025, 20889025, 20889025
zeCommandQueueExecuteCommandLists, 1, 1777965, 0.65, 1777965, 1777965, 1777965
zeCommandListCreateImmediate, 1, 995988, 0.37, 995988, 995988, 995988
zeCommandQueueCreate, 3, 481460, 0.18, 160486, 151290, 166893
zeCommandListCreate, 1, 356701, 0.13, 356701, 356701, 356701
zeEventCreate, 1, 48204, 0.02, 48204, 48204, 48204
zeInit, 1, 28380, 0.01, 28380, 28380, 28380
zeFenceCreate, 1, 19280, 0.01, 19280, 19280, 19280
zeDriverGetProperties, 1, 17108, 0.01, 17108, 17108, 17108
zeEventPoolCreate, 1, 11085, 0.00, 11085, 11085, 11085

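
One way to read the two listings is to diff their call counts. The sketch below does that with the numbers copied from the output above; the helper names (`traceBefore`, `traceAfter`, `changedCounts`, `onlyInAfter`) are hypothetical and exist only for this illustration. Because the profiling output is truncated, functions that appear only in the "after" listing are reported separately rather than treated as newly introduced calls.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using CallCounts = std::map<std::string, int>;

// Call counts copied from the "before" listing (without sycl::memcpy).
CallCounts traceBefore() {
    return {{"zeMemAllocDevice", 1},
            {"zeCommandListCreateImmediate", 1},
            {"zeCommandQueueCreate", 2}};
}

// Call counts copied from the "after" listing (with sycl::memcpy).
CallCounts traceAfter() {
    return {{"zeCommandListAppendMemoryCopy", 1},
            {"zeMemAllocDevice", 1},
            {"zeCommandQueueExecuteCommandLists", 1},
            {"zeCommandListCreateImmediate", 1},
            {"zeCommandQueueCreate", 3},
            {"zeCommandListCreate", 1},
            {"zeEventCreate", 1},
            {"zeInit", 1},
            {"zeFenceCreate", 1},
            {"zeDriverGetProperties", 1},
            {"zeEventPoolCreate", 1}};
}

// For functions present in both listings, return (after - before) where nonzero.
CallCounts changedCounts(const CallCounts& before, const CallCounts& after) {
    CallCounts delta;
    for (const auto& entry : before) {
        const auto it = after.find(entry.first);
        if (it != after.end() && it->second != entry.second) {
            delta[entry.first] = it->second - entry.second;
        }
    }
    return delta;
}

// Functions listed only in "after". With a truncated "before" listing these
// may be genuinely new calls or simply rows that were cut off.
std::vector<std::string> onlyInAfter(const CallCounts& before, const CallCounts& after) {
    std::vector<std::string> names;
    for (const auto& entry : after) {
        if (before.find(entry.first) == before.end()) {
            names.push_back(entry.first);
        }
    }
    return names;
}
```

Among the functions present in both listings, only zeCommandQueueCreate changes (2 calls before, 3 after), which is the extra queue this question is about.
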
jandres742 commented 2 years ago

@abagusetty thanks. Your example is DPC++ code, so whether and how the queue is created depends on how the DPC++/SYCL runtime implements the code with L0 calls.

So, in this case, your original question should be rephrased from

"Would invoking a zeCommandListAppendMemoryCopy create an additional command queue"

to

"Would invoking a sycl::memcpy create an additional command queue"

From the point of view of L0, an L0 queue needs to be created to execute a copy operation (zeCommandListAppendMemoryCopy) that has been appended to a list; alternatively, the copy needs to go to an immediate command list. Which of the two the code above ends up using depends on how the DPC++ runtime implements and translates the code.

I would suggest moving this issue to https://github.com/intel/llvm, where they can explain exactly the process they use to create L0 queues from DPC++ code like the one you shared above.

abagusetty commented 2 years ago

Thanks @jandres742. Closing the issue here and moving to intel/llvm