intel / fpga-runtime-for-opencl

Intel® FPGA Runtime for OpenCL™ Software Technology
Other
34 stars 68 forks source link

Device Global Copy Kernel Support #56

Open sherry-yuan opened 2 years ago

sherry-yuan commented 2 years ago

Theoretically it should be (given the lazy programming feature is available in runtime), but need to double check / at least layout how things should be called.

The precise questions are:

  1. Are there any issues in programming the FPGA, given a cl_program object?
  2. Any other foreseeable issues with launching the kernel from within the Runtime?
  3. What other information should the compiler provide through the autodiscovery string (now there is mapping of device_global name to pointer, all properties associated with device global)

CC: @zibaiwan @pcolberg @aditikum

sherry-yuan commented 2 years ago

Short Answers

  1. Are there any issues in programming the FPGA, given a cl_program object? (No issue)
  2. Any other foreseeable issues with launching the kernel from within the Runtime? (No issue that has no workaround as of now)
  3. What other information should the compiler provide through the autodiscovery string (now there is mapping of device_global name to pointer, all properties associated with device global) (see considerations below for things needed when launching a kernel).

Currently Drafted Solution

  1. Probably need to program, which we can borrow the logic from clCreateProgramWithBinaryAndProgramDeviceIntelFPGA (there should be no issue) https://github.com/intel/fpga-runtime-for-opencl/blob/950f21dd079dfd55a473ba4122a4a9dca450e36f/src/acl_program.cpp#L544-L597
  2. clCreateKernelIntelFPGA(program, name, errcode_ret)
  3. extract dest pointer from autodiscovery
  4. clSetKernelArgIntelFPGA (need to set src, dest, offset, size)
  5. clEnqueueTask
  6. Depending on whether blocking: wait/ not wait for event_wait_list

Considerations

  1. [This is impl detail, feel free to skip] Make sure there is no double lock when calling other opencl function, otherwise it will hung.
  2. Do we know if clEnqueueWirteBuffer is called before passing buffer into the function. (although I am not too worried about it given it will be enqueued anyways when enqueue kernel: l_copy_and_adjust_arguments_for_device)
  3. The runtime is not going to know the device memory interface through the platform global var without receiving aocx. That's at least the case for simulation flow.
  4. Above potentially bring a problem when we need to allocate buffer for the buffer that is passed in (and especially when that buffer is targeted at multi-memory system, since memory operation do not have direct access to autodiscovery), depending on whether buffer allocation is deferred.
  5. What was the consideration of not reusing l_enqueue_mem_transfer? https://github.com/intel/fpga-runtime-for-opencl/blob/d9df7a9ed68d1343342666a0466d154561599a1a/src/acl_mem.cpp#L4706-L5154
  6. If buffer alloc is deferred / not yet allocated, then enqueueing this kernel will result in an extra copy that we can avoid. i.e we can directly copy from buffer's host to device_global rather than host -> device -> device global. But maybe we don't care about performance.

Resources

  1. Implementation Doc
  2. OpenCL Spec
  3. Sycl Spec
sherry-yuan commented 2 years ago

Summary

This is the summary version of above comment, see above for more detail. Feel free to comment if there is anything missed.

Short Answers

  1. Are there any issues in programming the FPGA, given a cl_program object? (No issue)
  2. Any other foreseeable issues with launching the kernel from within the Runtime? (No issue as of now)
  3. What other information should the compiler provide through the autodiscovery string (now there is mapping of device_global name to pointer, all properties associated with device global) (this probably doesn't matter for mem copy kernel: launching a kernel also requires arguments for workgroup size etc).

Launch kernel arguments: l_enqueue_kernel_with_type(commandqueue, kernel, ?workdim, _?global_workoffset, _?global_worksize, _?local_worksize, num_events_in_wait_list, event_waitlist, event, ?CL_COMMAND_MIGRATE_MEMOBJECTS);

Always pass in constant for size and offsets?

Main Questions

  1. What was the consideration of not reusing l_enqueue_mem_transfer (is it because dest is not buffer)? https://github.com/intel/fpga-runtime-for-opencl/blob/d9df7a9ed68d1343342666a0466d154561599a1a/src/acl_mem.cpp#L4706-L5154
  2. If buffer alloc is deferred / not yet allocated, then enqueueing this kernel will result in an extra copy that we can avoid. i.e we can directly copy from buffer's host to device_global rather than host -> device -> device global. But maybe we don't care about performance.
sherry-yuan commented 2 years ago

Edit: Discard this comment as there is a better solution below provided by Artem.

In regards to the question of how work_size (both local and global should be decided).

  1. If workgroup size is known at compile time instead of run time, simpler hardware is generated.
  2. work group size is fixed, so number of work groups actually determines the total amount of work. This is limited by the size of each global memory.

The current proposed solution for determining workgroup sizes:

  1. Get work group size from the autodiscovery
  2. get pseudochannel size from auto discovery

Given the board and from autodiscovery or a constant in runtime (as long it is determined during compile time). And each work item writes that is bytes vectorized by Then each workgroup's size is = . need space for input and output buffers which two separate buffers both within one global memory address space. So size of 1 of these buffers is therefore / 2.

Then the formula for determining number of work groups constrained by size of global memory is: = ( / 2) /

sherry-yuan commented 2 years ago

Thanks Artem @artemrad for the info! There is no need for clEnqueueNDRange. clEnqueueTask will do what we wanted (without the need to know workgroup size)

Precise answer below """ In general you do no need WG size. Launch this kernel as a task, rather than a NDRange. So default to {1, 1, 1} for WG size and WI sizes; More specifically do what you would do if a kernel was launched with clEnqueueTask() instead of clEnqueueNDRangeKernel() """

sherry-yuan commented 2 years ago

Push to 2022.4 given that's the overall new target. Next steps, run sycl l3 set on the device global change with autodiscovery change, merge in the autodiscovery+runtime change. May depend on pushing specs in.