inducer / pyopencl

OpenCL integration for Python, plus shiny features
http://mathema.tician.de/software/pyopencl

Is PyOpenCL slower than its C OpenCL counterpart? #499

Closed doetools closed 2 years ago

doetools commented 3 years ago

Hi PyOpenCL developers and experts,

I do not see a PyOpenCL forum online, so I am posting my question here. I am working on a project to port my C OpenCL code to PyOpenCL. Is executing OpenCL kernels slower in PyOpenCL than in C?

My tests show that my PyOpenCL version is roughly 10-20% slower than the C OpenCL one. I also read a post online claiming PyOpenCL is something like 28 times slower than C.

Thanks!

inducer commented 3 years ago

PyOpenCL is not involved in the execution of the kernels. The actual on-device kernel execution time should be exactly the same between a C program using OpenCL and a Python program using PyOpenCL. If that's not the case, something is very wrong.

This means that PyOpenCL is a good fit if your code spends most of its time executing kernel code (IMO: as it should!). If the kernel times are so short that kernels complete faster than Python can enqueue them, then there can be a performance impact.
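To make the regime check concrete, here is a minimal sketch (it assumes a kernel knl with its arguments already set, a global size of (50000,), and a command queue, much like the full example further down in this thread):

import time

import pyopencl as cl

# Enqueueing is asynchronous: each launch below returns to Python
# immediately, so the Python-side launch cost can be measured
# separately from the device-side execution time.
n = 1000
t0 = time.time()
for _ in range(n):
    cl.enqueue_nd_range_kernel(queue, knl, (50000,), None)
t_enqueue = time.time() - t0  # pure Python-side cost of issuing launches
queue.finish()                # block once for all n launches
t_wall = time.time() - t0

print(f"enqueue: {t_enqueue / n * 1e6:.1f} us/launch")
print(f"wall:    {t_wall / n * 1e6:.1f} us/launch")

If the per-launch enqueue cost stays well below the per-launch wall time, Python is keeping the device busy and is not the bottleneck.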

doetools commented 3 years ago

@inducer

My application spends most of its time on the device (it is computational fluid dynamics, in case you are wondering 😄). I only counted the time the kernel is executed. To be specific:

    kernels[1].set_args(...)
    # enqueue
    cl.enqueue_nd_range_kernel(queue, kernels[1],...)

I compiled all my kernels from the program and saved them into a list, kernels.

Interestingly, I get the same numerical results from C OpenCL and PyOpenCL, so I do not think I made any programming mistakes. But I might not be using PyOpenCL the way it is meant to be used.

Could you please comment on which part of my code I should examine?

P.S. Would it make any difference if I did the following?

kernels[1](queue, group_size,...)
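For reference, the call form above is a convenience wrapper around the same two steps. A rough sketch of the correspondence, with queue, knl, the buffers, and the size arguments assumed from the surrounding code:

# Calling the kernel object directly...
event = knl(queue, global_size, local_size, a_g, b_g, res_g)

# ...does essentially the same work as the explicit two-step form:
knl.set_args(a_g, b_g, res_g)
event = cl.enqueue_nd_range_kernel(queue, knl, global_size, local_size)

So it should not change what runs on the device, only how the launch is spelled on the Python side.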
inducer commented 3 years ago

Are you saying your kernel times are different? Switch your command queue to enable profiling and get kernel execution times in both settings. They should match pretty closely.
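In PyOpenCL that looks roughly like this (context, knl, and global_size are assumed from the surrounding code, with the kernel's arguments already set):

# Create the queue with profiling enabled, then read the device-side
# timestamps (reported in nanoseconds) off the event once it completes.
queue = cl.CommandQueue(
    context, properties=cl.command_queue_properties.PROFILING_ENABLE
)
event = cl.enqueue_nd_range_kernel(queue, knl, global_size, None)
event.wait()
kernel_seconds = (event.profile.end - event.profile.start) * 1e-9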

doetools commented 3 years ago

@inducer My bad, I think I confused kernel time and execution time. So, in your experience, is the kernel time (the time cost of set_args, enqueue, and execution) of PyOpenCL longer than that of C OpenCL?

inducer commented 3 years ago

The Python bits (setting and preparing arguments) are slower, but this time can (and should) be hidden by kernel execution, which occurs asynchronously on the device.

doetools commented 3 years ago

> The Python bits (setting and preparing arguments) are slower, but this time can (and should) be hidden by kernel execution, which occurs asynchronously on the device.

Makes sense!

doetools commented 3 years ago

@inducer consider the following code

queue = cl.CommandQueue(...)  # profiling enabled in the queue properties
t1 = time.time()
kernels[1].set_args(...)
# enqueue
event = cl.enqueue_nd_range_kernel(queue, kernels[1], ...)
event.wait()
t2 = time.time()

total_time = t2 - t1
execution_time = (event.profile.end - event.profile.start) * 1e-9

I searched online and found that the difference between total_time and execution_time is due to data transfer. What exactly does data transfer mean here, given that we have already created the buffers?

Is there any way to reduce this data-transfer time? I have a while loop in which I execute the above code thousands of times.

Here is an example. On my computer, the total time cost is about 0.004 seconds while the execution time cost is 0.0007 seconds. This seems to indicate that most of the time is spent in knl.set_args and cl.enqueue_nd_range_kernel. Any idea how to speed that up?

import numpy as np
import pyopencl as cl
import time

# get platforms and devices
platforms = cl.get_platforms()
devices = platforms[0].get_devices()
# create context
context = cl.Context(
    devices=[devices[0]], properties=[(cl.context_properties.PLATFORM, platforms[0])],
)
program = cl.Program(
    context,
    """
__kernel void sum_matrices(
    __global const float *a_g, __global const float *b_g, __global float *res_g)
{
  int gid = get_global_id(0);
  res_g[gid] = a_g[gid] + b_g[gid];
}
""",
).build()
# create queue
queue = cl.CommandQueue(
    context, properties=cl.command_queue_properties.PROFILING_ENABLE
)

a_np = np.random.rand(50000).astype(np.float32)
b_np = np.random.rand(50000).astype(np.float32)
mf = cl.mem_flags
a_g = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(context, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)

res_g = cl.Buffer(context, mf.WRITE_ONLY, a_np.nbytes)
# knl = program.sum_matrices
knl = getattr(program, "sum_matrices")
index = 0
start = time.time()
execution_time = 0.0
while index < 100:
    knl.set_args(a_g, b_g, res_g)
    event = cl.enqueue_nd_range_kernel(queue, knl, a_np.shape, None)
    event.wait()
    execution_time += (event.profile.end - event.profile.start) * 1e-9
    # print(f"\t execution: {execution_time}")
    index += 1
end = time.time()
print(f"execution time cost: {execution_time}")
print(f"total time cost: {end - start}")
isuruf commented 3 years ago

> This seems to indicate that most of the time is spent in knl.set_args and cl.enqueue_nd_range_kernel. Any idea how to speed that up?

Not really. If you do the following, you'll see that there's a difference even without those. I'd guess the difference is in the time it takes for the OpenCL implementation to schedule and run the kernel. Do you have an equivalent C function that reproduces this?

index = 0  # reset the counter, otherwise the loop below never runs
start = time.time()
execution_time = 0.0
events = []
while index < 100:
    knl.set_args(a_g, b_g, res_g)
    event = cl.enqueue_nd_range_kernel(queue, knl, a_np.shape, None)
    events.append(event)
    index += 1

end = time.time()
print(f"total setup cost: {end - start}")

start = time.time()
for event in events:
    event.wait()
    execution_time += (event.profile.end - event.profile.start) * 1e-9
end = time.time()
print(f"execution time cost: {execution_time}")
print(f"total time cost: {end - start}")