mehrhardt opened this issue 7 years ago (status: Open)
I kind of get the feeling that I should write my own kernels for some functions I want to have very good performance for. Do you have any pointers for me on how to do this with ODL?
Sorry for the delay here.
Sadly, there are basically two ways to do this:

1. Use pure C++ and make a PR here (then just follow what I've done so far). This would most likely be the fastest, but takes some time to do.
2. Use Python and do some wrapping. I'd prefer using something like theano or numba.
Something like this should work:
```python
import ctypes

import numba
import numba.cuda
import numpy as np

import odl


def as_numba_arr(el):
    """Convert ``el`` to numba array."""
    gpu_data = numba.cuda.cudadrv.driver.MemoryPointer(
        context=numba.cuda.current_context(),
        pointer=ctypes.c_ulong(el.data_ptr),
        size=el.size)
    return numba.cuda.cudadrv.devicearray.DeviceNDArray(
        shape=el.shape,
        strides=(el.dtype.itemsize,),
        dtype=el.dtype,
        gpu_data=gpu_data)


# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap (no copy; the numba array is a view of the ODL data)
numba_el = as_numba_arr(el)


# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'], target='cuda')
def maximum(a, b):
    return max(a, b)


# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))
```
This seems to be doing quite well:
```python
spc = odl.rn(10**8, impl='cuda')
el = odl.phantom.white_noise(spc)
numba_el = as_numba_arr(el)

with odl.util.Timer('numpy'):
    for i in range(100):
        el2 = np.maximum(el, 3.0)

with odl.util.Timer('numba'):
    for i in range(100):
        el2 = maximum(numba_el, 3.0)

with odl.util.Timer('numba in place'):
    for i in range(100):
        maximum(numba_el, 3.0, out=numba_el)
```
Which gives:
```
numpy          : 39.821
numba          :  0.652
numba in place :  0.589
```
Which is a 67x speedup.
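As an aside, `odl.util.Timer` above is just a context manager that prints elapsed wall-clock time. A minimal stand-in (a hypothetical reimplementation for readers without ODL installed, not ODL's actual code) could look like:

```python
import time
from contextlib import contextmanager

import numpy as np


@contextmanager
def timer(name):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print('{:<15s}: {:7.3f}'.format(name, time.perf_counter() - start))


# Usage: time 100 repetitions of a NumPy ufunc call
a = np.random.rand(10**6)
with timer('numpy'):
    for _ in range(100):
        b = np.maximum(a, 0.5)
```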
I now added `util.as_numba_arr` in 5738eb687af804918c9c6c374e023bd16a59d095; with it, the above code becomes simply:
```python
import numba
import odlcuda

# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap
numba_el = odlcuda.util.as_numba_arr(el)


# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'], target='cuda')
def maximum(a, b):
    return max(a, b)


# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))
```
Great, I think I will first give the second option a shot.
Regarding the first one:
> Use pure C++ and make a PR here (then just follow what I've done so far). This would most likely be the fastest, but takes some time to do.
I am not sure I follow completely. With PR you mean Pull Request, right? Where do I find what you have done so far in this direction?
> I am not sure I follow completely. With PR you mean Pull Request, right? Where do I find what you have done so far in this direction?
Sure, I mean pull request, and there is this file:
https://github.com/odlgroup/odlcuda/blob/master/odlcuda/cuda/UFunc.cu
Random remark, this will be super-fast with the new GPU backend and ufuncs, in the order of 30 ms for the above code (speedup 1200x). Working on it.
That sounds amazing. Can't wait to have it! What is a realistic time frame that you need for this? Days, weeks, months?
Well, the big chunk of work for the CPU tensors is done and lying as a PR ready for review. Another PR which requires work in the order of weeks adds the GPU backend as full-value Numpy alternative. The only issue is that the current version of that backend has no native ufuncs, which is yet another PR I'm working on. Some issues to solve there, but probably also in the order of weeks (optimistic as I am :-))
Any news on this front @kohr-h, @adler-j ?
While your example works for me
```python
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])
numba_el = odlcuda.util.as_numba_arr(el)
```
doing the same with `uniform_discr` does not:
```python
spc = odl.uniform_discr([0], [1], [5], impl='cuda')
el = spc.element([1, 2, 3, 4, 5])
numba_el = odlcuda.util.as_numba_arr(el)
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    numba_el = odlcuda.util.as_numba_arr(el)
  File "/home/me404/.local/lib/python2.7/site-packages/odlcuda-0.5.0-py2.7.egg/odlcuda/util.py", line 37, in as_numba_arr
    pointer=ctypes.c_ulong(el.data_ptr),
AttributeError: 'DiscreteLpElement' object has no attribute 'data_ptr'
```
It seems the `as_numba_arr` function needs a special case added, e.g. `if isinstance(el, DiscreteLpElement): el = el.ntuple`; that should solve the problem.
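For illustration, that special case amounts to an unwrapping step at the top of `as_numba_arr`. The sketch below uses dummy stand-in classes, since the real check would be `isinstance(el, DiscreteLpElement)` with the class imported from odl:

```python
class NtupleDummy(object):
    """Stand-in for an ODL ntuple-style element exposing a raw data pointer."""
    data_ptr = 0xDEADBEEF


class DiscreteLpElementDummy(object):
    """Stand-in for DiscreteLpElement, which wraps an underlying ntuple."""
    ntuple = NtupleDummy()


def unwrap(el):
    """Return the wrapped ntuple if present, else ``el`` itself.

    The real implementation would use ``isinstance(el, DiscreteLpElement)``
    rather than duck typing.
    """
    return el.ntuple if hasattr(el, 'ntuple') else el


# Both element types now yield an object with a data_ptr attribute
print(hex(unwrap(DiscreteLpElementDummy()).data_ptr))
```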
Regarding updates we're working on it :)
Great, I am making progress on this end with numba. One proposal and one question:
What about we write the numba transfer as `element.asnumba()` to be more in line with `el.asarray()`?
Second, what about transferring the numba array back to ODL?
> What about we write the numba transfer as `element.asnumba()` to be more in line with `el.asarray()`?
That would have to be something like `asnumbacuda` or so, but why not!
> Second, what about transferring the numba array back to ODL?
The current implementation is copy-less, in that the created numba array is simply a view of the ODL array. To copy back properly, I think the only "simple" option would be to first convert the numba array to a NumPy array and then assign this to the ODL vector. Somewhat more advanced, we could override the `__setitem__` method to work with numba arrays, which would allow `odl_el[:] = numba_el` to work.
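That `__setitem__` idea can be sketched with dummy classes. The only assumption is that the device array offers something like numba's `DeviceNDArray.copy_to_host()`; the class names below are illustrative, not ODL's actual API:

```python
import numpy as np


class FakeDeviceArray(object):
    """Stand-in for a numba DeviceNDArray living on the GPU."""

    def __init__(self, arr):
        self._arr = np.asarray(arr, dtype=float)

    def copy_to_host(self):
        """Return a host-side NumPy copy of the data."""
        return self._arr.copy()


class VectorWithNumbaSetitem(object):
    """Toy ODL-like vector whose __setitem__ accepts device arrays."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

    def __setitem__(self, indices, values):
        # If the value looks like a device array, copy it to host first
        if hasattr(values, 'copy_to_host'):
            values = values.copy_to_host()
        self.data[indices] = values


odl_el = VectorWithNumbaSetitem([0.0, 0.0, 0.0])
numba_el = FakeDeviceArray([1.0, 2.0, 3.0])
odl_el[:] = numba_el  # works via the __setitem__ override
print(odl_el.data)  # [1. 2. 3.]
```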
> What about we write the numba transfer as `element.asnumba()` to be more in line with `el.asarray()`?
>
> That would have to be something like `asnumbacuda` or so, but why not!
Maybe even better, what about `el.asarray(impl='numbacuda')`?
> Second, what about transferring the numba array back to ODL?
>
> The current implementation is copy-less, in that the created numba array is simply a view of the ODL array. To copy back properly, I think the only "simple" option would be to first convert the numba array to a NumPy array and then assign this to the ODL vector. Somewhat more advanced, we could override the `__setitem__` method to work with numba arrays, which would allow `odl_el[:] = numba_el` to work.
Yes, this would be a good option.
> Maybe even better, what about `el.asarray(impl='numbacuda')`?
Hm, I've always thought that `asarray` is so darn unspecific about implementation; maybe we can exploit that now?
I agree with that asarray proposal, looks good and reduces clutter!
I also realized that this functionality is useful on the CPU, too! I'm not sure how much speedup to expect, but numba lets you compute things more elegantly and memory-efficiently.
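One concrete CPU-side saving, independent of numba, is the `out=` argument of ufuncs: just like the in-place numba call above, it reuses an existing buffer instead of allocating a temporary on every call. A plain NumPy sketch:

```python
import numpy as np

a = np.array([1.0, 5.0, 2.0, 4.0])

# Out-of-place: allocates a fresh result array on every call
b = np.maximum(a, 3.0)

# In-place: writes the result into a's own buffer, no temporary
np.maximum(a, 3.0, out=a)

print(b)  # [3. 5. 3. 4.]
print(a)  # [3. 5. 3. 4.]
```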
I noticed that the maximum function is very slow in odlcuda on the GPU. In fact, it is slower than computing the maximum on the CPU. Please see my example test case below. Any ideas why that is and how to fix it?