odlgroup / odlcuda

C++ backend for ODL
GNU General Public License v3.0

slow maximum function #24

Open mehrhardt opened 7 years ago

mehrhardt commented 7 years ago

I noticed that the maximum function is very slow in odlcuda on the GPU. In fact, it is slower than computing the maximum on the CPU. Please see my example test case below. Any ideas why that is and how to fix it?

import odl

# CPU space
X = odl.rn(300 * 10**6, dtype='float32')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)

# CUDA space
X = odl.rn(300 * 10**6, dtype='float32', impl='cuda')
x = 0.5 * X.one()
y = X.one()
%time x.ufuncs.maximum(1, out=y)
%time x.ufuncs.log(out=y)
CPU, maximum:
CPU times: user 346 ms, sys: 200 µs, total: 346 ms
Wall time: 347 ms

CPU, log:
CPU times: user 1.44 s, sys: 0 ns, total: 1.44 s
Wall time: 1.43 s

CUDA, maximum:
CPU times: user 838 ms, sys: 341 ms, total: 1.18 s
Wall time: 1.18 s

CUDA, log:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 91.1 µs
mehrhardt commented 7 years ago

I am getting the feeling that I should write my own kernels for the functions where I need very good performance. Do you have any pointers for me on how to do this with ODL?

adler-j commented 7 years ago

Sorry for the delay here.

Sadly there are basically two ways to do this:

  1. Use pure C++ and make a PR here (then just follow what I've done so far). This would most likely be the fastest, but takes some time to do.

  2. Use Python and do some wrapping. I'd prefer using something like theano or numba.

Something like this should work:

import odl
import numba
import numba.cuda
import ctypes
import numpy as np

def as_numba_arr(el):
    """Convert ``el`` to numba array."""
    # Wrap the element's raw GPU pointer without copying any data
    gpu_data = numba.cuda.cudadrv.driver.MemoryPointer(
        context=numba.cuda.current_context(),
        pointer=ctypes.c_ulong(el.data_ptr),
        size=el.size)

    # Expose the GPU buffer as a contiguous 1d numba device array
    return numba.cuda.cudadrv.devicearray.DeviceNDArray(
        shape=el.shape,
        strides=(el.dtype.itemsize,),
        dtype=el.dtype,
        gpu_data=gpu_data)

# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap
numba_el = as_numba_arr(el)

# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'],
                 target='cuda')
def maximum(a, b):
    return max(a, b)

# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))

This seems to be doing quite well:

spc = odl.rn(10**8, impl='cuda')
el = odl.phantom.white_noise(spc)
numba_el = as_numba_arr(el)

with odl.util.Timer('numpy'):
    for i in range(100):
        el2 = np.maximum(el, 3.0)

with odl.util.Timer('numba'):
    for i in range(100):
        el2 = maximum(numba_el, 3.0)

with odl.util.Timer('numba in place'):
    for i in range(100):
        maximum(numba_el, 3.0, out=numba_el)

Which gives:

                         numpy :     39.821 
                         numba :      0.652 
                numba in place :      0.589 

Which is a 67x speedup.

adler-j commented 7 years ago

I've now added util.as_numba_arr in 5738eb687af804918c9c6c374e023bd16a59d095; the above code then becomes simply:

import numba
import odl
import odlcuda

# Create ODL space
spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])

# Wrap
numba_el = odlcuda.util.as_numba_arr(el)

# Define kernel using numba
@numba.vectorize(['float32(float32, float32)'],
                 target='cuda')
def maximum(a, b):
    return max(a, b)

# Compute max(el, 3.0)
print(maximum(numba_el, 3.0))
mehrhardt commented 7 years ago

Great, I think I will first give the second option a shot.

Regarding the first one:

Use pure C++ and make a PR here (then just follow what I've done so far). This would most likely be the fastest, but takes some time to do.

I am not sure I follow completely. With PR you mean Pull Request, right? Where do I find what you have done so far in this direction?

adler-j commented 7 years ago

I am not sure I follow completely. With PR you mean Pull Request, right? Where do I find what you have done so far in this direction?

Sure, I mean pull request, and what I've done so far is in this file:

https://github.com/odlgroup/odlcuda/blob/master/odlcuda/cuda/UFunc.cu

kohr-h commented 7 years ago

Random remark: this will be super-fast with the new GPU backend and ufuncs, on the order of 30 ms for the above code (a 1200x speedup). Working on it.

mehrhardt commented 7 years ago

That sounds amazing. Can't wait to have it! What is a realistic time frame that you need for this? Days, weeks, months?

kohr-h commented 7 years ago

Well, the big chunk of work for the CPU tensors is done and lying around as a PR ready for review. Another PR, which requires work on the order of weeks, adds the GPU backend as a full-fledged Numpy alternative. The only issue is that the current version of that backend has no native ufuncs, which is yet another PR I'm working on. Some issues to solve there, but probably also on the order of weeks (optimistic as I am :-))

mehrhardt commented 6 years ago

Any news on this front @kohr-h, @adler-j ?

mehrhardt commented 6 years ago

While your example works for me:

spc = odl.rn(5, impl='cuda')
el = spc.element([1, 2, 3, 4, 5])
numba_el = odlcuda.util.as_numba_arr(el)

doing the same with uniform_discr does not:

spc = odl.uniform_discr([0], [1], [5], impl='cuda')
el = spc.element([1, 2, 3, 4, 5])
numba_el = odlcuda.util.as_numba_arr(el)

File "", line 1, in numba_el = odlcuda.util.as_numba_arr(el)

File "/home/me404/.local/lib/python2.7/site-packages/odlcuda-0.5.0-py2.7.egg/odlcuda/util.py", line 37, in as_numba_arr pointer=ctypes.c_ulong(el.data_ptr),

AttributeError: 'DiscreteLpElement' object has no attribute 'data_ptr'

adler-j commented 6 years ago

It seems the as_numba_arr function needs a special case added, e.g. if isinstance(el, DiscreteLpElement): el = el.ntuple; that should solve the problem. A sketch is below.
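
A minimal sketch of that patched function, combining the suggested special case with the wrapper from above; the import path of DiscreteLpElement is an assumption for the ODL version of that era:

import ctypes
import numba.cuda
from odl.discr.lp_discr import DiscreteLpElement  # import path is an assumption

def as_numba_arr(el):
    """Convert ``el`` to a numba device array (sketch with the special case)."""
    if isinstance(el, DiscreteLpElement):
        # Unwrap the underlying CUDA n-tuple, which exposes ``data_ptr``
        el = el.ntuple

    # Wrap the raw GPU pointer without copying any data
    gpu_data = numba.cuda.cudadrv.driver.MemoryPointer(
        context=numba.cuda.current_context(),
        pointer=ctypes.c_ulong(el.data_ptr),
        size=el.size)

    return numba.cuda.cudadrv.devicearray.DeviceNDArray(
        shape=el.shape,
        strides=(el.dtype.itemsize,),
        dtype=el.dtype,
        gpu_data=gpu_data)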

Regarding updates we're working on it :)

mehrhardt commented 6 years ago

Great, I am making progress on this end with numba. One proposal and one question:

What about writing the numba transfer as element.asnumba(), to be more in line with el.asarray()?

Second, what about transferring the numba array back to ODL?

adler-j commented 6 years ago

What about writing the numba transfer as element.asnumba(), to be more in line with el.asarray()?

That would have to be something like asnumbacuda or so, but why not!

Second, what about transferring the numba array back to ODL?

The current implementation is copy-less, in that the created numba array is simply a view of the ODL array. To copy back properly, I think the only "simple" option would be to first convert the numba array to a numpy array, and then assign this to the ODL vector. Somewhat more advanced, we could override the __setitem__ function to work with numba, which would allow odl_el[:] = numba_el to work.
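
A minimal sketch of the simple option, using numba's standard DeviceNDArray.copy_to_host() (odl_el and numba_el are the names from the sentence above):

# One device-to-host transfer, then assignment into the ODL element
odl_el[:] = numba_el.copy_to_host()

Note that since the numba array is a view of the ODL element's GPU buffer, in-place kernels such as maximum(numba_el, 3.0, out=numba_el) already update the ODL element directly, so no copy is needed in that case.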

mehrhardt commented 6 years ago

What about writing the numba transfer as element.asnumba(), to be more in line with el.asarray()?

That would have to be something like asnumbacuda or so, but why not!

Maybe even better, what about el.asarray(impl='numbacuda')?

Second, what about transferring the numba array back to ODL?

The current implementation is copy-less, in that the created numba array is simply a view of the ODL array. To copy back properly, I think the only "simple" option would be to first convert the numba array to a numpy array, and then assign this to the ODL vector. Somewhat more advanced, we could override the __setitem__ function to work with numba, which would allow odl_el[:] = numba_el to work.

Yes, this would be a good option.

kohr-h commented 6 years ago

Maybe even better, what about el.asarray(impl='numbacuda')?

Hm, I've always thought that asarray is so darn unspecific about implementation, maybe we can exploit that now?

adler-j commented 6 years ago

I agree with that asarray proposal, looks good and reduces clutter!
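
A hypothetical sketch of what that dispatch could look like, written here as a free function; the name and the impl strings are purely illustrative, not an agreed API:

import odlcuda

def asarray(el, impl='numpy'):
    """Return the element's data as an array of the given implementation."""
    if impl == 'numpy':
        # Existing behavior: a host numpy array
        return el.asarray()
    elif impl == 'numbacuda':
        # Zero-copy device view, as in odlcuda.util.as_numba_arr
        return odlcuda.util.as_numba_arr(el)
    else:
        raise ValueError('unknown impl {!r}'.format(impl))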

mehrhardt commented 6 years ago

I also realized that this functionality is useful for the CPU, too! I am not sure how much speedup to expect, but numba allows you to compute things more elegantly and memory-efficiently.
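
For instance, a minimal CPU sketch along the lines of the CUDA kernel above (maximum_cpu and the parallel target are just one possible choice):

import numba
import odl

# Same ufunc as before, but compiled for multi-threaded CPU execution
@numba.vectorize(['float32(float32, float32)'], target='parallel')
def maximum_cpu(a, b):
    return max(a, b)

spc = odl.rn(10**6, dtype='float32')
el = odl.phantom.white_noise(spc)

# el.asarray() gives a numpy array that numba can consume directly
out = maximum_cpu(el.asarray(), 3.0)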