Hey @flytrex-vadim -- thanks for asking a question. @mnicely is our benchmark and performance guru, but a couple of observations:
When you move data from CPU to GPU with cp.asarray(cpu_based_signal), you're also facing the overheads of things like cudaMalloc under the hood. You can directly generate data on the GPU with CuPy with something like signal_t = cp.random.rand(10_000_000, dtype=cp.float32).
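For instance, here's a minimal sketch contrasting the two paths (the array size and dtype are illustrative):

import numpy as np
import cupy as cp

# Path 1: host-generated data; cp.asarray pays a device allocation
# (cudaMalloc) plus a host-to-device copy.
cpu_based_signal = np.random.rand(10_000_000).astype(np.float32)
gpu_signal = cp.asarray(cpu_based_signal)

# Path 2: generate directly on the GPU and skip the transfer entirely.
signal_t = cp.random.rand(10_000_000, dtype=cp.float32)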
Matt -- do you also mind posting an example of how to enable kernel pre-compilation and caching for correlate2d?
Hi @awthomp, thanks for answering. A few comments:
%time corr_0_cusig = cusignal.correlate(sig_0, pulse, mode='valid', method='direct')
Wall time: 1e+03 µs
%time corr_0_cusig_cpu = corr_0_cusig.get()
Wall time: 22 ms
# ==============================
%time corr_0_cusig = cusignal.correlate(sig_0, pulse, mode='valid', method='fft')
Wall time: 20 ms
%time corr_0_cusig_cpu = corr_0_cusig.get()
Wall time: 1 ms
In fact, this lack of blocking seems to prevent the 'direct' method from being used in a loop.
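For illustration, a per-iteration sync like this (a sketch with random data shaped like mine, not my actual notebook code) is what seems to be required; note that kernels issued on the same stream do execute in order, so the sync mainly matters for timing each iteration:

import numpy as np
import cupy as cp
import cusignal

sig = cp.asarray(np.random.rand(181, 55000).astype(np.float32))
pulse = cp.asarray(np.random.rand(14999).astype(np.float32))

results = []
for i in range(sig.shape[0]):  # row-by-row 1D correlation
    corr = cusignal.correlate(sig[i], pulse, mode='valid', method='direct')
    cp.cuda.Stream.null.synchronize()  # block until this iteration's kernel finishes
    results.append(corr)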
Hi @flytrex-vadim,
I will try to recreate your scenario early next week and check if I'm missing any blocking.
In the meantime, can you use %timeit instead of %time? We've found %time provides misleading results with GPU profiling.
An even better way would be to use CuPy's NVTX markers.
import numpy as np
import cupy as cp
from cupy import prof

size = 10_000_000  # example size; adjust to your data

@cp.prof.TimeRangeDecorator()
def test_baseline():
    h_a = np.ones(size, np.int64)
    h_b = np.ones(size, np.int64)
And profile with Nsight Systems
nsys profile --sample=none --trace=cuda,nvtx --stats=true python3 <python script>
To precompile the kernels, try:
cusignal._signaltools.precompile_kernels(
    [np.float32], [GPUBackend.CUPY], [GPUKernel.CORRELATE],
)
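Alternatively, a simple warm-up call pays the JIT compilation cost once up front (a sketch; sizes here are illustrative):

import cupy as cp
import cusignal

sig = cp.random.rand(55000, dtype=cp.float32)
pulse = cp.random.rand(14999, dtype=cp.float32)

# The first call triggers kernel compilation; subsequent calls hit the cache.
cusignal.correlate(sig, pulse, mode='valid', method='direct')
cp.cuda.Stream.null.synchronize()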
Hey @flytrex-vadim -- what are the sizes and dtypes for sig_0 and pulse in your example? If you have any code you feel comfortable sharing, that might give us a better idea of what's going on.
One way to avoid some of the overhead in the CPU -> GPU transfer is to make use of our shared memory function, which prevents pages from being swapped by the OS (pinned) and makes them virtually addressable by the GPU (mapped). Here's an example with the polyphase resampler; this basically creates a DMA path between CPU and GPU.
Be careful how much memory you allocate here though -- as you can easily cause a kernel panic if you allocate too much.
import cupy as cp
import numpy as np
import cusignal
start = 0
stop = 10
num_samps = int(1e8)
resample_up = 2
resample_down = 3
# Generate Data on CPU
cx = np.linspace(start, stop, num_samps, endpoint=False)
cy = np.cos(-cx**2/6.0)
# Create shared memory between CPU and GPU and load with CPU signal (cy)
gpu_signal = cusignal.get_shared_mem(num_samps, dtype=np.float64)
%%time
# Move data to GPU/CPU shared buffer and run polyphase resampler
gpu_signal[:] = cy
gf = cusignal.resample_poly(gpu_signal, resample_up, resample_down, window=('kaiser', 0.5))
Thanks guys,
The array shapes I'm using for 2D: signal (181, 55000), pulse (1, 14999), correlation (181, 40002), with the 1D case being row slices of the same. I tried both float64 and float32.
Here's the snapshot of my experimental notebook: correlate_ipynb.zip
I'll try looking into shared memory and other timing mechanisms
Btw, speaking of shared memory, how would I pass the allocated shared memory buffer to a function to be used for output?
Hey @flytrex-vadim -- once you've allocated the shared memory buffer and loaded it with data, you can use it like any normal CuPy/cuSignal array.
For example, above:
# Create shared memory between CPU and GPU. This is like `numpy.zeros` and basically creates an
# empty memory slot for `num_samps` of `np.float64` data. Remember, the GPU and CPU can access
# this memory block, so you could run both numpy/scipy and cupy/cusignal calls on it.
gpu_signal = cusignal.get_shared_mem(num_samps, dtype=np.float64)
# Now migrate data into the empty buffer. In the case of your file read, you'd read your file into
# this newly created buffer.
gpu_signal[:] = cy
# Perform cusignal/cupy (or scipy/numpy) function on this `gpu_signal`. It's now an allocated array;
# the only difference is, again, it can be used for GPU and CPU processing.
gf = cusignal.resample_poly(gpu_signal, resample_up, resample_down, window=('kaiser', 0.5))
The way CuPy/cuSignal migrates data is basically via cp.asarray. This functionality migrates data from CPU to GPU if and only if that data isn't already detected on the GPU. Since you'd be calling a function with an array that's already mapped to the GPU, cuSignal either skips the migration entirely or, because a direct path between GPU and CPU has been established, performs it without the penalty of an extra copy via the bounce buffer.
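A quick sketch of that first point (cp.asarray is effectively a no-op for data already on the device):

import cupy as cp

d_x = cp.arange(10, dtype=cp.float32)
d_y = cp.asarray(d_x)  # already on the GPU: no allocation, no transfer
assert d_y.data.ptr == d_x.data.ptr  # same underlying device memory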
Hi @awthomp, I've played with variations of the example above. However, they seem to address only the input parameters -- the values passed to the function. My question was referring to the output parameter -- the returned values. I would assume the returned array is (by default) allocated on the GPU and would need to be transferred to CPU-accessible memory.
And I don't see any way to control the allocation of the return buffer.
This is a good point, @flytrex-vadim, and it's been suggested on another thread -- basically, we can ensure memory external to cuSignal functions is zero-copy, but anything internal to a function is abstracted away. If we create some internal array, for example, can we have that be zero-copy too? Further -- you're correct: all output is assumed to be on the GPU, and there's currently no feature to return an array that's already been transferred to the host, or, for that matter, to make the output array zero-copy rather than a standard CuPy array.
I'll file an issue about this in the next few days and point you to the conversation.
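In the meantime, one interim sketch -- not zero-copy and not a cuSignal feature; the device-to-host copy still happens, the result just lands in a pre-allocated shared buffer -- assuming the usual ceil(n * up / down) output length of resample_poly:

import numpy as np
import cupy as cp
import cusignal

num_samps = int(1e7)  # smaller than the earlier example, to keep the pinned allocation modest
resample_up, resample_down = 2, 3

# Input buffer, as in the earlier example
gpu_signal = cusignal.get_shared_mem(num_samps, dtype=np.float64)
cx = np.linspace(0, 10, num_samps, endpoint=False)
gpu_signal[:] = np.cos(-cx**2 / 6.0)

# Pre-allocate a second shared buffer sized for the output
out_len = -(-num_samps * resample_up // resample_down)  # ceil(n * up / down)
shared_out = cusignal.get_shared_mem(out_len, dtype=np.float64)

gf = cusignal.resample_poly(gpu_signal, resample_up, resample_down,
                            window=('kaiser', 0.5))
shared_out[:] = cp.asnumpy(gf)  # explicit device-to-host copy into the shared buffer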
@flytrex-vadim -- I created a feature request addressing one of your comments here: https://github.com/rapidsai/cusignal/issues/76. Let's move discussion there.
Do you mind if I close this issue?
I think the two original questions remain:
Hey @flytrex-vadim. I was working out of your notebook and have a few observations:
It looks like your 1D correlation example is pitting scipy.signal.correlate vs cusignal.correlate, with input data sizes for sig_0 at (55000,) and pulse at (14999,). Here, I see similar behavior on a P100 as you: SciPy Signal is outperforming cuSignal on these data sizes.
For your 2D correlation, I see that you're benchmarking scipy.signal.correlate2d with input data sizes (181, 55000) and (1, 14000), but you seem to be doing the 2D correlation with cusignal row by row? Is there a reason you didn't use cusignal.correlate2d here? On my P100, this code finishes in ~55 ms.
To directly address your points:
is there a missing wait or block in the 'direct' mode of correlation that would cause the function to return before the calculation is complete?
CuPy launches kernels asynchronously, but I think we've been effectively blocking before results are returned. You can always add a numba.cuda.synchronize() to confirm. Let me know if this sync fixes your issue.
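For example, a sketch of timing with an explicit synchronize (random data standing in for your arrays; the warm-up call keeps JIT compilation out of the measurement):

import time
import numpy as np
import cupy as cp
import numba.cuda
import cusignal

sig_0 = cp.asarray(np.random.rand(55000).astype(np.float32))
pulse = cp.asarray(np.random.rand(14999).astype(np.float32))

# Warm up once so kernel compilation isn't counted
cusignal.correlate(sig_0, pulse, mode='valid', method='direct')
numba.cuda.synchronize()

t0 = time.perf_counter()
corr = cusignal.correlate(sig_0, pulse, mode='valid', method='direct')
numba.cuda.synchronize()  # block until the kernel actually finishes
print(f"correlate (direct, synced): {time.perf_counter() - t0:.4f} s")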
is there a particular performance bottleneck in the array sizes I was using (signal: 55000, pulse: 14999) that would result in particularly slow execution, data transfer excluded? Do I need to do zero padding to an optimal size, or does it happen automagically?
I can confirm the perf here. We can profile this specific use case, but I'm curious whether this just isn't enough data to see the perf improvement we're used to.
Is there a reason you didn't use cusignal.correlate2d here? On my P100, this code finishes in ~55ms.
This is very interesting. On my GTX 1050 Ti it takes 1.4 seconds. I was trying both correlate2d and the line-by-line option to see if there's any speedup in one of them (1.4 s line-by-line and 2.8 s with correlate2d).
You can always add a numba.cuda.synchronize() to confirm. Let me know if this sync fixes your issue.
Yes, I can confirm that adding the synchronize call waits for the operation to complete (tested with correlate2d).
You can always run our benchmark tests on your 1050 Ti and let us know what you see.
From HEAD/python, run: pytest -v --benchmark-only -k correlate2d
@flytrex-vadim, cuSignal functions are nonblocking by design, the same way a C++ CUDA kernel launch is nonblocking. And if you don't pass a non-default CuPy stream, everything is launched in the default stream, which is blocking.
So a kernel launch is non-blocking, but it launches in a stream that is blocking??? Yes, it can be a little confusing, but in a heterogeneous system it means that the host code launches work (e.g., a kernel) and then control returns to the host. Therefore, host code and device code can run asynchronously. Since the default stream is blocking, if you were to run a blocking call like a cudaMemcpy, the host code would be blocked until the copy is finished.
So let's think about that for a second... When you run %time on individual calls like correlate2d ('direct'), where the bulk of the compute is on the GPU, as soon as you launch the kernel, execution flow returns to the CPU and your stopwatch records a time. If you look at the source code, you'll see there are a few more calls in ('fft').
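Here's an illustrative sketch of that launch-then-return behavior (absolute times will vary by machine):

import time
import cupy as cp

a = cp.random.rand(4096, 4096, dtype=cp.float32)

t0 = time.perf_counter()
b = a @ a  # kernel launch; control returns to the host immediately
t_launch = time.perf_counter() - t0

cp.cuda.Stream.null.synchronize()  # wait for the GPU to actually finish
t_done = time.perf_counter() - t0

print(f"after launch: {t_launch:.6f} s, after sync: {t_done:.6f} s")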
Attached is sample code using our NVTX markers. We use Nsight Systems to profile the code.
nsys profile --sample=none --trace=cuda,nvtx --stats=true python3 quicktest.py
Notice the output below.
Time(%) Total Time Instances Average Minimum Maximum Range
------- -------------- ---------- -------------- -------------- -------------- ---------------------------
98.0 26162924140 5 5232584828.0 5221783723 5242290976 Run signal.correlate2d
1.4 366266550 5 73253310.0 116566 365778317 Copy signal to GPU
0.6 156400519 5 31280103.8 504379 154166494 Run cusignal.correlate2d
0.0 1997599 5 399519.8 350902 591358 Create CPU signal
0.0 501849 5 100369.8 90972 136938 Create CPU filter
0.0 484525 5 96905.0 72400 163420 Copy filter to GPU
You can see that I ran each call 5 times. You should notice that the CPU calls are pretty consistent, while there's a swing in the GPU calls. This is because the GPU is warming up. You want to sample several runs, or drop the first few, to get a good number. Once the GPU is warmed up, and as long as no stochastic algorithms are being executed, the times will be consistent.
To get a better understanding, I highly suggest reviewing the output with the Nsight Systems GUI! https://devblogs.nvidia.com/transitioning-nsight-systems-nvidia-visual-profiler-nvprof/
I've attached the qdrep file from this example
From HEAD/python, run: pytest -v --benchmark-only -k correlate2d
I'm getting:
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --benchmark-min-rounds=25 --benchmark-warmup=on --benchmark-warmup-iterations=10 --benchmark-disable-gc --benchmark-skip --benchmark-only
Running pytest -v from HEAD\ works fine, all tests pass.
Be sure to conda install or pip install pytest-benchmark.
@flytrex-vadim -- We've filed another feature request based on your questions. https://github.com/rapidsai/cusignal/issues/77
Thanks again for the great discussion.
Happy to help with some noob testing :)
You can always run our benchmark tests on your 1050 TI and let us know what you see.
Results attached: bench_correlate2d.txt
Are there any reference results for comparison and/or guidelines on how to interpret them?
Hi @flytrex-vadim, thanks for the benchmark. I found a bug today that causes the output to be sorted incorrectly; the accurate results are the medians. I'll be pushing a fix in the next few days.
Closing this issue; we certainly appreciate the discussion and the 2 feature requests generated from it!
I'm trying to do a simple benchmark of cuSignal vs scipy.signal on correlation. It seems that correlate2d completes immediately, and the buffer transfer takes 2.8 seconds. That makes no sense for a buffer size of 80 MB. Could it be that the correlation is evaluated lazily, only when the buffer transfer is requested?