Hey @flytrex-vadim -- thanks for asking a question. @mnicely is our benchmark and performance guru, but a couple of observations:
When you move data from CPU to GPU with cp.asarray(cpu_based_signal), you're also facing the overheads of things like cudaMalloc under the hood. You can directly generate data on the GPU with CuPy with something like signal_t = cp.random.rand(10_000_000, dtype=cp.float32).
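For instance, here's a minimal sketch contrasting the two paths (the array size and dtype are illustrative):

import numpy as np
import cupy as cp

# Path 1: host-generated data; cp.asarray pays a device allocation
# (cudaMalloc) plus a host-to-device copy.
cpu_based_signal = np.random.rand(10_000_000).astype(np.float32)
gpu_signal = cp.asarray(cpu_based_signal)

# Path 2: generate directly on the GPU and skip the transfer entirely.
signal_t = cp.random.rand(10_000_000, dtype=cp.float32)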
Matt -- do you also mind posting an example of how to enable kernel pre-compilation and caching for correlate2d?
Hi @awthomp, thanks for answering. A few comments:
%time corr_0_cusig = cusignal.correlate(sig_0, pulse, mode='valid', method='direct')
Wall time: 1e+03 µs
%time corr_0_cusig_cpu = corr_0_cusig.get()
Wall time: 22 ms
# ==============================
%time corr_0_cusig = cusignal.correlate(sig_0, pulse, mode='valid', method='fft')
Wall time: 20 ms
%time corr_0_cusig_cpu = corr_0_cusig.get()
Wall time: 1 ms
In fact, this lack of blocking seems to prevent the 'direct' method from being used in a loop.
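For illustration, a per-iteration sync like this (a sketch with random data shaped like mine, not my actual notebook code) is what seems to be required; note that kernels issued on the same stream do execute in order, so the sync mainly matters for timing each iteration:

import numpy as np
import cupy as cp
import cusignal

sig = cp.asarray(np.random.rand(181, 55000).astype(np.float32))
pulse = cp.asarray(np.random.rand(14999).astype(np.float32))

results = []
for i in range(sig.shape[0]):  # row-by-row 1D correlation
    corr = cusignal.correlate(sig[i], pulse, mode='valid', method='direct')
    cp.cuda.Stream.null.synchronize()  # block until this iteration's kernel finishes
    results.append(corr)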
Hi @flytrex-vadim,
I will try to recreate your scenario early next week and check if I'm missing any blocking.
In the meantime, can you use %timeit instead of %time? We've found %time provides misleading results with GPU profiling.
An even better way would be to use CuPy's NVTX markers.
import numpy as np
import cupy as cp
from cupy import prof

size = 10_000_000  # example size; adjust to your data

@cp.prof.TimeRangeDecorator()
def test_baseline():
    h_a = np.ones(size, np.int64)
    h_b = np.ones(size, np.int64)
And profile with Nsight Systems
nsys profile --sample=none --trace=cuda,nvtx --stats=true python3 <python script>
To precompile the kernels, try:
cusignal._signaltools.precompile_kernels(
    [np.float32], [GPUBackend.CUPY], [GPUKernel.CORRELATE],
)
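Alternatively, a simple warm-up call pays the JIT compilation cost once up front (a sketch; sizes here are illustrative):

import cupy as cp
import cusignal

sig = cp.random.rand(55000, dtype=cp.float32)
pulse = cp.random.rand(14999, dtype=cp.float32)

# The first call triggers kernel compilation; subsequent calls hit the cache.
cusignal.correlate(sig, pulse, mode='valid', method='direct')
cp.cuda.Stream.null.synchronize()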
Hey @flytrex-vadim -- what are the sizes and dtypes for sig_0 and pulse in your example? If you have any code you feel comfortable sharing, that might give us a better idea of what's going on.
One way to avoid some of the overhead in the CPU -> GPU transfer is to make use of our shared memory function, which prevents pages from being swapped by the OS (pinned) and makes them virtually addressable by the GPU (mapped). Here's an example with the polyphase resampler; this basically creates a DMA path between CPU and GPU.
Be careful how much memory you allocate here though -- as you can easily cause a kernel panic if you allocate too much.
import cupy as cp
import numpy as np
import cusignal
start = 0
stop = 10
num_samps = int(1e8)
resample_up = 2
resample_down = 3
# Generate Data on CPU
cx = np.linspace(start, stop, num_samps, endpoint=False)
cy = np.cos(-cx**2/6.0)
# Create shared memory between CPU and GPU and load with CPU signal (cy)
gpu_signal = cusignal.get_shared_mem(num_samps, dtype=np.float64)
%%time
# Move data to GPU/CPU shared buffer and run polyphase resampler
gpu_signal[:] = cy
gf = cusignal.resample_poly(gpu_signal, resample_up, resample_down, window=('kaiser', 0.5))
Thanks guys,
The array shapes I'm using for 2D: signal (181, 55000), pulse (1, 14999), correlation (181, 40002), with the 1D case being row slices of the same. I tried both float64 and float32.
Here's the snapshot of my experimental notebook: correlate_ipynb.zip
I'll try looking into shared memory and other timing mechanisms
Btw, speaking of shared memory, how would I pass the allocated shared memory buffer to a function to be used for output?
Hey @flytrex-vadim -- once you've allocated the shared memory buffer and loaded it with data, you can use it like any normal CuPy/cuSignal array.
For example, above:
# Create shared memory between CPU and GPU. This is like `numpy.zeros` and basically creates an
# empty memory slot for `num_samps` of `np.float64` data. Remember, the GPU and CPU can access
# this memory block, so you could run both numpy/scipy and cupy/cusignal calls on it.
gpu_signal = cusignal.get_shared_mem(num_samps, dtype=np.float64)
# Now migrate data into the empty buffer. In the case of your file read, you'd read your file into
# this newly created buffer.
gpu_signal[:] = cy
# Perform cusignal/cupy (or scipy/numpy) function on this `gpu_signal`. It's now an allocated array;
# the only difference is, again, it can be used for GPU and CPU processing.
gf = cusignal.resample_poly(gpu_signal, resample_up, resample_down, window=('kaiser', 0.5))
The way CuPy/cuSignal migrates data is basically via cp.asarray. This functionality migrates data from CPU to GPU if and only if that data isn't already detected on the GPU. Since you'd be calling a function with an array that's already mapped to the GPU, cuSignal either skips the migration entirely or, because a direct path between GPU and CPU has been established, performs it without the penalty of an extra copy via the bounce buffer.
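A quick sketch of that first point (cp.asarray is effectively a no-op for data already on the device):

import cupy as cp

d_x = cp.arange(10, dtype=cp.float32)
d_y = cp.asarray(d_x)  # already on the GPU: no allocation, no transfer
assert d_y.data.ptr == d_x.data.ptr  # same underlying device memory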
Hi @awthomp, I've played with variations of the example above. However, they seem to address only the input parameters -- the values passed to the function. My question was referring to the output parameter -- the returned values. I would assume the returned array is (by default) allocated on the GPU and would need to be transferred to CPU-accessible memory.
And I don't see any way to control the allocation of the return buffer.
This is a good point, @flytrex-vadim, and it's been suggested on another thread -- basically, we can ensure memory external to cuSignal functions is zero-copy, but anything internal to a function is abstracted away. If we create some internal array, for example, can we have that be zero-copy too? Further -- you're correct: all output is assumed to be on the GPU, and there's currently no feature to return an array that's already been transferred to the host, or, for that matter, to make the output array zero-copy rather than a standard CuPy array.
I'll file an issue about this in the next few days and point you to the conversation.
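In the meantime, one interim sketch -- not zero-copy and not a cuSignal feature; the device-to-host copy still happens, the result just lands in a pre-allocated shared buffer -- assuming the usual ceil(n * up / down) output length of resample_poly:

import numpy as np
import cupy as cp
import cusignal

num_samps = int(1e7)  # smaller than the earlier example, to keep the pinned allocation modest
resample_up, resample_down = 2, 3

# Input buffer, as in the earlier example
gpu_signal = cusignal.get_shared_mem(num_samps, dtype=np.float64)
cx = np.linspace(0, 10, num_samps, endpoint=False)
gpu_signal[:] = np.cos(-cx**2 / 6.0)

# Pre-allocate a second shared buffer sized for the output
out_len = -(-num_samps * resample_up // resample_down)  # ceil(n * up / down)
shared_out = cusignal.get_shared_mem(out_len, dtype=np.float64)

gf = cusignal.resample_poly(gpu_signal, resample_up, resample_down,
                            window=('kaiser', 0.5))
shared_out[:] = cp.asnumpy(gf)  # explicit device-to-host copy into the shared buffer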
@flytrex-vadim -- I created a feature request addressing one of your comments here: https://github.com/rapidsai/cusignal/issues/76. Let's move discussion there.
Do you mind if I close this issue?
I think the two original questions remain:
Hey @flytrex-vadim. I was working out of your notebook and have a few observations:
It looks like your 1D correlation example is pitting scipy.signal.correlate vs cusignal.correlate, with input data sizes for sig_0 at (55000,) and pulse at (14999,). Here, I see similar behavior on a P100 as you: SciPy Signal is outperforming cuSignal on these data sizes.
For your 2D correlation, I see that you're benchmarking scipy.signal.correlate2d with input data sizes (181, 55000) and (1, 14000), but you seem to be doing the 2D correlation with cusignal row by row? Is there a reason you didn't use cusignal.correlate2d here? On my P100, this code finishes in ~55 ms.
To directly address your points:
is there a missing wait or block in the 'direct' mode of correlation that would cause the function to return before the calculation is complete?
CuPy launches kernels asynchronously, but I think we've been effectively blocking before results are returned. You can always add a numba.cuda.synchronize() to confirm. Let me know if this sync fixes your issue.
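For example, a sketch of timing with an explicit synchronize (random data standing in for your arrays; the warm-up call keeps JIT compilation out of the measurement):

import time
import numpy as np
import cupy as cp
import numba.cuda
import cusignal

sig_0 = cp.asarray(np.random.rand(55000).astype(np.float32))
pulse = cp.asarray(np.random.rand(14999).astype(np.float32))

# Warm up once so kernel compilation isn't counted
cusignal.correlate(sig_0, pulse, mode='valid', method='direct')
numba.cuda.synchronize()

t0 = time.perf_counter()
corr = cusignal.correlate(sig_0, pulse, mode='valid', method='direct')
numba.cuda.synchronize()  # block until the kernel actually finishes
print(f"correlate (direct, synced): {time.perf_counter() - t0:.4f} s")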
is there a particular performance bottleneck in the array sizes I was using (signal: 55000, pulse: 14999) that would result in particularly slow execution, data transfer excluded? Do I need to do zero padding to an optimal size, or does it happen automagically?
I can confirm the perf here. We can profile this specific use case, but I'm curious whether this just isn't enough data to see the perf improvement we're used to.
Is there a reason you didn't use cusignal.correlate2d here? On my P100, this code finishes in ~55ms.
This is very interesting. On my GTX 1050 Ti it takes 1.4 seconds. I was trying both correlate2d and the line-by-line option to see if there's any speedup in one of them (1.4 s line-by-line and 2.8 s with correlate2d).
You can always add a numba.cuda.synchronize() to confirm. Let me know if this sync fixes your issue.
Yes, I can confirm that adding the synchronize call waits for the operation to complete (tested with correlate2d).
You can always run our benchmark tests on your 1050 Ti and let us know what you see.
From HEAD/python, run: pytest -v --benchmark-only -k correlate2d
@flytrex-vadim, cuSignal functions are nonblocking by design, the same way a C++ CUDA kernel launch is nonblocking. And if you don't pass a non-default CuPy stream, everything is launched in the default stream, which is blocking.
So a kernel launch is non-blocking, but it launches in a stream that is blocking??? Yes, it can be a little confusing, but in a heterogeneous system it means that the host code launches work (e.g., a kernel) and then control returns to the host. Therefore, host code and device code can run asynchronously. Since the default stream is blocking, if you were to run a blocking call like a cudaMemcpy, the host code would be blocked until the copy is finished.
So let's think about that for a second... When you run %time on individual calls like correlate2d ('direct'), where the bulk of the compute is on the GPU, as soon as you launch the kernel, execution flow returns to the CPU and your stopwatch records a time. If you look at the source code, you'll see there are a few more calls in ('fft').
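Here's an illustrative sketch of that launch-then-return behavior (absolute times will vary by machine):

import time
import cupy as cp

a = cp.random.rand(4096, 4096, dtype=cp.float32)

t0 = time.perf_counter()
b = a @ a  # kernel launch; control returns to the host immediately
t_launch = time.perf_counter() - t0

cp.cuda.Stream.null.synchronize()  # wait for the GPU to actually finish
t_done = time.perf_counter() - t0

print(f"after launch: {t_launch:.6f} s, after sync: {t_done:.6f} s")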
Attached is sample code using our NVTX markers. We use Nsight Systems to profile the code.
nsys profile --sample=none --trace=cuda,nvtx --stats=true python3 quicktest.py
Notice the output below.
Time(%) Total Time Instances Average Minimum Maximum Range
------- -------------- ---------- -------------- -------------- -------------- ---------------------------
98.0 26162924140 5 5232584828.0 5221783723 5242290976 Run signal.correlate2d
1.4 366266550 5 73253310.0 116566 365778317 Copy signal to GPU
0.6 156400519 5 31280103.8 504379 154166494 Run cusignal.correlate2d
0.0 1997599 5 399519.8 350902 591358 Create CPU signal
0.0 501849 5 100369.8 90972 136938 Create CPU filter
0.0 484525 5 96905.0 72400 163420 Copy filter to GPU
You can see that I ran each call 5 times. You should notice that the CPU calls are pretty consistent, while there's a swing in the GPU calls. This is because the GPU is warming up. You want to sample several runs, or drop the first few, to get a good number. Once the GPU is warmed up, and as long as no stochastic algorithms are being executed, the times will be consistent.
To get a better understanding, I highly suggest reviewing the output with the Nsight Systems GUI! https://devblogs.nvidia.com/transitioning-nsight-systems-nvidia-visual-profiler-nvprof/
I've attached the qdrep file from this example
From HEAD/python, run: pytest -v --benchmark-only -k correlate2d
I'm getting:
ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --benchmark-min-rounds=25 --benchmark-warmup=on --benchmark-warmup-iterations=10 --benchmark-disable-gc --benchmark-skip --benchmark-only
Running pytest -v from HEAD\ works fine, all tests pass.
Be sure to conda install or pip install pytest-benchmark.
@flytrex-vadim -- We've filed another feature request based on your questions. https://github.com/rapidsai/cusignal/issues/77
Thanks again for the great discussion.
Happy to help with some noob testing :)
You can always run our benchmark tests on your 1050 TI and let us know what you see.
Results attached: bench_correlate2d.txt
Are there any reference results for comparison and/or guidelines on how to interpret them?
Hi @flytrex-vadim, thanks for the benchmark. I found a bug today that causes the output to be sorted incorrectly; the accurate results are the medians. I'll be pushing a fix in the next few days.
Closing this issue; we certainly appreciate the discussion and the 2 feature requests generated from it!
I'm trying to do a simple benchmark of cuSignal vs scipy.signal on correlation. It seems that correlate2d completes immediately, and the buffer transfer takes 2.8 seconds. That makes no sense for a buffer size of 80 MB. Could it be that the correlation is evaluated lazily, only when the buffer transfer is requested?