Closed mabraham closed 1 year ago
There is so much overlap between heFFTe backends that a simple indexing issue is unlikely, but your conjecture of a race condition is probably correct.
Every FFT implementation requires additional scratch space: FFTW and cuFFT hide it inside the plan, while ROCm accepts an external workspace buffer. Since oneMKL does not take a buffer, I'm guessing it uses scratch inside the plan.
As a general rule, everything submitted to a SYCL queue will be executed concurrently until q.wait() is called manually or DAG dependencies are added explicitly. This is in contrast to a CUDA/HIP stream, where operations run in sequence. Thus, I think, the error comes from the scratch space inside the plan.
I will do some testing and see if there is a better way to fix this than to move the q.wait(). I'm concerned about excessive synchronization.
On a quick note, can you please double-check whether the problem happens when you don't move the wait() but you force the use_reorder option, e.g., change lines 43 and 103 in test/test_fft3d_r2c.cpp to look like:
for(auto options : make_all_options<backend_tag>()){
    options.use_reorder = true;
When using reorder, the number of blocks is just 1.
OK, I'll try that tomorrow. Also, making a plan per block would avoid racing on the contents of the scratch/workspace buffer.
The number of blocks is roughly N / num_ranks, which can be a large number, all with associated memory overhead. A plan per block will not solve this; we would need a mechanism to control the number of plans, hold them to a moderate number, and then map blocks to plans ...
It's doable, but not trivial.
On a quick note, can you please double-check whether the problem happens when you don't move the wait() but you force the use_reorder option, e.g., change lines 43 and 103 in test/test_fft3d_r2c.cpp to look like: for(auto options : make_all_options<backend_tag>()){ options.use_reorder = true;
When using reorder, the number of blocks is just 1.
Doing that (plus the same at line 170) seems to work. The test_fft3d_r2c_np8 case did fail one time but passed many, many times, so perhaps it was just some transient in the stack.
But as soon as I post, I see:
18: --------------------------------------------------------------------------------
18: heffte::fft_r2c class
18: --------------------------------------------------------------------------------
18:
18: float -np 6 test heffte::fft3d_r2c<stock> pass
18: double -np 6 test heffte::fft3d_r2c<stock> pass
18: float -np 6 test heffte::fft3d_r2c<mkl> pass
18: double -np 6 test heffte::fft3d_r2c<mkl> pass
18: float -np 6 test heffte::fft3d_r2c<onemkl> pass
18: error magnitude: 3.14182
18: terminate called after throwing an instance of 'std::runtime_error'
18: what(): mpi rank = 3 test -np 6 test heffte::fft3d_r2c in file: /home/markjame/bugs/IMPI-HeFFTe-performance/heffte/test/test_fft3d_r2c.cpp line: 144
Well, there's some improvement: we made it from n problems to n-1.
I will try to hunt down the other problem today.
One solution proposed in #18.
The number of blocks is roughly N / num_ranks, which can be a large number, all with associated memory overhead. A plan per block will not solve this; we would need a mechanism to control the number of plans, hold them to a moderate number, and then map blocks to plans ...
It's doable, but not trivial.
In that case, using user-provided oneMKL workspaces seems like a possible approach: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2023-0/user-allocated-workspaces.html. Then we can have a single plan, one workspace per block, and attach the right workspace before launching the computation.
Several fixes for this issue are possible:
The simplest solution is to append .wait() to the compute_* calls, waiting on the returned event, but that requires the runtime to synchronize with the host after each call, before launching the next compute call.
It's more efficient to avoid the host sync by chaining the compute calls via the events (see #18). This approach probably works similarly to how the CUDA and ROCm ports behave in practice. The compute kernels still cannot overlap with the tail of the previous kernel, and it's also annoying to maintain a vector of events just to express a dependency on a single event.
Potentially faster still is to use the workspace-per-block solution above and continue using q.wait(). That requires a lot more memory, so maybe you'd only want to do that for small N, where the time in the compute kernel is comparable to the time spent in kernel tails and launches, and the memory cost is bearable.
The compute call does not return to the user; it goes into a reshape operation, which starts with packing, followed by MPI calls. Packing operates on a sub-section of the data (not workspace) buffers, and most pack operations take chunks of most blocks. The opportunity for overlap is near zero, especially considering the relative cost of compute (3%) vs. MPI (97%). Even if I allow for partial overlap between compute and packing, the MPI calls are executed on a queue that is internal to MPI, and I cannot pass in SYCL dependencies or sync with that queue.
Chaining the block calls is the correct thing to do under the circumstances. The approach is annoying, but your solution is not too bad. I would remove the variadics and the resize of the event vector (resize in the constructor only), then make the events vector public.
The workspace is assigned during the construction of the plan, and currently heFFTe can work on different workspace vectors for each call. There is a potential conflict there in some use cases, it would require more API to specify concurrency, and in the end it will affect less than 3% of the problem. I will not do it for now, unless someone convinces me that it will make a difference.
Synchronizing FFT compute with packing will be very tricky since you'll need to read the packing plan to make sure that all the FFT data has been populated (FFT is computed) for the indexes that the pack-plan will need.
Calling MPI from within the pack kernel may be interesting ... however, be careful not to waste your time chasing dead-end rabbit holes.
For small FFTs, the dominant cost is in network latency when initiating the MPI calls. Before anything else, you should look into ways to reduce the latency:
options.use_gpu_aware = false;
It may be counterintuitive, but we have seen GPU devices that have much higher MPI latency, and moving data back to the CPU to use its lower latency was in fact faster, especially for small N.
Thanks for the advice!
I'm working at Intel using HeFFTe for the distributed 3D-FFT used in GROMACS (see https://gitlab.com/gromacs/gromacs/-/blob/main/src/gromacs/fft/gpu_3dfft_heffte.cpp). HeFFTe's unit tests are failing on PVC platforms, even when using only a single rank and device. The failure mode is not uniformly reproducible, which naturally suggests that the device memory usage is inconsistent with the synchronization.
Since I was interested in single-precision R2C 3D-FFT, I focused on the failures I see in test_fft3d_r2c. Sometimes I see a nearly correct result, but the first 20 reals in the output buffer are wrong. Since the box is {3,4,5} and 4*5 == 20, I think there is some wrong indexing going on. If I refactor heffte::onemkl_executor_r2c::forward() and ...backward() so that the q.wait() calls move into the for loop over blocks (see https://github.com/icl-utk-edu/heffte/blob/master/include/heffte_backend_oneapi.h#L502), then I observe correct results; the code then simply waits on the queue after each block's compute call.
If those per-block calls to compute_forward etc. have distinct memory footprints, then the extra synchronization should not help. So I think the memory footprints for each block do overlap, perhaps for the same reason I see wrong results in the first 20 reals of the output buffer. Perhaps blocks are running concurrently and the previous block is over-running its bounds, affecting the calculation of the current block?
I can't see any obvious indexing errors, but I'm not a DFT expert. Do you see anything in the code that would work better? Is there anything you want me to test for you?
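The refactor described in this report (moving the q.wait() into the per-block loop of the oneMKL executor) might look roughly like the following pseudocode. The block_count and block_stride names are illustrative, not actual heFFTe members; see the linked heffte_backend_oneapi.h for the real loop.

```
// sketch (pseudocode): onemkl executor forward(), wait moved inside the loop
void forward(std::complex<float> data[]) const {
    for (size_t i = 0; i < block_count; i++) {
        compute_forward(q, data + i * block_stride);  // enqueue this block's FFT
        q.wait();  // was previously called once, after the loop
    }
}
```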