icl-utk-edu / heffte


possible issue in use of oneAPI DFT #16

Closed: mabraham closed this issue 1 year ago

mabraham commented 1 year ago

I'm working at Intel using HeFFTe for the distributed 3D-FFT used in GROMACS (see https://gitlab.com/gromacs/gromacs/-/blob/main/src/gromacs/fft/gpu_3dfft_heffte.cpp). HeFFTe's unit tests are failing on PVC platforms, even when using only a single rank and device. The failure mode is not uniformly reproducible, which naturally suggests that the device memory usage is inconsistent with the synchronization.

Since I was interested in single-precision R2C 3D-FFT, I focused on the failures I see in test_fft3d_r2c. Sometimes I see a nearly correct result but the first 20 reals in the output buffer are wrong. Since the box is {3,4,5} and 4*5==20, I think there is some wrong indexing going on. If I refactor heffte::onemkl_executor_r2c::forward() and ...backward() so that the q.wait() calls move into the for loop over blocks (see https://github.com/icl-utk-edu/heffte/blob/master/include/heffte_backend_oneapi.h#L502), then I observe correct results. The code then looks like

    //! \brief Forward transform, single precision.
    void forward(float const indata[], std::complex<float> outdata[], std::complex<float>*) const override{
        if (not init_splan) make_plan(splan);
        for(int i=0; i<blocks; i++)
        {
            oneapi::mkl::dft::compute_forward(splan, const_cast<float*>(indata + i * rblock_stride),
                                              reinterpret_cast<float*>(outdata + i * cblock_stride));
            q.wait(); // moved inside the loop: finish this block before launching the next
        }
    }

If those per-block calls to compute_forward etc. had distinct memory footprints, then the extra synchronization should not help. So I think the memory footprints of the blocks do overlap, perhaps for the same reason I see wrong results in the first 20 reals of the output buffer. Perhaps the blocks are running concurrently, and the previous block is over-running its bounds and affecting the calculation of the current block?

I can't see any obvious indexing errors, but I'm not a DFT expert. Do you see anything in the code that would work better? Is there anything you want me to test for you?

mkstoyanov commented 1 year ago

There is so much overlap between the heFFTe backends that a simple indexing issue is unlikely, but your conjecture about a race condition is probably correct.

Every FFT implementation requires additional scratch space; FFTW and cuFFT hide it inside the plan, while ROCm accepts an external workspace buffer. Since oneMKL does not take a buffer, I'm guessing it keeps the scratch inside the plan.

As a general rule, everything submitted to a SYCL queue may execute concurrently unless q.wait() is called manually or DAG dependencies are added explicitly. This is in contrast to a CUDA/HIP stream, where operations run in sequence. Thus, I think, the error comes from the scratch space inside the plan.
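
To make the queue semantics concrete, here is a minimal standalone sketch (not heFFTe code; the kernels and sizes are made up) showing the two ways to order work on an out-of-order SYCL queue: blocking the host with q.wait(), or passing the producing event as a dependency.

#include <sycl/sycl.hpp>

int main(){
    sycl::queue q; // out-of-order by default: submitted kernels may overlap

    float *a = sycl::malloc_device<float>(1024, q);
    float *b = sycl::malloc_device<float>(1024, q);

    // Kernel 1 produces a.
    sycl::event e1 = q.parallel_for(sycl::range<1>{1024},
                                    [=](sycl::id<1> i){ a[i] = 1.0f; });

    // Kernel 2 consumes a. Without the explicit dependency on e1 (or a
    // q.wait() in between) the runtime is free to run the two concurrently.
    sycl::event e2 = q.parallel_for(sycl::range<1>{1024}, e1,
                                    [=](sycl::id<1> i){ b[i] = a[i] + 1.0f; });

    e2.wait(); // or q.wait() to block the host on everything submitted so far

    sycl::free(a, q);
    sycl::free(b, q);
    return 0;
}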

I will do some testing and see if there is a better way to fix this than to move the q.wait(). I'm concerned about excessive synchronization.

On a quick note, can you please double-check if the problem happens when you don't move the wait() but you force use_reorder options, e.g., change lines 43 and 103 in test/test_fft3d_r2c.cpp to look like:

for(auto options : make_all_options<backend_tag>()){
    options.use_reorder = true;

When using reorder, the number of blocks is just 1.

mabraham commented 1 year ago

OK, I'll try that tomorrow. Also, making a plan per block would avoid racing on the contents of the scratch/workspace buffer.

mkstoyanov commented 1 year ago

The number of blocks is roughly N / num_ranks, which can be a large number, each with its associated memory overhead. A plan per block will not solve this; we would need a mechanism to control the number of plans, hold them to a moderate number, and then map blocks to plans ...

It's doable, but not trivial.
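
For illustration only, a rough sketch of the kind of block-to-plan mapping described above; dft_plan and make_block_plan() are hypothetical stand-ins, not heFFTe types, and the hard part (serializing blocks that share a plan) is only noted in the comments.

#include <vector>

struct dft_plan { /* backend plan handle plus its internal scratch space */ };
inline dft_plan make_block_plan(){ return dft_plan{}; } // placeholder for a real single-block plan

struct plan_pool {
    explicit plan_pool(int max_plans){
        plans.reserve(max_plans);
        for(int i = 0; i < max_plans; i++)
            plans.push_back(make_block_plan());
    }

    // Round-robin mapping of blocks to a bounded set of plans. Blocks that
    // share a plan also share its scratch, so they must still be chained or
    // waited on; only blocks mapped to different plans may run concurrently.
    dft_plan& plan_for_block(int block){
        return plans[block % static_cast<int>(plans.size())];
    }

    std::vector<dft_plan> plans;
};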

mabraham commented 1 year ago

On a quick note, can you please double-check if the problem happens when you don't move the wait() but you force use_reorder options, e.g., change lines 43 and 103 in test/test_fft3d_r2c.cpp to look like:

for(auto options : make_all_options<backend_tag>()){
    options.use_reorder = true;

When using reorder, the number of blocks is just 1.

Doing that (plus the same at line 170) seems to work. The test_fft3d_r2c_np8 case did fail one time, but passed many many times, so perhaps it was just some transient in the stack.

mabraham commented 1 year ago


But as soon as I post I see

18: --------------------------------------------------------------------------------
18:                              heffte::fft_r2c class
18: --------------------------------------------------------------------------------
18:
18:      float              -np 6  test heffte::fft3d_r2c<stock>              pass
18:     double              -np 6  test heffte::fft3d_r2c<stock>              pass
18:      float                -np 6  test heffte::fft3d_r2c<mkl>              pass
18:     double                -np 6  test heffte::fft3d_r2c<mkl>              pass
18:      float             -np 6  test heffte::fft3d_r2c<onemkl>              pass
18:  error magnitude: 3.14182
18: terminate called after throwing an instance of 'std::runtime_error'
18:   what():  mpi rank = 3  test -np 6  test heffte::fft3d_r2c in file: /home/markjame/bugs/IMPI-HeFFTe-performance/heffte/test/test_fft3d_r2c.cpp line: 144
mkstoyanov commented 1 year ago

Well, there's some improvement and we made it from n problems to n-1.

I will try to hunt the other problem today.

mabraham commented 1 year ago

One solution proposed in #18.

The number of blocks is roughly N / num_ranks, which can be a large number, each with its associated memory overhead. A plan per block will not solve this; we would need a mechanism to control the number of plans, hold them to a moderate number, and then map blocks to plans ...

It's doable, but not trivial.

In that case, using user-provided oneMKL workspaces seems like a possible approach: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2023-0/user-allocated-workspaces.html. Then we could have a single plan and one workspace per block, and attach the right workspace before launching each block's computation.
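
To make that concrete, a hedged sketch of the single-plan, workspace-per-block idea. It assumes the API described on the linked page (a WORKSPACE_EXTERNAL placement set before commit, a WORKSPACE_EXTERNAL_BYTES query, and descriptor::set_workspace); whether swapping the workspace while earlier blocks are still in flight is actually permitted is exactly the kind of detail that would need checking. Variable names follow the executor code quoted earlier.

#include <sycl/sycl.hpp>
#include <oneapi/mkl/dfti.hpp>
#include <complex>
#include <cstdint>
#include <vector>

using rplan_t = oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE,
                                             oneapi::mkl::dft::domain::REAL>;

// `plan` is assumed to be committed with WORKSPACE_PLACEMENT = WORKSPACE_EXTERNAL.
void forward_blocks(sycl::queue &q, rplan_t &plan, int blocks,
                    float const indata[], std::complex<float> outdata[],
                    int rblock_stride, int cblock_stride){
    std::int64_t ws_bytes = 0;
    plan.get_value(oneapi::mkl::dft::config_param::WORKSPACE_EXTERNAL_BYTES, &ws_bytes);

    // One workspace per block, so concurrent blocks do not share scratch.
    std::vector<float*> workspaces(blocks);
    for(auto &w : workspaces)
        w = sycl::malloc_device<float>(ws_bytes / sizeof(float), q);

    for(int i = 0; i < blocks; i++){
        plan.set_workspace(workspaces[i]); // attach this block's scratch
        oneapi::mkl::dft::compute_forward(plan,
            const_cast<float*>(indata + i * rblock_stride),
            reinterpret_cast<float*>(outdata + i * cblock_stride));
    }
    q.wait(); // single host synchronization after the last block

    for(auto w : workspaces) sycl::free(w, q);
}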

Several fixes for this issue are possible:

  1. The simplest solution is to append .wait() to the compute_* calls to wait on the returned event, but that requires the runtime to synchronize with the host after each call, before launching the next compute call.

  2. It's more efficient to avoid the host sync by chaining the compute calls via the returned events (see #18, and the sketch after this list). This approach probably works similarly to how the CUDA and ROCm ports behave in practice. The compute kernels still cannot overlap with the tail of the previous kernel. It's also annoying to have to maintain and use a vector of events just to express a dependency on a single event.

  3. Potentially faster still is to use the workspace-per-block approach sketched above and continue using q.wait(). That requires a lot more memory, so maybe you'd only want to do it for small N, where the time in the compute kernel is comparable to the time spent in kernel tails and launches, and where the memory cost is bearable.
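
For reference, a minimal sketch of what the chaining in option 2 could look like inside the executor's forward(), mirroring the code quoted earlier (this is only the general idea, not necessarily what #18 does). It assumes the USM compute_forward overload that returns a sycl::event and accepts a dependency vector.

    //! \brief Forward transform, single precision, with per-block event chaining.
    void forward(float const indata[], std::complex<float> outdata[], std::complex<float>*) const override{
        if (not init_splan) make_plan(splan);
        sycl::event e; // a default-constructed event is already complete
        for(int i=0; i<blocks; i++)
        {
            // block i waits on block i-1, so blocks never overlap on the
            // plan's scratch, but the host is only involved in the final wait
            e = oneapi::mkl::dft::compute_forward(splan,
                    const_cast<float*>(indata + i * rblock_stride),
                    reinterpret_cast<float*>(outdata + i * cblock_stride),
                    std::vector<sycl::event>{e});
        }
        e.wait();
    }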

mkstoyanov commented 1 year ago
  1. The compute call does not return to the user; it goes into a reshape operation, which starts with packing followed by MPI calls. Packing operates on a sub-section of the data buffers (not the workspace), and most pack operations take chunks of most blocks. The opportunity for overlap is near zero, especially considering the relative cost of compute (3%) vs. MPI (97%). Even if I allow for partial overlap between compute and packing, the MPI calls are executed on a queue that is internal to MPI, and I cannot pass in SYCL dependencies or sync with that queue.

  2. Chaining the block calls is the correct thing to do under the circumstances. The approach is annoying but your solution is not too bad. I would remove the variadics and the resize of the event vector (resize in the constructor only), then make the events vector public.

  3. The workspace is assigned during the construction of the plan, and currently heFFTe can work with a different workspace vector on each call. There is a potential conflict there in some use cases, it would need more API to specify concurrency, and in the end it would affect less than 3% of the problem. I will not do it for now, unless someone convinces me that it will make a difference.

mabraham commented 1 year ago
  1. Indeed. I was referring to the need to synchronize with the host between calls to compute on successive blocks. One could in principle chain the event dependencies from compute to the packing operations (that's an optimization I will explore soon for my small-N use case, so that kernel launch overheads overlap with computation). One hope for the future is that Intel MPI will be able to do device-device MPI-3 RMA calls from within compute kernels. That may be interesting for you when the time comes.
  2. Indeed
  3. Yeah, the complexity for API and implementation seems unlikely to be worth the cost. But I might try it some day.
mkstoyanov commented 1 year ago

Synchronizing the FFT compute with packing will be very tricky, since you'll need to read the packing plan to make sure that all the FFT data has been populated (i.e., the FFT has been computed) for the indices that the pack plan will need.

Calling MPI from within the pack kernel may be interesting ... however, be careful not to waste your time chasing dead-end rabbit holes.

For small FFTs, the dominant cost is in network latency when initiating the MPI calls. Before anything else, you should look into ways to reduce the latency:

  1. See if you can batch multiple FFT operations into one; that pushes the problem from latency bound to communication bound, which is much better.
  2. Try disabling GPU-aware MPI; you can do that with options.use_gpu_aware = false; (see the sketch after this list). It may be counterintuitive, but we have seen GPU devices with much higher MPI latency, where moving the data back to the CPU and using the lower CPU latency was in fact faster, especially for small N.
  3. Try looking at the sub-comm option; it is more beneficial in the brick-to-brick format and less so in the pencil-to-pencil format used by GROMACS, but it can still make a difference. The sub-comm option moves the data to a subset of the total ranks and performs the intermediate FFTs there, which can reduce communication and latency.
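
For example, a minimal sketch of forcing such options when constructing the heFFTe plan; the boxes are single-rank and the numbers are made up, only the use_reorder and use_gpu_aware fields are taken from this thread, and the sub-comm API should be checked against the heFFTe documentation.

#include <mpi.h>
#include <heffte.h>

void make_onemkl_plan(MPI_Comm comm){
    // a single-rank {3,4,5} real box, r2c-compressed along dimension 2
    heffte::box3d<> inbox({0, 0, 0}, {2, 3, 4});
    heffte::box3d<> outbox({0, 0, 0}, {2, 3, 2});

    heffte::plan_options options = heffte::default_options<heffte::backend::onemkl>();
    options.use_reorder   = true;  // single-block kernels, as tested above
    options.use_gpu_aware = false; // stage MPI buffers through the CPU (point 2)

    heffte::fft3d_r2c<heffte::backend::onemkl> fft(inbox, outbox, 2, comm, options);
    (void) fft; // use fft.forward(...) / fft.backward(...) as usual
}
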
mabraham commented 1 year ago

Thanks for the advice!