NaN Residuals with CUDA GMRES+ParILUT

iontcheva commented 11 months ago

Hello,

I have been testing ParILUT with GMRES with linear systems extracted from my application.

I am seeing NaN residuals with the CUDA exec on H100. The versions of the exact same code but with OMP and Reference exec converge in 14 GMRES iterations without issues.

The matrix of the linear system is in the 500K range and due to the limitations for the size of files that can be uploaded I have split it into 7 parts. cis.mtx.gz.part-aa cis.mtx.gz.part-ab cis.mtx.gz.part-ac cis.mtx.gz.part-ad cis.mtx.gz.part-ae cis.mtx.gz.part-af cis.mtx.gz.part-ag

To merge these files into the actual matrix you can use : cat cis.mtx.gz.part* > cis.mtx.gz

The rhs file is relatively small: cis_rhs.mtx.gz

I will upload the files in a few submissions that follow.

See the snapshots showing the NaN with the CUDA exec and the runs with omp and reference exec. CUDA_exec_1 CUDA_exec_2 OMP_exec Reference_exec

iontcheva commented 11 months ago

Part 1 of the matrix cis.mtx.gz.part-aa.gz

iontcheva commented 11 months ago

I have been debugging the issue and tracked it down to : ginkgo/common/cuda_hip/factorization/par_ilut_spgeam_kernels.hpp.inc

line 131

lu_cur_val is NaN or INF

I am using a slightly modified version of ilu-preconditioned-solver-example.cpp for my testing.

These are the parameters that I am using: auto par_ilu_fact = gko::factorization::ParIlut<ValueType, IndexType>::build() .with_iterations(10u) .with_fill_in_limit(2.0) .on(exec);

const RealValueType reduction_factor{1e-7};
auto ilu_gmres_factory =
  gmres::build()
        .with_criteria(gko::stop::Iteration::build().with_max_iters(100u),
                       gko::stop::ResidualNorm<ValueType>::build()
                           .with_reduction_factor(reduction_factor))
        .with_generated_preconditioner(ilu_preconditioner)
        .on(exec);

Please let me know whether you can reproduce the issue.

MarcelKoch commented 11 months ago

Hi @iontcheva, the ParILUT can be quite unstable, especially across different executors, since it only computes an approximation of the ILU factorization. How close the approximation is usually depends on the .with_iterations parameter. So, as a first step, maybe try to increase that parameter.

upsj commented 11 months ago

I think what's likely happening here is that while we guard against NaNs/Infs in our asynchronous sweep, they may still come up in the other operations, e.g. from an overflowing value in SpGEMM (which is where the lu_val comes from). Without looking at the specific problem, I'm not sure we can do much about this, the preconditioner may just not work on certain problems.

iontcheva commented 11 months ago

Hi MarcelKoch, upsj,

I did try many options for .with_iterations - including the value 20 which is quite high but the issue is not resolved.

I think there is a bug in the implementation of the CUDA version.

The OMP and Reference versions work perfectly fine as I have shown above so I think the ParILUT algorithm is not the problem.

Regarding the comment from upsj - I have a check if (!is_finite(lu_val) || is_nan(lu_val)) in tri_spgeam_init after line 118 in ginkgo/common/cuda_hip/factorization/par_ilut_spgeam_kernels.hpp.inc auto lu_val = checked_load(lu_vals, lu_begin + lane, lu_end, zero());

and it does not seem to get triggered which I think should mean that the values computed in the CuSparse SpGEMM should be fine.

What gets triggered is a similar check on the value lu_cur_val after line 131.

Did you manage to assemble the matrix that I sent and try the specific example? I do not think without looking at the specific example one can say much anyway.

If you can reproduce the issue though on your side and resolve it I think it would make Ginkgo CUDA much more useful for many applications.

Now one cannot solve anything harder (and all the cases in real applications are of that kind) with a simple ILU(0) type of preconditioner like ParILU one needs more advanced preconditioners like ParILUT.

I am not sure whether this is relevant but just an observation :

After you did the fix with the atomic load_relaxed and store_relaxed in the sweep a few weeks ago I was able to get come of my smaller examples work with Ginkgo CUDA GMRES+ParILUT which was not the case before - I was getting NaNs on all of my examples.