ginkgo-project / ginkgo

Numerical linear algebra software package
https://ginkgo-project.github.io/
BSD 3-Clause "New" or "Revised" License
401 stars 88 forks source link

BiCGStab OMP issue #1563

Closed blegouix closed 2 weeks ago

blegouix commented 6 months ago

Hello,

This issue is the continuation of #1561 with more details. We use BiCGStab to solve a quasi-tridiagonal pds problem with different executors (Serial, OMP and CUDA). We observe a numerical issue with OMP. In our test bench build upon those solvers (which involve more than just Ginkgo) we evaluate an error to 1e-16 with Serial or CUDA, but ~1e-10 with OMP even with OMP_NUM_THREADS=1.

The problem disappears just by replacing the OMP executor with a Serial one.

I link the matrix and the rhs which is problematic. Note that the size of the matrix has an impact (no problem with n=10 or n=100 ie.)

csr.txt rhs.txt

Please let me know if it is enough for you to reproduce the problem.

I tried with gcc and clang, with Ginkgo 1.7.0 and current develop branch.

Valgrind does not reveals anything and both cases (with serial or omp executor) "converge" in 3 iterations.

Issue from our side: https://github.com/CExA-project/ddc/issues/332 Problematic ginkgo apply call (all ginkgo stuff is in this file): https://github.com/CExA-project/ddc/blob/cc2942283213cc700b3bd8c09fb00346621997e3/include/ddc/kernels/splines/matrix_sparse.hpp#L207

(create_gko_exec<Kokkos::Serial>() is a gko::ReferenceExecutor::create(); while gko_exec is a gko::OmpExecutor::create();.)

Regards

MarcelKoch commented 6 months ago

From the code you linked, I could see that you are using a block Jacobi preconditioner, with block size 32 for OMP and reference. Have you tried using a block size of 1 and see if any differences appear? Nevertheless, it could still hint at some issue in our OMP implementation.

blegouix commented 6 months ago

It is indeed fixing the issue.

MarcelKoch commented 6 months ago

I've run our benchmarks with the provided matrices. If I used the same residual reduction as in your code 1e-19 I also get a difference in the final residual:

Perhaps just for clarification, by specifying only the reduction factor, the iteration will be stopped when ||r|| < eps ||b||, where b is the right-hand-side. In my experience using eps = 1e-19 is very small, and not reachable just with double precision. Maybe you want to use the absolut norm criteria ||r|| < eps, by adding .with_baseline(gko::stop::mode::absolute).

Interestingly, the difference disappears if mode::initial_resnorm is used, which corresponds to ||r|| < eps ||r_0||.

blegouix commented 6 months ago

Ok thank you, I indeed though I was using an absolute norm critera, whereas the "residual reduction" terminology was quite clear.

I let you close the issue depending if you think it would require an improvement from your side or not.

Thank you for the quick answer.

blegouix commented 6 months ago

@MarcelKoch how did you "run your benchmarks with the provided matrices" ? I see you have some benchmark using SuiteSparse but I see no way to use local data. Should I upload my matrix in SuiteSparse ? You did it already ?

MarcelKoch commented 6 months ago

No I didn't upload anything and it's not necessary to run our benchmarks.

Our benchmarks are only documented as part of reproducibility for some paper. How to use them individually is not really documented. To give you a broad summary:

First you need to define the input data in a json file. The json file has to consist of exactly one array, and within that array the test cases are defined. For example:

[
  {
    "filename": "path/to/your/matrix",
    "rhs": "path/to/your/rhs"
  },
  { ... }
]

The files have to be in matrix market format.

Some benchmarks require extra some extra fields. For example the solver benchmarks requires the field "optional": {"spmv": "format/such/as/csr"}.

Then you can call a benchmark and pass in the input via stdin, i.e.

solver < input.json

You can get the options that are available for a given benchmark by using --help.

The output of our benchmarks is again json, and it is printed to stdout, while our status messages are printed to stderr. So, you can store the output with

solver < input.json > output.json
blegouix commented 6 months ago

Great. It will be very usefull.

So, coming back to the current issue, stop using the preconditionner indeed solved the problem with the matrices I provided at first, but we still have some difficulties with OMP. So preconditionner was just part of the problem. I made a minimal reproducer for it:

https://github.com/blegouix/gko_omp_reprod/tree/main

Note that using the benchmarks on the matrices does not reproduces the problem, maybe because of the lower precision of numbers in the rhs.mtx file.

blegouix commented 6 months ago

@MarcelKoch sorry to ping you but can you confirm you have seen my message ? As I did not open a new issue and this one is quite old

upsj commented 6 months ago

If you apply the following patch and export your files using write_binary instead of write, you can store the rhs vector exactly

diff --git a/benchmark/solver/solver_common.hpp b/benchmark/solver/solver_common.hpp
index 19e718e08..541a9f662 100644
--- a/benchmark/solver/solver_common.hpp
+++ b/benchmark/solver/solver_common.hpp
@@ -288,7 +288,7 @@ struct SolverGenerator : DefaultSystemGenerator<> {
     {
         if (config.contains("rhs")) {
             std::ifstream rhs_fd{config["rhs"].get<std::string>()};
-            return gko::read<Vec>(rhs_fd, std::move(exec));
+            return gko::read_generic<Vec>(rhs_fd, std::move(exec));
         } else {
             gko::dim<2> vec_size{system_matrix->get_size()[0], FLAGS_nrhs};
             if (FLAGS_rhs_generation == "1") {
blegouix commented 6 months ago

Great. I can now reproduce the problem in your benchmark:

~/ginkgo/build/benchmark/solver$ ./solver -solvers bicgstab -executor omp < matrix.json                                                                                                                                                                                                                                       [28/1904]
This is Ginkgo 1.8.0 (develop)                                                            
    running with core module 1.8.0 (develop)                                              
    the reference module is  1.8.0 (develop)                                              
    the OpenMP    module is  1.8.0 (develop)                                              
    the CUDA      module is  1.8.0 (develop)                                              
    the HIP       module is  not compiled                                                 
    the DPCPP     module is  not compiled                                                 
Running on omp(0)                                                                         
Running with 2 warm iterations and 1 running iterations                                
The random seed for right hand sides is 42                                                
Running bicgstab with 1000 iterations and residual goal of 1.000000e-06                                                                                                              
The number of right hand sides is 1                                                       
Running test case /home/cart3sianbear/gko_omp_reprod/csr.mtx
Matrix is of size (10, 10)
        Running solver: bicgstab      
[                                      
    {                     
        "filename": "/home/cart3sianbear/gko_omp_reprod/csr.mtx",
        "rhs": "/home/cart3sianbear/gko_omp_reprod/build/rhs_binary",     
        "optimal": {                                                                      
            "spmv": "csr"                                                                 
        },                                                                                
        "solver": {                                                                       
            "bicgstab": {                                                                 
                "recurrent_residuals": [                                                  
                    0.14234837634095868,                                                  
                    0.14234837634095868   
                ],                                                                                                                                                                   
                "true_residuals": [                                                       
                    0.14234837634095868,                                                  
                    0.11108041181890674                                                   
                ],                                                                        
                "implicit_residuals": [                                                   
                    0.14234837634095868,                                                  
                    0.14234837634095868                                                   
                ],                                                                        
                "iteration_timestamps": [                                                 
                    0.000335548,                                                          
                    0.000767963                                                           
                ],                                                                        
                "rhs_norm": 2.23606797749979,
                "generate": {                                                             
                    "components": {                                                       
                        "generate(gko::solver::Bicgstab<double>::Factory)": 9.521e-06,
                        "generate(gko::matrix::IdentityFactory<double>)": 3.2340000000000003e-06,
                        "free": 4.942e-06,                                                
                        "overhead": 2.7524000000000003e-05
                    },                                                                    
                    "time": 2.0208e-05                                                    
                },                                                                        
                "apply": {                                                                
                    "components": {                                                       
                        "apply(gko::solver::Bicgstab<double>)": 1.929e-06,
                        "iteration": 0.000106693,
                        "allocate": 8.380000000000001e-06,
                        "dense::fill": 3.3087000000000004e-05,
                        "bicgstab::initialize": 1.4282000000000001e-05,
                        "advanced_apply(gko::matrix::Csr<double, int>)": 1.793e-05,
                        "csr::advanced_spmv": 1.8948000000000002e-05,
                        "dense::compute_norm2_dispatch": 0.00012472,
                        "free": 2.176e-06,                                                
                        "copy(gko::matrix::Dense<double>,gko::matrix::Dense<double>)": 1.3353e-05,
                        "dense::copy": 2.245e-05,
                        "dense::compute_conj_dot_dispatch": 0.000152076,
                        "check(gko::stop::Combined)": 1.4499000000000001e-05,
                        "check(gko::stop::ResidualNorm<double>)": 1.5558000000000002e-05,
                        "residual_norm::residual_norm": 8.1724e-05,
                        "check(gko::stop::Iteration)": 1.838e-06,
                        "bicgstab::step_1": 1.7100000000000002e-05,
                        "apply(gko::matrix::Identity<double>)": 6.95e-06,
                        "apply(gko::matrix::Csr<double, int>)": 6.9960000000000004e-06,
                        "csr::spmv": 1.6373e-05,
                        "bicgstab::step_2": 2.0842e-05,
                        "bicgstab::finalize": 1.4678e-05,
                        "overhead": 7.525500000000001e-05
                    },
                    "iterations": 0,
                    "time": 0.000532362
                },
                "preconditioner": {},
                "forward_error": 0.1359763191472606,
                "residual_norm": 0.12028732996140494,
                "repetitions": 1,
                "completed": true
            }
        },
        "rows": 10,
        "cols": 10
    }
]

I also have updated the reproducer to export the binary file.

We have noticed the problem looks absent with Gmres

blegouix commented 6 months ago

rhs_binary.txt csr.txt

MarcelKoch commented 6 months ago

Hi @blegouix, I just wanted to let you know that I'm able to reproduce your issue with the files you provided, but I might not find time this week to investigate this further.

blegouix commented 4 months ago

Hello, is there any progress on this issue ?

blegouix commented 2 months ago

Another example of the bug, this time with a 4x4 diagonal matrix. Works with OMP+CG, works with CUDA+BICGSTAB, do not works with OMP+BICGSTAB: mat.txt rhs.txt

It tends to converge but stops too early. Let think that the problem is in the stopping criterion ?

image

yhmtsai commented 3 weeks ago

Hi, @blegouix Thanks a lot for reporting this issue and sorry for late fix. I found one issue and fix it in https://github.com/ginkgo-project/ginkgo/pull/1676 Could you check whether it indeed solve the issue you face?

tpadioleau commented 3 weeks ago

Hi @yhmtsai I could test the branch and it seems to work correctly. Thanks a lot!

yhmtsai commented 2 weeks ago

@blegouix @tpadioleau Thank you! I have merged the fix to main branch now, so I close this issue. If you face another issue, please feel free to open an issue. Thanks again for finding this issue

This is solved by https://github.com/ginkgo-project/ginkgo/pull/1676