project-asgard / asgard

MIT License
Bad integer for large problems with `--kron-mode sparse` #641

Closed stefan-schnake closed 11 months ago

stefan-schnake commented 11 months ago

Describe the bug: I believe an int overflow is causing higher-level 1x3v mixed-grid runs to fail when using `--kron-mode sparse`. This does not happen with `--kron-mode dense`.

Here's the result upon debugging: asgard: ~/asgard/src/distribution.hpp:269: double asgard::get_MB(int64_t) [with P = int; int64_t = long int]: Assertion `num_elems > 0' failed.

To reproduce: run the command ./asgard -p riemann_1x3v -d 3 -l "7 6 6 6" -x -t 2.3419e-4 -n 210 -m 7 --kron-mode sparse --wave_freq 7 --inner_it 10 --tol 1e-8 --mixed_grid_group 1
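The failed assertion `num_elems > 0` is what one would expect if an element count was computed in `int`, wrapped around, and only then passed on as `int64_t`. The thread does not show where the count is formed, so the following is only a minimal sketch of the usual remedy: widen to 64 bits before multiplying. The helpers `element_count`, `fits_in_int` and `megabytes` are hypothetical names for illustration and are not part of ASGarD.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not ASGarD's API): number of elements in a
// rows x cols block, computed in 64-bit arithmetic.
inline std::int64_t element_count(int rows, int cols)
{
  if (rows < 0 || cols < 0)
    throw std::invalid_argument("negative dimension");
  // Widen *before* multiplying. `rows * cols` on its own is evaluated in int
  // and is undefined behavior once the product exceeds INT_MAX.
  return static_cast<std::int64_t>(rows) * static_cast<std::int64_t>(cols);
}

// Hypothetical helper: true if a 64-bit count can still be used as an int index.
inline bool fits_in_int(std::int64_t n)
{
  return n >= 0 && n <= std::numeric_limits<int>::max();
}

// Hypothetical helper: size in MB of num_elems values of type P.
template<typename P>
double megabytes(std::int64_t num_elems)
{
  return static_cast<double>(num_elems) * sizeof(P) / (1024.0 * 1024.0);
}
```

With illustrative operands such as 7133184 and 384, the product is 2739142656, which exceeds the largest 32-bit `int` (2147483647); `element_count` returns it intact and `fits_in_int` reports that it cannot serve as an `int` index.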

quantumsteve commented 11 months ago

Running with CXXFLAGS=-fsanitize=undefined

$ ./asgard -p riemann_1x3v -d 3 -l "7 6 6 6" -x -t 2.3419e-4 -n 210 -m 7 --kron-mode sparse --inner_it 10 --tol 1e-8 --mixed_grid_group 1
Branch: develop
Commit Summary: 6d9cb87cd16c17c0ee53f5037e1f3a9bbdaec9b0
  Date: Fri Sep 22 09:22:33 2023 -0400
      Update OpenBLAS to v0.3.24
This executable was built on Wednesday, September 27 2023 at  3:25 pm
generating: pde...
ASGarD problem configuration:
  selected PDE: riemann_1x3v
  degree: 3
  N steps: 210
  write freq: 0
  realspace freq: 0
  implicit: 0
  full grid: 0
  CFL number: 0.01
  Poisson solve: 0
  starting levels: 7 6 6 6 
  max adaptivity levels: 7
--- begin setup ---
  generating: adaptive grid...
  degrees of freedom: 7133184
  generating: basis operator...
  basis operator allocation (MB): 0.0348816
  generating: dimension mass matrices...
  generating: initial conditions...
  degrees of freedom (post initial adapt): 7133184
  generating: coefficient matrices...
  generating: moment vectors...
--- begin time loop staging ---
--- begin time loop w/ dt 0.00023419 ---
 dim0 lev = 7
 dim1 lev = 6
 -- moment_mat map size = 7133184
384
 -- moment_mat map size = 7133184
384
 -- moment_mat map size = 7133184
384
 -- moment_mat map size = 7133184
384
 -- moment_mat map size = 7133184
768
 -- moment_mat map size = 7133184
768
 -- moment_mat map size = 7133184
768
/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:177:43: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:178:43: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:178:30: runtime error: shift exponent -1 is negative
/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:178:43: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:177:30: runtime error: shift exponent -1 is negative
/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:177:43: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:178:43: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'
/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:178:43: runtime error: signed integer overflow: -2147483648 - 1 cannot be represented in type 'int'

quantumsteve commented 11 months ago

Backtrace from gdb, showing L1 = 0:

768
asgard: /home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:173: bool asgard::check_connected(int, int, int, int): Assertion `L1 > 0 && L1 <= std::numeric_limits<int>::digits' failed.

Thread 1 "asgard" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007fffee087859 in __GI_abort () at abort.c:79
#2  0x00007fffee087729 in __assert_fail_base (fmt=0x7fffee21d588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5555567dc508 "L1 > 0 && L1 <= std::numeric_limits<int>::digits", 
    file=0x5555567dc3a0 "/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp", line=173, function=<optimized out>) at assert.c:92
#3  0x00007fffee098fd6 in __GI___assert_fail (assertion=0x5555567dc508 "L1 > 0 && L1 <= std::numeric_limits<int>::digits", file=0x5555567dc3a0 "/home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp", line=173, 
    function=0x5555567dc4c8 "bool asgard::check_connected(int, int, int, int)") at assert.c:101
#4  0x000055555605d84f in asgard::check_connected (L1=0, p1=0, L2=0, p2=0) at /home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:173
#5  0x000055555605de92 in asgard::check_connected (num_dimensions=4, row=0x7fffdc08c010, col=0x7fffdc08c010) at /home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:226
#6  0x0000555556061086 in asgard::compute_mem_usage<double> (pde=..., discretization=..., program_options=..., imex=asgard::imex_flag::imex_explicit, spcache=..., memory_limit_MB=0, index_limit=2147483646, force_sparse=false)
    at /home/svh/Documents/asgard/src/asgard_kronmult_matrix.cpp:854
#7  0x000055555601901c in asgard::matrix_list<double>::make (this=0x7fffffffcc70, entry=asgard::imex_explicit, pde=..., grid=..., opts=...) at /home/svh/Documents/asgard/src/asgard_kronmult_matrix.hpp:787
#8  0x000055555601f5f8 in asgard::matrix_list<double>::reset_coefficients (this=0x7fffffffcc70, entry=asgard::imex_explicit, pde=..., grid=..., opts=...) at /home/svh/Documents/asgard/src/asgard_kronmult_matrix.hpp:851
#9  0x0000555556009a6d in asgard::time_advance::imex_advance<double> (pde=..., operator_matrices=..., adaptive_grid=..., transformer=..., program_opts=..., unscaled_parts=..., f_0=..., x_prev=..., time=0.00023419000000000001, 
    solver=asgard::solve_opts::gmres, update_system=true) at /home/svh/Documents/asgard/src/time_advance.cpp:778
#10 0x0000555556003592 in asgard::time_advance::adaptive_advance<double> (step_method=asgard::time_advance::method::imex, pde=..., operator_matrices=..., adaptive_grid=..., transformer=..., program_opts=..., x_orig=..., 
    time=0.00023419000000000001, update_system=true) at /home/svh/Documents/asgard/src/time_advance.cpp:73
#11 0x0000555555d3eb18 in main (argc=22, argv=0x7fffffffd098) at /home/svh/Documents/asgard/src/main.cpp:270

mkstoyanov commented 11 months ago

Printing the indexes shows that we are getting an index of 31. I am now digging into the index formation to see what went wrong.

mkstoyanov commented 11 months ago

@quantumsteve I don't understand the trace. We count the levels from 0, even though we don't allow the user to work with a level less than 1. We need at least two levels to form a "wavelet" basis.

quantumsteve commented 11 months ago

If L1 <= 0 or L2 <= 0 the bit shift would be by a negative number.

https://github.com/project-asgard/asgard/blob/6d9cb87cd16c17c0ee53f5037e1f3a9bbdaec9b0/src/asgard_kronmult_matrix.cpp#L177
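In C++, shifting by a negative count is undefined behavior in its own right, separate from signed overflow, which is why the sanitizer prints both kinds of message. The code at the linked line is not reproduced in this thread, so the following is only a sketch of a guarded shift under the same bounds as the assertion in the backtrace (`L1 > 0 && L1 <= std::numeric_limits<int>::digits`). The names `pow2_level_minus_one` and `invalid_level` are hypothetical.

```cpp
#include <limits>

// Hypothetical sentinel for "this level cannot be shifted safely".
constexpr int invalid_level = -1;

// Hypothetical helper (not ASGarD's code): returns 2^(level - 1), or
// invalid_level when the shift would be undefined or would not fit in int.
inline int pow2_level_minus_one(int level)
{
  constexpr int digits = std::numeric_limits<int>::digits; // 31 for a 32-bit int
  // The shift count level - 1 must lie in [0, digits - 1]:
  //   level <= 0     -> negative shift count (undefined behavior)
  //   level > digits -> result does not fit in a signed int
  if (level < 1 || level > digits)
    return invalid_level;
  return 1 << (level - 1);
}
```

A caller that legitimately sees level 0, as described above for every grid, would need to handle that case before the shift instead of relying on the result of `1 << -1`.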

mkstoyanov commented 11 months ago

Is that considered an overflow? The case L1=0 and L2=0 exists in absolutely every grid we have ever used and tested.

mkstoyanov commented 11 months ago

BTW @stefan-schnake did you enable memory limits in CMake when you compiled this?

stefan-schnake commented 11 months ago

> BTW @stefan-schnake did you enable memory limits in CMake when you compiled this?

No, unless that's a default option in CMake.

mkstoyanov commented 11 months ago

There is a small bug that throws us off in the error message; I'll fix that tonight.

Nevertheless, I think we are likely to run out of GPU memory for this problem and we may have to enable the memory limits (which in turn may cancel all benefits of going sparse).

mkstoyanov commented 11 months ago

Resolved in #642