viennacl / viennacl-dev

Developer repository for ViennaCL. Visit http://viennacl.sourceforge.net/ for the latest releases.

SEGFAULT when trying to release compressed matrix from memory #249

Closed qalshidi closed 6 years ago

qalshidi commented 6 years ago

There seems to be an error when releasing memory held by a shared pointer used in the library code itself. This happens when the code block ends, at which point I think the shared pointer's destructor runs. Here is the gdb output:

Thread 1 "steadymhd.exec" received signal SIGSEGV, Segmentation fault.
_int_free (av=0x7ffff6ee6b20 <main_arena>, p=0x4ad58f0, have_lock=0) at malloc.c:3984
3984 malloc.c: No such file or directory.
(gdb) backtrace
#0 _int_free (av=0x7ffff6ee6b20 <main_arena>, p=0x4ad58f0, have_lock=0) at malloc.c:3984
#1 0x00007ffff6ba653c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#2 0x0000000000413c5d in viennacl::tools::shared_ptr<viennacl::device_specific::symbolic_binder>::dec (this=<optimized out>) at /home/qusai/include/viennacl/tools/shared_ptr.hpp:167
#3 viennacl::tools::shared_ptr<char>::~shared_ptr (this=0x7fffc0781098, __in_chrg=<optimized out>) at /home/qusai/include/viennacl/tools/shared_ptr.hpp:120
#4 viennacl::backend::mem_handle::~mem_handle (this=0x7fffc0781090, __in_chrg=<optimized out>) at /home/qusai/include/viennacl/backend/mem_handle.hpp:89
#5 viennacl::compressed_matrix<double, 1u>::~compressed_matrix (this=0x7fffc0781010, __in_chrg=<optimized out>) at /home/qusai/include/viennacl/compressed_matrix.hpp:612
#6 std::_Destroy<viennacl::compressed_matrix<double, 1u> > (__pointer=<optimized out>) at /usr/include/c++/5/bits/stl_construct.h:93
#7 std::_Destroy_aux<false>::__destroy<viennacl::compressed_matrix<double, 1u>*> (__last=<optimized out>, __first=0x7fffc0781010) at /usr/include/c++/5/bits/stl_construct.h:103
#8 std::_Destroy<viennacl::compressed_matrix<double, 1u>*> (__last=<optimized out>, __first=<optimized out>) at /usr/include/c++/5/bits/stl_construct.h:126
#9 std::_Destroy<viennacl::compressed_matrix<double, 1u>*, viennacl::compressed_matrix<double, 1u> > (__last=0x7fffc07811d0, __first=<optimized out>) at /usr/include/c++/5/bits/stl_construct.h:151
#10 std::vector<viennacl::compressed_matrix<double, 1u>, std::allocator<viennacl::compressed_matrix<double, 1u> > >::~vector (this=0x7fffffffc800, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/stl_vector.h:424
#11 viennacl::linalg::block_ilu_precond<viennacl::compressed_matrix<double, 1u>, viennacl::linalg::ilu0_tag>::~block_ilu_precond (this=0x7fffffffc5a0, __in_chrg=<optimized out>) at /home/qusai/include/viennacl/linalg/detail/ilu/block_ilu.hpp:281
#12 0x000000000040909e in main (argc=<optimized out>, argv=<optimized out>) at /home/qusai/code/steadymhd/main.cpp:249

Note that this doesn't happen when I start my simulation from the beginning, only when I resume it (strange behavior). I don't think it's my own code, as the gdb trace doesn't point there and I tried changing a bunch of things. I am also not using pointers of any kind myself.

qalshidi commented 6 years ago

Running ilu0_precond instead of block_ilu, without the level scheduler, seems to work, but ONLY inside gdb. Outside gdb it segfaults. This is a very weird error.

qalshidi commented 6 years ago

With the above change, compiling with -O2 instead of -O3 worked. Do you know how this issue could arise?

robinchrist commented 6 years ago

Which gcc version did you use?

qalshidi commented 6 years ago

$ g++ --version
g++ (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

qalshidi commented 6 years ago

Just tried with g++-7, same issue.

Thread 1 "steadymhd.exec" received signal SIGSEGV, Segmentation fault.
_int_free (av=0x7ffff6ed0b20 <main_arena>, p=0x4ac4cf0, have_lock=0) at malloc.c:3984
3984 malloc.c: No such file or directory.
(gdb) backtrace
#0 _int_free (av=0x7ffff6ed0b20 <main_arena>, p=0x4ac4cf0, have_lock=0) at malloc.c:3984
#1 0x00007ffff6b9053c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#2 0x0000000000410559 in viennacl::tools::shared_ptr<char>::dec (this=0x7fffbb6e69b8) at /home/qusai/include/viennacl/tools/shared_ptr.hpp:167
#3 viennacl::tools::shared_ptr<char>::~shared_ptr (this=0x7fffbb6e69b8, __in_chrg=<optimized out>) at /home/qusai/include/viennacl/tools/shared_ptr.hpp:120
#4 viennacl::backend::mem_handle::~mem_handle (this=0x7fffbb6e69b0, __in_chrg=<optimized out>) at /home/qusai/include/viennacl/backend/mem_handle.hpp:89
#5 viennacl::compressed_matrix<double, 1u>::~compressed_matrix (this=0x7fffbb6e6900, __in_chrg=<optimized out>) at /home/qusai/include/viennacl/compressed_matrix.hpp:621
#6 std::_Destroy<viennacl::compressed_matrix<double, 1u> > (__pointer=<optimized out>) at /usr/include/c++/7/bits/stl_construct.h:98
#7 std::_Destroy_aux<false>::__destroy<viennacl::compressed_matrix<double, 1u>*> (__last=<optimized out>, __first=0x7fffbb6e6900) at /usr/include/c++/7/bits/stl_construct.h:108
#8 std::_Destroy<viennacl::compressed_matrix<double, 1u>*> (__last=<optimized out>, __first=<optimized out>) at /usr/include/c++/7/bits/stl_construct.h:132
#9 std::_Destroy<viennacl::compressed_matrix<double, 1u>*, viennacl::compressed_matrix<double, 1u> > (__last=0x7fffbb6e6ac0, __first=<optimized out>) at /usr/include/c++/7/bits/stl_construct.h:196
#10 std::vector<viennacl::compressed_matrix<double, 1u>, std::allocator<viennacl::compressed_matrix<double, 1u> > >::~vector (this=0x7fffffffcd40, __in_chrg=<optimized out>) at /usr/include/c++/7/bits/stl_vector.h:434
#11 viennacl::linalg::block_ilu_precond<viennacl::compressed_matrix<double, 1u>, viennacl::linalg::ilu0_tag>::~block_ilu_precond (this=0x7fffffffcae0, __in_chrg=<optimized out>) at /home/qusai/include/viennacl/linalg/detail/ilu/block_ilu.hpp:281
#12 0x0000000000407643 in main (argc=<optimized out>, argv=<optimized out>) at /home/qusai/code/steadymhd/main.cpp:248

My earlier observation about -O2 was wrong; it seems to work only sometimes.

karlrupp commented 6 years ago

Hi, could you please verify that your code is valgrind clean? Whenever I've seen such weird things happening in the past, it was due to memory corruption.

robinchrist commented 6 years ago

Can you provide a minimal, reproducible example, @qalshidi?

qalshidi commented 6 years ago

Valgrind output is here. There seems to be a lot of noise, but no leaks. The run also got further than a regular run, just like the weird behavior in gdb.

https://pastebin.ca/3944688

I'm unsure how to reproduce this, which is why it is so strange. In the case where it crashes I do a bunch of Armadillo vector resizing, which the regular run does not do; I don't know whether that is causing the memory issue. But again, gdb clearly shows the error comes from the shared_ptr implementation in ViennaCL itself. Is there any reason why C++11 shared_ptr isn't used now that it is widely available?

qalshidi commented 6 years ago

So I restarted the run with better valgrind options:

` valgrind --log-file="val.out" --leak-check=full --show-leak-kinds=all steadymhd.exec square-bottom2 -max_time_steps 300000 -save_every 1000 -use_input_sol
===Begin Solar Chromosphere MHD Simulation===
Device Info:
Name: Tesla P100-PCIE-12GB
Vendor: NVIDIA Corporation
Type: GPU
Available: 1
Max Compute Units: 56
Max Work Group Size: 1024
Global Mem Size: 12786073600
Local Mem Size: 49152
Local Mem Type: 1
Host Unified Memory: 0
OMP Max threads: 32
OMP Devices: 0/0
Parameters:
Grid size: 500x250
n_0: 1e+20 m^-3 | r_0: 2.1e+06 m | t_0: 9.62771 s | V_0: 218120 m/s | B_0: 1 kG | T_0: 5.76377e+06 K | p_0: 7957.75 Pa
Width: 1.5e+07 m | Height: 2.1e+06 m | dx: 60000 m | dz: 4200 m | ds: 4200 m
Start dt: 1e-08 t_0
Initiating vectors and matrices ...
Compiling kernels ... built.
Sum BE: 3184.38 | Sum KE: 0 | Sum UE: 38081.2
E: 41265.6

300001: GMRES(1,0) | Build Time:25.0085s | Sol Time:5.75029s
Avg divB: 0.0347906
dt: 2.4041e-06 t_0 | Time: 9.62771e-08s | Max V: 83.1911 V_0
Sum BE: 75682.5 | Sum KE: 21381.4 | Sum UE: 38077.9
E: 135142

300002: GMRES(1,0) | Build Time:21.1272s | Sol Time:5.61742s
Avg divB: 0.0336784
dt: 2.77658e-06 t_0 | Time: 2.32423e-05s | Max V: 72.0311 V_0
Sum BE: 68938.3 | Sum KE: 19062.6 | Sum UE: 38092.3
E: 126093

300003:Killed `

And this is valgrind's output. I killed the run because it got past the point where it usually segfaults. https://www.dropbox.com/s/yqcva64roujifq8/val.out?dl=0

karlrupp commented 6 years ago

@qalshidi Thanks for the check. The output looks indeed okay.

Is there any chance that you can provide a small example so that we can reproduce the problem?

> Is there any reason why C++11 shared_ptr isn't used now that it is widely available?

Backwards-compatibility with C++03.
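
For readers wondering what a hand-written C++03 shared pointer involves, here is a minimal sketch of a reference-counted pointer. It is an illustration only and not ViennaCL's actual `viennacl::tools::shared_ptr`; the names `counted_ptr` and `tracked` are hypothetical, and `inc`/`dec` are named after the `dec` frame in the backtrace above. It shows that the destructor calls `dec()`, which frees the memory once the last owner goes away. If the heap was already corrupted elsewhere, that `delete` is often where the crash finally surfaces.

```cpp
#include <cstddef>

// Hypothetical illustration only; NOT ViennaCL's shared_ptr implementation.
template <typename T>
class counted_ptr {
public:
    explicit counted_ptr(T* p = 0)
        : ptr_(p), count_(p ? new std::size_t(1) : 0) {}

    counted_ptr(const counted_ptr& other)
        : ptr_(other.ptr_), count_(other.count_) { inc(); }

    counted_ptr& operator=(const counted_ptr& other) {
        if (this != &other) {
            dec();                 // drop our current reference first
            ptr_ = other.ptr_;
            count_ = other.count_;
            inc();                 // then share ownership with 'other'
        }
        return *this;
    }

    ~counted_ptr() { dec(); }      // the destructor releases one reference

    T* get() const { return ptr_; }
    std::size_t use_count() const { return count_ ? *count_ : 0; }

private:
    void inc() { if (count_) ++*count_; }

    void dec() {
        // The last owner frees both the object and the counter.
        if (count_ && --*count_ == 0) {
            delete ptr_;
            delete count_;
        }
        ptr_ = 0;
        count_ = 0;
    }

    T* ptr_;
    std::size_t* count_;
};

// Hypothetical helper type: bumps an external counter when destroyed,
// so the moment of deallocation can be observed.
struct tracked {
    explicit tracked(int* destroyed) : destroyed_(destroyed) {}
    ~tracked() { ++*destroyed_; }
    int* destroyed_;
};
```

Copying a `counted_ptr` only increments the counter; the pointee is deleted exactly once, when the last copy is destroyed.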

qalshidi commented 6 years ago

I don't know how to reproduce the problem except with my own simulation.

qalshidi commented 6 years ago

I fixed the issue :). The problem was that Armadillo was handling memory incorrectly, and because I had the ARMA_NO_DEBUG flag on I was blind to it. My recommendation to people using ViennaCL for GPU acceleration is not to worry about Armadillo speed and to keep its debug checks enabled (that is, do not define ARMA_NO_DEBUG). Thanks for your help and thanks for this great library.
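
As a rough analogy for what disabling those checks means, the sketch below uses `std::vector` rather than Armadillo so that it stays self-contained. `at()` validates the index and throws `std::out_of_range`, whereas `operator[]` performs no check, and an out-of-range write through it is undefined behavior that can silently corrupt the heap and crash much later, for example in an unrelated destructor. Defining ARMA_NO_DEBUG removes Armadillo's run-time checks such as bounds checking in a similar way. The helper name `checked_write` is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: writes 'value' at index 'i' with bounds checking.
// Returns true on success and false if the index is out of range.
bool checked_write(std::vector<double>& v, std::size_t i, double value) {
    try {
        v.at(i) = value;   // at() throws std::out_of_range on a bad index
        return true;
    } catch (const std::out_of_range&) {
        return false;      // the bad write is reported instead of corrupting memory
    }
}
```

With the check in place, an off-by-one index is reported at the faulty access instead of showing up as a segfault somewhere else.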

karlrupp commented 6 years ago

ok, so it was indeed memory corruption ;-)

Thanks for letting us know.