uwplse / herbgrind

A Valgrind tool for Herbie
GNU General Public License v3.0
90 stars 7 forks source link

Type Assert Fail in i32Uto64 #39

Closed joconnor22 closed 5 years ago

joconnor22 commented 5 years ago

Issue Testing on DualSPHysics v4.2 (smoothed particle hydrodynamics code for flow simulation) example case results in a type assertion fail in i32Uto64:

Herbgrind: runtime/shadowop/conversions.c:437 (i32Uto64): Assertion 't->values[0]->type == Vt_Single' failed.

host stacktrace:
==10881==    at 0x5807E95A: show_sched_status_wrk (m_libcassert.c:369)
==10881==    by 0x5807EA67: report_and_quit (m_libcassert.c:440)
==10881==    by 0x5807EC01: vgPlain_assert_fail (m_libcassert.c:506)
==10881==    by 0x58064B83: i32Uto64 (conversions.c:437)
==10881==    by 0x1006B0E726: ???
==10881==    by 0x1002F8DF2F: ???
==10881==    by 0x60289C: JVtk::JVtkArray::WriteBinaryInt32(std::basic_ofstream<char, std::char_traits<char> >&) const (in /home/joe/DualSPHysics/bin/linux/DualSPHysics4.2CPU_linux64)

This occurs in the initialisation before the bulk of the calculations are performed. I've tested a range of compiler settings and they all produce the same error; apart from the -O0 option which causes the same error but in a different initialisation routine slightly later on.

To get around this I did try passing the -start-off flag to herbgrind with the HERBGRIND_BEGIN() and HERBGRIND_END() macros so that I could bypass the initialisation. However this seemed to have no apparent effect and it failed with the same error in the same place (I think I'm probably missing something with how this flag works). I also tried commenting out the part of the code that calls the JVtk::JVtkArray::WriteBinaryInt32 routine since it's not a critical part of the initialisation. However, this just shifted the error to a later initialisation routine.

Steps to Reproduce

  1. Clone the DualSPHysics repo
  2. Change line 49 in the dam break example CPU run script (examples/main/01_DamBreak/xCaseDambreakVal2D_linux64_CPU.sh) to:
    /opt/herbgrind/valgrind/herbgrind-install/bin/valgrind --tool=herbgrind $dualsphysicscpu $dirout/$name $dirout -dirdataout data -svres
  3. Run the dam break example CPU run script (examples/main/01_DamBreak/xCaseDambreakVal2D_linux64_CPU.sh)

The compiler options can be changed by modifying src/source/Makefile_cpu and rebuilding.

pavpanchekha commented 5 years ago

Hi Joseph!

Sorry for the slow response on your bug report, and thank you for such a complete bug report. The Herbgrind lead author is out of the country right now, but we'll try to take a look at this bug when he gets back. Thanks again for the bug report!

HazardousPeach commented 5 years ago

Hey Joseph! Thanks for the bug report. I'll try to see if I can debug it tomorrow.

HazardousPeach commented 5 years ago

Hmm, so far seems difficult to reproduce, the version of Valgrind that ships with Herbgrind doesn't seem to be able to handle the instructions that DualSPhysics compiles to on my server machine. I'm going to try another machine, and look into updating the version of Valgrind we use.

joconnor22 commented 5 years ago

OK - thanks very much for your efforts!

HazardousPeach commented 5 years ago

I've made a few type system and low-level fixes for this benchmark. Fixed the original problem you were running into, but am still tracking down a segfault I get somewhere in our wrapping of the "pow" library function. You can get the fixes so far on the branch newer-valgrind.

joconnor22 commented 5 years ago

Hi Alex, thanks a lot for your work. I can confirm that the initial issue is fixed and now there seems to be a segfault issue occurring:

MapCells=(142,1,319)
==15869== 
==15869== Process terminating with default action of signal 11 (SIGSEGV)
==15869==  Access not within mapped region at address 0x28
==15869==    at 0x65358D: JFormatFiles2::SaveVtkShapes(std::string, std::string const&, std::string const&, std::vector<JFormatFiles2::StrShapeData, std::allocator<JFormatFiles2::StrShapeData> > const&) (in /home/joe/DualSPHysics/bin/linux/DualSPHysics4.2CPU_linux64)
==15869==    by 0x50D01F: JSph::SaveMapCellsVtk(float) const (in /home/joe/DualSPHysics/bin/linux/DualSPHysics4.2CPU_linux64)
==15869==    by 0x50D797: JSph::ConfigCellDivision() (in /home/joe/DualSPHysics/bin/linux/DualSPHysics4.2CPU_linux64)
==15869==    by 0x63F7D8: JSphCpuSingle::ConfigDomain() (in /home/joe/DualSPHysics/bin/linux/DualSPHysics4.2CPU_linux64)
==15869==    by 0x642A95: JSphCpuSingle::Run(std::string, JCfgRun*, JLog2*) (in /home/joe/DualSPHysics/bin/linux/DualSPHysics4.2CPU_linux64)
==15869==    by 0x40525E: main (in /home/joe/DualSPHysics/bin/linux/DualSPHysics4.2CPU_linux64)
==15869==  If you believe this happened as a result of a stack
==15869==  overflow in your program's main thread (unlikely but
==15869==  possible), you can try to increase the size of the
==15869==  main thread stack using the --main-stacksize= flag.
==15869==  The main thread stack size used in this run was 8388608.
==15869== 
Didn't find any marks!
/usr/local/bin/herbgrind: line 17: 15869 Segmentation fault      (core dumped) $HG "$@"
HazardousPeach commented 5 years ago

Okay, I think I fixed all failures in this benchmark, but it's still a pretty big program and takes more memory than I have to fully analyze. All fixes can be found in the newer-valgrind branch, soon to be in master.

joconnor22 commented 5 years ago

Hi Alex, thanks a lot for this. Yes the dam break benchmark problem is too big to analyse but I only included it in my original issue as it was the minimum amount of steps required for you to reproduce the issue I was having. I have tested it on a much smaller problem and I can confirm it runs all the way through without any issues and I get the .gh report at the end (when compiled without debug symbols).

The last problem I'm having now is when I compile with debug symbols the program executes fine; however, at the end when herbgrind is writing out the report there seems to be an issue:

==6028== 

Herbgrind: helper/bbuf.c:50 (printBBuf): Assertion 'printLength < bbuf->bound' failed.
Herbgrind: trying to print 0 character past bound!

host stacktrace:
==6028==    at 0x5807EC0A: show_sched_status_wrk (m_libcassert.c:369)
==6028==    by 0x5807ED17: report_and_quit (m_libcassert.c:440)
==6028==    by 0x5807EEB1: vgPlain_assert_fail (m_libcassert.c:506)
==6028==    by 0x580520A4: printBBuf (bbuf.c:48)
==6028==    by 0x5805EC22: getAddrString (shadowop-info.c:127)
==6028==    by 0x580612B7: writeOutput (output.c:234)
==6028==    by 0x580E2F71: shutdown_actions_NORETURN (m_main.c:2171)
==6028==    by 0x5813413B: run_a_thread_NORETURN (syswrap-linux.c:203)

sched status:
  running_tid=1
HazardousPeach commented 5 years ago

Hey Joseph, It looks like your debug symbols are pretty big! That error message means that the getAddrString function, which looks up a program address to produce a human-readable line number, is overflowing it's buffer. The maximum size of address strings is currently 300 characters, but you'll have to increase that number to get your benchmark to run. You can change it by changing the #define MAX_ADDR_STRING_SIZE in src/runtime/op-shadowstate/shadowop-info.c.