mom-ocean / MOM6

Modular Ocean Model

Tests that run model test cases under valgrind. #148

Closed nichannah closed 8 years ago

nichannah commented 9 years ago

Valgrind has proven to be a useful tool for finding uses of uninitialized variables. Using uninitialized variables most often leads to irreproducible results because garbage can be read out of memory.

This issue proposes an automated way to run the test cases under valgrind. This will allow bugs of this kind to be found quickly.

See also #149

adcroft commented 9 years ago

Do you have an automated way in mind?

nichannah commented 9 years ago

I do, although I would like to discuss it. I'm doing #147 in Python (there's a bit of string and file manipulation, which I think Python is good at). I can see this one fitting into a similar setup. I'll show you once I've got #147 in shape.

nichannah commented 9 years ago

Valgrind can be run manually on gaea for any test case that will run on a single PE. For example:

  1. build a debug executable (otherwise the exe will contain instructions that valgrind doesn't understand, probably vector math stuff)
  2. module load valgrind
  3. export TMPDIR=/lustre/f1/$USER/tmp
  4. aprun -n 1 valgrind --gen-suppressions=all --log-file=valgrind_log.txt --suppressions=../../../MOM6.supp ../../build/gnu/ocean_only/debug/MOM6

The suppressions file tells valgrind which errors to ignore. I'll share mine once I've completed it.

Then look in valgrind_log.txt to see memory errors.
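
As a starting point for automating the manual steps above, here is a minimal Python sketch that only assembles the command line from step 4. The function name `build_valgrind_command` and its parameters are hypothetical; the flags and paths are the ones quoted in the steps, and the `aprun -n 1` launcher is assumed to be the right one only on gaea.

```python
import shlex


def build_valgrind_command(exe, log_file="valgrind_log.txt",
                           suppressions=None, launcher=("aprun", "-n", "1")):
    """Return the argument list for one single-PE valgrind run.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    cmd = list(launcher)
    cmd += ["valgrind", "--gen-suppressions=all", f"--log-file={log_file}"]
    if suppressions is not None:
        # The suppressions file tells valgrind which errors to ignore.
        cmd.append(f"--suppressions={suppressions}")
    cmd.append(exe)
    return cmd


cmd = build_valgrind_command(
    "../../build/gnu/ocean_only/debug/MOM6",
    suppressions="../../../MOM6.supp",
)
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, cwd=...)` from within a test case directory, with `TMPDIR` set in the environment as in step 3. Returning a list rather than a string avoids shell quoting problems if a path ever contains spaces.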

For test cases that need multiple PEs, valgrind generates millions of false positives from within MPI. I'm in the process of figuring out how to filter these out properly (making a suppressions file is not feasible).
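
One possible way to filter after the fact, sketched here purely as an illustration, is to split the log into error records and drop any record whose stack contains an MPI frame. This is not the approach settled on in this thread. The function names and the `MPI_FRAME` pattern are assumptions; in particular, matching frame names that start with `mpi_` or `pmpi_` is a guess at how MPI symbols appear and would need checking against a real log.

```python
import re

# Every valgrind log line starts with "==<pid>==".
PREFIX = re.compile(r"^==\d+==\s?")
# Stack frames look like "   at 0x53FA19: name (in ...)" or "   by 0x...: name (...)".
FRAME = re.compile(r"^\s+(?:at|by) 0x[0-9A-Fa-f]+: (\S+)")
# Assumption: MPI frames have names beginning with "mpi_" or "pmpi_".
MPI_FRAME = re.compile(r"^p?mpi_", re.IGNORECASE)


def split_errors(log_text):
    """Split a log into error records (lists of lines without the pid prefix).

    Records are separated by empty "==<pid>==" lines. Blocks without any
    stack frame (such as the startup banner) are discarded.
    """
    records, current = [], []

    def flush():
        if any(FRAME.match(line) for line in current):
            records.append(list(current))
        current.clear()

    for line in log_text.splitlines():
        if not PREFIX.match(line):
            continue  # suppression blocks and program output carry no prefix
        body = PREFIX.sub("", line, count=1)
        if body.strip():
            current.append(body)
        else:
            flush()
    flush()
    return records


def frame_names(record):
    """Return the function names of all stack frames in one record."""
    names = []
    for line in record:
        match = FRAME.match(line)
        if match:
            names.append(match.group(1))
    return names


def drop_mpi_errors(records, pattern=MPI_FRAME):
    """Keep only records with no frame matching the (assumed) MPI pattern."""
    return [r for r in records
            if not any(pattern.search(name) for name in frame_names(r))]
```

Filtering on the whole stack rather than only the top frame is a deliberate choice in this sketch: it also discards errors raised deep inside MPI internals, at the cost of possibly hiding a real problem in a buffer the model passed to MPI.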

nichannah commented 9 years ago

From what I can gather, it's going to be tricky to properly run multi-PE test cases with valgrind on gaea. Usually Valgrind would handle calls to MPI by replacing the MPI library with wrappers that do certain checks before making the actual calls. This replacement is only possible if the MPI library is dynamically linked. It seems that running executables with dynamic libraries is not easy/supported on gaea. For a start, the compute nodes don't have access to the filesystem where most dynamic libraries reside (netcdf, hdf, z, math, etc). Also, I can't find a way to make the ftn compiler link some libraries as dynamic and others as static.

I still have a couple of things to try.

nichannah commented 9 years ago

I've given up on running valgrind on gaea due to gaea's limitations with shared libraries. Instead I'll try to run it on raijin, a supercomputer in Canberra, Australia.

nichannah commented 9 years ago

I'll run these tests on the Aus computer. The output will be published here:

https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/

This is what it looks like for MOM5. I think it can be cleaned up a lot (this file is ~300 MB).

https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/job/MOM5_valgrind/lastBuild/console

nichannah commented 9 years ago

https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/job/MOM6_valgrind/

nichannah commented 9 years ago

The Valgrind tests are not yet all running, but I thought it would be good to document any errors as I see them....

In global_ALE/z:

==20891== Invalid read of size 8
==20891==    at 0x53FA19: mom_tracer_hor_diff_mp_tracer_hordiff_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x989017: mom_mp_step_mom_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x72148F: MAIN__ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==  Address 0x25349540 is 16 bytes before a block of size 256 free'd
==20891==    at 0x4C27C44: free (vg_replace_malloc.c:473)
==20891==    by 0xF0231A: forfree_vm (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0xF099AD: for_write_int_fmt_xmit (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0xB6CBF8: fms_io_mod_mp_get_filename (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0xB497CE: fms_io_mod_mp_read_data_2dnew (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x735E82: mom_surface_forcing_mp_buoyancy_forcing_fromfiles (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x72B5A2: mom_surface_forcing_mp_setforcing (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x7213A0: MAIN__ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==
{
   Memcheck:Addr8
   fun:mom_tracer_hor_diff_mp_tracer_hordiff_
   fun:mom_mp_step_mom_
   fun:MAIN__
   fun:main
}
==20891== Conditional jump or move depends on uninitialised value(s)
==20891==    at 0x5EA68C: mom_restart_mp_save_restart_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x9A1DF1: mom_mp_initialize_mom_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x72062C: MAIN__ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==  Uninitialised value was created by a heap allocation
==20891==    at 0x4C2826A: malloc (vg_replace_malloc.c:296)
==20891==    by 0xF024D3: for_allocate (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x5EEAD5: mom_restart_mp_restart_init_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x99F542: mom_mp_initialize_mom_ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x72062C: MAIN__ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==
{
   Memcheck:Cond
   fun:mom_restart_mp_save_restart_
   fun:mom_mp_initialize_mom_
   fun:MAIN__
   fun:main
}

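
Because the logs are produced with --gen-suppressions=all, each error is followed by a brace-delimited suppression entry like the ones above. A rough Python sketch for collecting the unique entries from a log, for example as a starting point for a suppressions file, might look like this. The function names are hypothetical, and the sketch assumes the braces sit on their own lines without the ==pid== prefix. Valgrind's suppression format expects a name on the first line of each entry; the sketch keeps whatever lines the log contains and does not add one.

```python
def collect_suppressions(log_text):
    """Return unique '{ ... }' suppression entries in first-seen order.

    Each entry is a list of its non-empty inner lines, stripped of indentation.
    """
    blocks, seen, current = [], set(), None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "{":
            current = []                      # start of a new entry
        elif stripped == "}" and current is not None:
            key = tuple(current)
            if key not in seen:               # skip exact duplicates
                seen.add(key)
                blocks.append(current)
            current = None
        elif current is not None and stripped:
            current.append(stripped)
    return blocks


def format_suppressions(blocks):
    """Render entries back into brace-delimited text, one line per item."""
    out = []
    for block in blocks:
        out.append("{")
        out.extend("   " + entry for entry in block)
        out.append("}")
    return "\n".join(out) + "\n"
```

Deduplicating on the full entry keeps the output small when the same error fires on every time step, which is likely part of why the raw logs grow so large.
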
nichannah commented 8 years ago

These can be found here: https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/job/MOM6_runtime_analyzer/