Closed: nichannah closed this issue 8 years ago
Do you have an automated way in mind?
I do, although I would like to discuss it. I'm doing #147 in Python (there's a bit of string and file manipulation, which I think Python is good at). I can see this one fitting into a similar setup. I'll show you once I've got #147 in shape.
Valgrind can be run manually on gaea for any test case that will run on a single PE. For example:
The suppressions file tells valgrind which errors to ignore. I'll share mine once I've completed it.
Then look in valgrind_log.txt to see memory errors.
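As a rough sketch of how a single-PE run like this could be scripted from Python: the executable name `./MOM6` and the suppressions file name `mom6.supp` below are placeholders, not the actual files used on gaea. Only the standard valgrind options `--suppressions` and `--log-file` are assumed.

```python
import shlex


def valgrind_command(exe, suppressions, log_file="valgrind_log.txt"):
    """Build the argument list for a single-PE valgrind run."""
    return [
        "valgrind",
        f"--suppressions={suppressions}",  # errors valgrind should ignore
        f"--log-file={log_file}",          # write reports to a file instead of stderr
        exe,
    ]


# Placeholder names for illustration only.
cmd = valgrind_command("./MOM6", "mom6.supp")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, cwd=...)` from inside a test case directory, after which the memory errors end up in valgrind_log.txt.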
For test cases that need multiple PEs, valgrind generates millions of false positives from within MPI. I'm in the process of figuring out how to filter these out properly (making a suppressions file is not feasible).
From what I can gather, it's going to be tricky to properly run multi-PE test cases with valgrind on gaea. Usually valgrind handles calls to MPI by replacing the MPI library with wrappers that do certain checks before making the actual calls. This replacement is only possible if the MPI library is dynamically linked. It seems that running executables with dynamic libraries is not easy or supported on gaea. For a start, the compute nodes don't have access to the filesystem where most dynamic libraries reside (netcdf, hdf, z, math, etc.). Also, I can't find a way to make the ftn compiler link some libraries as dynamic and others as static.
I still have a couple of things to try.
I've given up on running valgrind on gaea due to gaea's limitations with shared libraries. Instead I'll try to run it on raijin, a supercomputer in Canberra, Australia.
I'll run these tests on the Aus computer. The output will be published here:
https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/
This is what it looks like for MOM5. I think it can be cleaned up a lot (this file is ~300 MB).
https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/job/MOM5_valgrind/lastBuild/console
The valgrind tests are not all running yet, but I thought it would be good to document any errors as I see them.
In global_ALE/z:
==20891== Invalid read of size 8
==20891==    at 0x53FA19: mom_tracer_hor_diff_mp_tracerhordiff (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x989017: mom_mp_stepmom (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x72148F: MAIN (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==  Address 0x25349540 is 16 bytes before a block of size 256 free'd
==20891==    at 0x4C27C44: free (vg_replace_malloc.c:473)
==20891==    by 0xF0231A: forfree_vm (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0xF099AD: for_write_int_fmt_xmit (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0xB6CBF8: fms_io_mod_mp_get_filename (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0xB497CE: fms_io_mod_mp_read_data_2dnew (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x735E82: mom_surface_forcing_mp_buoyancy_forcing_fromfiles (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x72B5A2: mom_surface_forcing_mp_setforcing (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x7213A0: MAIN__ (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891==    by 0x40C23B: main (in /short/v45/nah599/more_home/MOM6-examples/build/intel/ocean_only/debug/MOM6)
==20891== {
These can be found here: https://climate-cms.nci.org.au/jenkins/job/mom-ocean.org/job/MOM6_runtime_analyzer/
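Since the raw logs are large, one way to make them easier to read is to collapse each report to its headline and innermost stack frame. The following is only a sketch of that idea, not what the Jenkins job does. It assumes the usual memcheck layout, where reports are separated by blank `==PID==` lines and the first `at 0x...` line is the innermost frame. The function name `count_reports` and the `/path/to/MOM6` path in the sample are made up for illustration.

```python
import re
from collections import Counter

PREFIX = re.compile(r"^==\d+==\s*")                 # the "==PID== " marker
AT_FRAME = re.compile(r"^at 0x[0-9A-Fa-f]+: (\S+)")  # innermost frame of a trace


def count_reports(log_text):
    """Count reports keyed by (headline, innermost frame)."""
    counts = Counter()
    headline = None   # first line of the report being read
    counted = False   # True once this report's innermost frame was recorded
    for line in log_text.splitlines():
        body = PREFIX.sub("", line).strip()
        if not body:
            # A blank line ends the current report.
            headline, counted = None, False
        elif headline is None:
            headline = body
        elif not counted:
            match = AT_FRAME.match(body)
            if match:
                counts[(headline, match.group(1))] += 1
                counted = True
    return counts


# Shortened sample with a placeholder path.
sample = "\n".join([
    "==20891== Invalid read of size 8",
    "==20891==    at 0x53FA19: mom_tracer_hor_diff_mp_tracerhordiff (in /path/to/MOM6)",
    "==20891==    by 0x989017: mom_mp_stepmom (in /path/to/MOM6)",
    "==20891==  Address 0x25349540 is 16 bytes before a block of size 256 free'd",
    "==20891==    at 0x4C27C44: free (vg_replace_malloc.c:473)",
    "==20891==",
])

for (headline, frame), n in count_reports(sample).items():
    print(n, headline, "in", frame)
```

Only the first `at` line of each report is counted, so the `free` frame that belongs to the "Address ... free'd" block is not mistaken for a separate error.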
Valgrind has proven to be a useful tool for finding uses of uninitialized variables. Using uninitialized variables most often leads to unreproducible results, because whatever garbage happens to be in memory gets read.
This issue proposes an automated way to run the test cases under valgrind. This will allow bugs of this kind to be found quickly.
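One possible shape for such automation, as a sketch only: run each test case directory under valgrind and read the error count from the `ERROR SUMMARY` line that valgrind writes at the end of its log. The function names `error_count`, `run_case` and `run_all` are hypothetical, as is the assumption that every test case can be launched by running a single executable from inside its directory.

```python
import re
import subprocess
from pathlib import Path

SUMMARY = re.compile(r"ERROR SUMMARY: (\d+) errors from (\d+) contexts")


def error_count(log_text):
    """Return the error count from the last ERROR SUMMARY line, or None."""
    matches = SUMMARY.findall(log_text)
    return int(matches[-1][0]) if matches else None


def run_case(case_dir, exe, log_name="valgrind_log.txt"):
    """Run one test case under valgrind and return its error count."""
    subprocess.run(
        ["valgrind", f"--log-file={log_name}", str(Path(exe).resolve())],
        cwd=case_dir,   # the model reads its input files from the case directory
        check=False,    # a failing run should not stop the whole sweep
    )
    return error_count((Path(case_dir) / log_name).read_text())


def run_all(case_dirs, exe):
    """Map each test case directory to its valgrind error count."""
    return {str(d): run_case(d, exe) for d in case_dirs}


# Parsing a summary line without running anything:
line = "==20891== ERROR SUMMARY: 3 errors from 2 contexts (suppressed: 0 from 0)"
print(error_count(line))
```

A test case whose count is not zero (or is `None` because the run never reached the summary) can then be flagged for a closer look at its log.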
See also #149