milc-qcd / milc_qcd

MILC collaboration code for lattice QCD calculations

Multiple run-time errors issued when MILC is compiled with address sanitizer enabled #37

Open maddyscientist opened 4 years ago

maddyscientist commented 4 years ago

When MILC is compiled with the address sanitizer enabled, multiple run-time errors are reported when running the NERSC small RHMD benchmark on 3 processes.

The first issue is in ranstuff.c. It looks like a simple case of the seed arithmetic producing a value larger than a signed 32-bit integer can represent.

LAYOUT = Hypercubes, options = hyper_prime,
QMP with automatic hyper_prime layout
ON EACH NODE (RANK) 18 x 18 x 18 x 12
../generic/ranstuff.c:75:27: runtime error: signed integer overflow: 4563421 * 1749223 cannot be represented in type 'int'
../generic/ranstuff.c:77:27: runtime error: signed integer overflow: -1903219036 * 1749223 cannot be represented in type 'int'
../generic/ranstuff.c:79:27: runtime error: signed integer overflow: -806615499 * 1749223 cannot be represented in type 'int'
../generic/ranstuff.c:81:27: runtime error: signed integer overflow: -2086651380 * 1749223 cannot be represented in type 'int'
../generic/ranstuff.c:83:27: runtime error: signed integer overflow: -759901939 * 1749223 cannot be represented in type 'int'
../generic/ranstuff.c:85:27: runtime error: signed integer overflow: -1405893900 * 1749223 cannot be represented in type 'int'
../generic/ranstuff.c:87:27: runtime error: signed integer overflow: -981149083 * 1749223 cannot be represented in type 'int'
../generic/ranstuff.c:89:27: runtime error: signed integer overflow: -1085755044 * 1749223 cannot be represented in type 'int'
Mallocing 109.7 MBytes per node for lattice
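
One way to avoid this class of report is to do the multiply in unsigned arithmetic, where wraparound modulo 2^32 is defined behaviour. The following is a minimal sketch, not the code in ranstuff.c: the names lcg_step and as_signed32 are hypothetical, and the increment 12345 is an assumption, chosen because together with the multiplier 1749223 from the log it maps the first operand 4563421 to the second operand -1903219036.

```c
#include <stdint.h>

/* Hypothetical helper (not taken from ranstuff.c): one linear congruential
 * step. The product is formed in 64 bits and then truncated to 32 bits, so
 * the result wraps modulo 2^32 without any signed overflow. */
static uint32_t lcg_step(uint32_t seed, uint32_t multiplier, uint32_t increment)
{
    return (uint32_t)((uint64_t)multiplier * seed + increment);
}

/* Map an unsigned 32-bit value onto the two's-complement signed value with
 * the same bit pattern, without relying on implementation-defined casts. */
static int32_t as_signed32(uint32_t x)
{
    if (x <= (uint32_t)INT32_MAX)
        return (int32_t)x;
    return -(int32_t)(UINT32_MAX - x) - 1;
}
```

With these assumed constants, lcg_step(4563421u, 1749223u, 12345u) yields 2391748260, which as_signed32 maps to -1903219036, the left-hand operand in the second log line. Whether the generator should keep exactly this wrapped sequence is a question for the maintainers.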

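The second issue is in io_lat4.c and appears to be a related 32-bit issue: a 32-bit unsigned int is shifted by 32 bits, which is undefined behaviour in C because the shift count equals the width of the type.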

mass 0.5
naik_term_epsilon 0
error_for_propagator 1e-08
rel_error_for_propagator 0
reload_parallel 18x18x18x36.chklat
forget 
../generic/io_lat4.c:1495:43: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
../generic/io_lat4.c:1496:43: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
../generic/io_lat4.c:1496:43: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
../generic/io_lat4.c:1495:43: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
../generic/io_lat4.c:1496:43: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
../generic/io_lat4.c:1495:43: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
Restored binary gauge configuration in parallel from file 18x18x18x36.chklat
Time stamp Wed Nov  4 17:32:43 2015
Checksums 63b670e1 16bbc0f1 OK
Time to reload gauge configuration = 6.549597e-02
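
The expression at io_lat4.c:1495 is not shown in this report, so the following is only a sketch of two common ways to keep a 32-bit shift well defined when the count can reach 32. Both helper names (rotl32, shr32) are hypothetical, and the guess that the shift comes from a rotate-style idiom such as x >> (32 - r) with r equal to 0 is an assumption.

```c
#include <stdint.h>

/* Hypothetical helper: rotate left by r bits. Masking both shift counts
 * keeps them in 0..31, so r == 0 (or r == 32) never shifts by 32. */
static uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31u;
    return (x << r) | (x >> ((32u - r) & 31u));
}

/* Hypothetical helper: logical right shift that returns 0 once every bit
 * has been shifted out, instead of invoking undefined behaviour. */
static uint32_t shr32(uint32_t x, unsigned n)
{
    return (n >= 32u) ? 0u : (x >> n);
}
```

For example, rotl32(0x12345678u, 8) gives 0x34567812, and rotl32 with a count of 0 returns its input unchanged. Which variant fits depends on what the original expression is meant to compute.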

maddyscientist commented 4 years ago

I should add that enabling the address sanitizer (ASAN) and undefined behaviour sanitizer (UBSAN) requires only a trivial change to the Makefile (both are supported by clang and modern gcc):

CDEBUG += -fsanitize=address,undefined
LDFLAGS += -fsanitize=address,undefined