lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

tunecache: *** buffer overflow detected *** #177

Closed fwinter closed 9 years ago

fwinter commented 9 years ago

Recent master (becc781a) crashes for me with

*** buffer overflow detected ***

while loading the tunecache file. This file was created by a previous Chroma+QUDA run that didn't make it to endQuda, so it was created and updated only a few times. That was on a 4 x K40m node, geom 1 1 1 4, 24^3x64. GCC 4.8.2, CUDA 6.5, driver 340.29.

Backtrace:

*** buffer overflow detected ***:

#0  0x00007ffff3d07925 in raise () from /lib64/libc.so.6
#1  0x00007ffff3d09105 in abort () from /lib64/libc.so.6
#2  0x00007ffff3d45837 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff3dd7827 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007ffff3dd5710 in __chk_fail () from /lib64/libc.so.6
#5  0x00007ffff3dd4b69 in _IO_str_chk_overflow () from /lib64/libc.so.6
#6  0x00007ffff3d49939 in _IO_default_xsputn_internal () from /lib64/libc.so.6
#7  0x00007ffff3d1d490 in vfprintf () from /lib64/libc.so.6
#8  0x00007ffff3dd4c0d in __vsprintf_chk () from /lib64/libc.so.6
#9  0x00007ffff3dd4b4f in __sprintf_chk () from /lib64/libc.so.6
#10 0x0000000001e6fa1c in quda::deserializeTuneCache(std::basic_istream<char, std::char_traits<char> >&) () at /usr/include/bits/stdio2.h:35
#11 0x0000000001e71bdb in quda::loadTuneCache(QudaVerbosity_s) () at tune.cpp:174
#12 0x0000000001d8d9bc in initQudaMemory () at interface_quda.cpp:423

I uploaded the tunecache file to dropbox:

https://www.dropbox.com/s/4zotb81oiqenzjv/tunecache.tsv

I could reproduce the issue with a boiled-down QUDA application that only initializes the device and memory.

mathiaswagner commented 9 years ago

Just looked at the tunecache.tsv file and it looks unusual to me. Have you tried letting the run create a fresh tunecache?

fwinter commented 9 years ago

That's what I did when it crashed. For clarification: the file was created by the same QUDA version as stated above.

mathiaswagner commented 9 years ago

Ok, my first guess here: the tunecache.tsv file looks strange. It is read using sprintf(), which is known to be prone to buffer overflows.

A different QUDA version should have stopped because we check for the QUDA hash in the first line of the tunecache file:

tunecache   0.7.0   cpu_arch=x86_64,gpu_arch=sm_35,cuda_version=6050    # Last updated Wed Nov 26 11:19:27 2014

For me the first line with actual data reads like

N4quda10PackSpinorIddLi4ELi3ENS_11FloatNOrderIdLi4ELi3ELi2EEENS_16QDPJITDiracOrderIdLi4ELi3EEENS_11NonRelBasisIddLi4ELi3EEEEE       224 1   1   494 1   1   43200   # 0.00 Gflop/s, 150.70 GB/s, tuned Wed Nov 26 11:15:59 2014

while it should read like

12x24x24x24     N4quda8xmyNorm2Id6float2S1_EE   vol=165888,stride=172800,precision=4    352     1       1       69      1       1       2816    # 32.42 Gflop/s, 129.67 GB/s, tuned Wed Nov 26 14:15:05 2014

So for some reason parts of the entry are missing: the leading volume string and the vol=/stride=/precision= field. The resulting buffer overflow is then obvious given the use of sprintf, which we should change to something like snprintf() or another variant that is not prone to buffer overflows.

The real remaining question is why a corrupt tunecache.tsv was written in the first place.
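A truncated entry like the one above could in principle be rejected before it reaches any formatting call by counting its fields. The following is a minimal sketch, not QUDA's parser: the names `split_fields` and `looks_complete` are hypothetical, and the layout (three key strings, seven integers, then a `#` comment) is only inferred from the sample line quoted in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of QUDA): split one tunecache line into
// whitespace-separated fields, ignoring everything from '#' onwards.
std::vector<std::string> split_fields(const std::string &line)
{
  std::vector<std::string> fields;
  // find() returns npos when there is no comment, so substr keeps the whole line.
  std::istringstream in(line.substr(0, line.find('#')));
  std::string tok;
  while (in >> tok) fields.push_back(tok);
  return fields;
}

// Assumed layout, taken from the well-formed sample line in this thread:
// volume, kernel name, aux string, followed by seven integers.
bool looks_complete(const std::string &line)
{
  const std::size_t expected = 3 + 7;
  return split_fields(line).size() == expected;
}

// Usage check with shortened versions of the two lines quoted above.
void demo_field_count()
{
  assert(looks_complete("12x24x24x24\tN4quda8xmyNorm2Id6float2S1_EE\t"
                        "vol=165888,stride=172800,precision=4\t"
                        "352\t1\t1\t69\t1\t1\t2816\t# tuned"));
  assert(!looks_complete("N4quda10PackSpinor\t224\t1\t1\t494\t1\t1\t43200\t# tuned"));
}
```

With this check, the corrupt line would be reported as incomplete (8 fields instead of 10) rather than being copied into fixed-size buffers.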

maddyscientist commented 9 years ago

Ok, I see the problem here. Will fix this when I return from my mini vacation in the desert (Sunday).

maddyscientist commented 9 years ago

I've pushed a couple of changes that should address this bug:

  1. sprintf has been replaced by snprintf everywhere, and the return value is checked to ensure that the write completed correctly.
  2. The volume string should now be correctly printed for all kernels.
  3. A potential issue in the mixed-precision blas kernels, where a buffer overflow could occur, has been resolved.
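As an illustration of the first point, the pattern of a bounded write with a checked return value can be sketched as follows. This is not the code from the actual commits; `checked_format` is a hypothetical helper, and the 8-byte buffer is chosen only to make truncation easy to trigger.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>

// Hypothetical helper (not QUDA's code): format into a fixed-size buffer
// and report whether the complete string fit.
bool checked_format(char *dst, std::size_t size, const char *fmt, ...)
{
  va_list args;
  va_start(args, fmt);
  // vsnprintf never writes more than `size` bytes and returns the length
  // the full output would have had (excluding the terminating null).
  int n = std::vsnprintf(dst, size, fmt, args);
  va_end(args);
  // n < 0 signals an encoding error; n >= size means the output was truncated.
  return n >= 0 && static_cast<std::size_t>(n) < size;
}

// Usage check: a short string fits, an over-long one is reported, not overflowed.
void demo_checked_format()
{
  char buf[8];
  assert(checked_format(buf, sizeof(buf), "%s", "abc"));
  assert(!checked_format(buf, sizeof(buf), "%s", "12x24x24x24"));
  assert(buf[sizeof(buf) - 1] == '\0'); // truncated output is still terminated
}
```

The point of the return-value check is that truncation becomes a reportable error at the call site instead of silent memory corruption.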

Frank, can you please retest and confirm that the issue is no longer present? If nothing else, any overflow issues should now be reported thanks to the use of snprintf.

maddyscientist commented 9 years ago

I see that the issue fixed in commit 859421f47a40bff0959074fc4adff62eb80d426c was most likely the trigger for the overflow. Nevertheless, the changes I just made will make the tuning much more robust.