Closed fwinter closed 9 years ago
I just looked at the tunecache.tsv file and it looks unusual to me. Did you try what happens if you let the run create a fresh tunecache?
That's what I did when it crashed. For clarification: The file was created by the same Quda version as stated above.
Ok, my first guess here: the tunecache.tsv file looks strange. It is read using sprintf(), which is known to be prone to buffer overflows.
A different QUDA version should have stopped, because we check for the QUDA hash in the first line of the tunecache file:
tunecache 0.7.0 cpu_arch=x86_64,gpu_arch=sm_35,cuda_version=6050 # Last updated Wed Nov 26 11:19:27 2014
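A check of this kind could look roughly like the sketch below. It is only an illustration, not QUDA's actual code: the helper name `headerMatches` is hypothetical, and it compares the version token shown in the header line above rather than whatever hash QUDA really uses.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of QUDA): returns true only if the header
// line starts with the tag "tunecache" followed by the expected version.
bool headerMatches(const std::string& header, const std::string& expectedVersion) {
  std::istringstream in(header);
  std::string tag, version;
  if (!(in >> tag >> version)) return false;  // fewer than two tokens: reject
  return tag == "tunecache" && version == expectedVersion;
}
```

A loader would call this on the first line and refuse to read the rest of the file when it returns false.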
For me the first line with actual data reads like
N4quda10PackSpinorIddLi4ELi3ENS_11FloatNOrderIdLi4ELi3ELi2EEENS_16QDPJITDiracOrderIdLi4ELi3EEENS_11NonRelBasisIddLi4ELi3EEEEE 224 1 1 494 1 1 43200 # 0.00 Gflop/s, 150.70 GB/s, tuned Wed Nov 26 11:15:59 2014
while it should read like
12x24x24x24 N4quda8xmyNorm2Id6float2S1_EE vol=165888,stride=172800,precision=4 352 1 1 69 1 1 2816 # 32.42 Gflop/s, 129.67 GB/s, tuned Wed Nov 26 14:15:05 2014
So for some reason some parts of the entry, including the volume, are missing. The resulting buffer overflow is then obvious given the use of sprintf(), which we should change to something like snprintf() or another variant that is not prone to buffer overflows.
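For illustration, here is a minimal sketch of bounded formatting with snprintf(). The exact sprintf() call in QUDA is not shown in this thread, so the helper `formatKey` and the fields it joins are assumptions made only for the example.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: writes "volume<TAB>name<TAB>aux" into a fixed-size buffer.
// Returns false if formatting failed or the text did not fit, instead of
// writing past the end of the buffer as sprintf() could.
bool formatKey(char* buf, std::size_t size, const std::string& volume,
               const std::string& name, const std::string& aux) {
  // snprintf writes at most size-1 characters plus the terminating '\0' and
  // returns the length the full string would have had.
  int n = std::snprintf(buf, size, "%s\t%s\t%s",
                        volume.c_str(), name.c_str(), aux.c_str());
  return n >= 0 && static_cast<std::size_t>(n) < size;
}
```

With this pattern an oversized entry is truncated and reported to the caller rather than aborting the process with "buffer overflow detected".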
The real question remaining is why a corrupt tunecache.tsv was written in the first place.
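Whatever the root cause, a loader could also reject malformed entries before using them. The sketch below is an assumption-laden illustration, not QUDA code: it assumes fields are whitespace-separated and that a well-formed entry has 10 fields before the trailing `#` comment, as in the good line quoted above. The function names are hypothetical.

```cpp
#include <sstream>
#include <string>

// Counts whitespace-separated tokens that appear before the "#" comment.
int countFields(const std::string& line) {
  // find() returns npos when there is no '#', so substr keeps the whole line.
  std::istringstream in(line.substr(0, line.find('#')));
  std::string token;
  int n = 0;
  while (in >> token) ++n;
  return n;
}

// Assumption: 10 fields per entry, taken from the well-formed example line.
bool looksWellFormed(const std::string& line) {
  return countFields(line) == 10;
}
```

Applied to the two lines quoted above, the well-formed entry has 10 fields while the corrupt one has only 8, so it would be skipped.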
Ok, I see the problem here. Will fix this when I return from my mini vacation in the desert (Sunday).
I've pushed a couple of changes that should address this bug:
Frank, can you please retest and confirm the issue is no longer present? If nothing else, any overflow issues should now be reported thanks to the use of snprintf.
I see that the issue fixed in commit 859421f47a40bff0959074fc4adff62eb80d426c was most likely the trigger for the overflow. Nevertheless, the changes I just made will make the tuning much more robust.
Recent master (becc781a) crashes for me with
* buffer overflow detected *
during loading of the tunecache file. This file was created by a previous Chroma+QUDA run that did not make it to endQuda, so it was created and updated only a few times. That was on a 4 x K40m node, geom 1 1 1 4, 24^3x64. GCC 4.8.2, CUDA 6.5, driver 340.29.
Backtrace:
* buffer overflow detected *:
0 0x00007ffff3d07925 in raise () from /lib64/libc.so.6
1 0x00007ffff3d09105 in abort () from /lib64/libc.so.6
2 0x00007ffff3d45837 in __libc_message () from /lib64/libc.so.6
3 0x00007ffff3dd7827 in __fortify_fail () from /lib64/libc.so.6
4 0x00007ffff3dd5710 in __chk_fail () from /lib64/libc.so.6
5 0x00007ffff3dd4b69 in _IO_str_chk_overflow () from /lib64/libc.so.6
6 0x00007ffff3d49939 in _IO_default_xsputn_internal () from /lib64/libc.so.6
7 0x00007ffff3d1d490 in vfprintf () from /lib64/libc.so.6
8 0x00007ffff3dd4c0d in __vsprintf_chk () from /lib64/libc.so.6
9 0x00007ffff3dd4b4f in __sprintf_chk () from /lib64/libc.so.6
10 0x0000000001e6fa1c in quda::deserializeTuneCache(std::basic_istream<char, std::char_traits<char> >&) () at /usr/include/bits/stdio2.h:35
11 0x0000000001e71bdb in quda::loadTuneCache(QudaVerbosity_s) () at tune.cpp:174
12 0x0000000001d8d9bc in initQudaMemory () at interface_quda.cpp:423
I uploaded the tunecache file to dropbox:
https://www.dropbox.com/s/4zotb81oiqenzjv/tunecache.tsv
I could reproduce the issue with a boiled-down QUDA application that only initializes the device and memory.