Closed blakejohnson closed 8 years ago
Thanks for the report. I've never seen this, so if you really have no objections it would be useful for me to try your code. It sounds like you've tried the obvious things. Did it used to work without a segfault, or have you only started using `@sprofile` in this fashion recently?
This might be an OSX vs Linux thing, of course. Since I was writing it on Linux I went with its most modern timer API, then later learned that it's not available on BSD. But Linux can also use the same older timer API as OSX. So if I can't reproduce the segfault I'll compile a special version and test the OSX timers. I did that when I first developed the OSX support, but I bet I didn't give it the same kind of thorough workout you're giving it.
Sure, the code is here: https://gist.github.com/blakejohnson/5600044
Running this causes my machine to segfault:

```julia
require("SimSpeedTest.jl")
require("Profile")
using SProfile
@sprofile for ct=1:10 run_sim(); end
```
I'll definitely have to try the BSD timers. I was able to run that last line ~6 times without issue.
Hopefully this weekend.
I'll also have a look on another computer to see if it is something peculiar to my work machine.
--Blake
FWIW, here is an example backtrace upon segfault:
```
#0  0x00007fff8916a2b2 in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
#1  0x00007fff8916a1fd in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncoding ()
#2  0x00007fff8916a171 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step ()
#3  0x0000000100097a66 in rec_backtrace ()
#4  0x00000001018f2d18 in sprofile_bt ()
#5  <signal handler called>
#6  0x00000001048259d1 in zhemv_ ()
#7  0x00000001050357a7 in zhetd2_ ()
```
OK, I can confirm the issue on another Mac, this one with OS X 10.8.2. The segfault backtrace is basically the same.
OK, I commented out the Linux timers and rebuilt. I still don't get the segfault. Then I ran it under valgrind as

```
valgrind --track-origins=yes /home/tim/src/julia/usr/bin/julia-debug-readline
```
I got these errors:

```
==1537== Thread 2:
==1537== Syscall param msync(start) points to unaddressable byte(s)
==1537==    at 0x50684CD: ??? (syscall-template.S:82)
==1537==    by 0x5E69ADE: msync_validate (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E69C0B: validate_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E69D48: access_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E67A2C: dwarf_get (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E67CEA: _ULx86_64_access_reg (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E675AA: _ULx86_64_get_reg (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E6C9FB: apply_reg_state (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E6D1AA: _ULx86_64_dwarf_find_save_locs (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E691BB: _ULx86_64_dwarf_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E67E97: _ULx86_64_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x53A5987: rec_backtrace (task.c:506)
==1537==  Address 0xd99d000 is not stack'd, malloc'd or (recently) free'd
==1537==
==1537== Syscall param msync(start) points to uninitialised byte(s)
==1537==    at 0x50684CD: ??? (syscall-template.S:82)
==1537==    by 0x5E69ADE: msync_validate (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E69C0B: validate_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E69D48: access_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E6AB8C: dwarf_get (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E6CC5A: apply_reg_state (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E6D1AA: _ULx86_64_dwarf_find_save_locs (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E691BB: _ULx86_64_dwarf_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x5E67E97: _ULx86_64_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537==    by 0x53A5987: rec_backtrace (task.c:506)
==1537==    by 0xE3FE8B7: sprofile_bt (profile.c:27)
==1537==    by 0x646E49F: ??? (in /lib/x86_64-linux-gnu/libc-2.15.so)
==1537==  Address 0xd99e000 is on thread 2's stack
==1537==  Uninitialised value was created by a stack allocation
==1537==    at 0x53A58E1: rec_backtrace (task.c:499)
==1537==
```
See also: https://github.com/JuliaLang/julia/issues/782
I don't know what to do here. Looks like it could be a bug in libunwind. CCing @JeffBezanson.
I also get the feeling that it is a libunwind bug.
On the Mac, valgrind crashes before it finishes loading the Julia REPL. So, I can't show similar output...
Hmm, that's not good!
Sounds like one of us should submit a bug report to libunwind. Do you do C?
Right. Reducing this to a bug report that doesn't require Julia is the trick. Unfortunately, it also looks like Apple's updates to libunwind have not been accepted upstream. So, do we submit a bug with Apple or the libunwind project?
Ugh.
I believe the reason libunwind is calling `msync` here is just to check its return code to see whether a memory region is valid. So it is expected that it will sometimes pass invalid addresses to `msync`.
> Right. Reducing this to a bug report that doesn't require Julia is the trick. Unfortunately, it also looks like Apple's updates to libunwind have not been accepted upstream. So, do we submit a bug with Apple or the libunwind project?
Dunno. I assume we get "ours" straight from libunwind? In that case, I'd consider starting with libunwind and see where that gets us.
@JeffBezanson, thanks for checking. Obviously it might be more informative to run it on the mac, if valgrind weren't crashing.
It turns out that we only build libunwind on linux and FreeBSD. So, this must be a bug in Apple's implementation. I think Apple still runs their radar bug tracker. So, I guess I will file there.
Thanks for continuing to pursue this! Profile users will owe you their thanks.
CC @ViralBShah.
We can certainly build libunwind on OS X, if there is a patched version that works.
So, I've been looking more into this over the last few days. It's difficult to debug because libunwind is very low-level. I've only really unearthed two new pieces of information:
That does sound like progress. Viral, you know a lot about Julia's build process, is this theory likely?
I do not have any more insight here. Since gfortran is provided by the user, I can imagine that different builds link to different libraries. I personally have now moved towards using `brew` for all the dependencies to build julia. OpenBLAS and gfortran do not seem to be linking to anything weird, though:
```
otool -L libopenblas.dylib
libopenblas.dylib:
	@rpath/libopenblas.dylib (compatibility version 0.0.0, current version 0.0.0)
	/usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgfortran.3.dylib (compatibility version 4.0.0, current version 4.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)
	/usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libquadmath.0.dylib (compatibility version 1.0.0, current version 1.0.0)
```
OpenBLAS does lots of tricks under the hood, and perhaps @xianyi can tell us whether it is likely to interfere with profiling.
Also, the last time I checked (probably a year ago, when we got backtraces), upstream `libunwind` did not build on the Mac, and you have to use Apple's. Perhaps that has changed now.
I notice that libprofile is built with gcc on the Mac, instead of with clang. Could this be an issue?
I built with clang and still get the segfault.
If you then do

```
otool -L /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgfortran.3.dylib
```

you will notice that it links to `/usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgcc_s.1.dylib`, whereas `libjulia-release.dylib` links to `/usr/lib/libstdc++.6.dylib`. Is this a problem?
I'll check what this looks like on linux.
On Ubuntu 12.04:
```
ldd libopenblas.so
	linux-gate.so.1 =>  (0xb76f8000)
	libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb6a02000)
	libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb69e7000)
	libgfortran.so.3 => /usr/lib/i386-linux-gnu/libgfortran.so.3 (0xb68ee000)
	libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb6744000)
	/lib/ld-linux.so.2 (0xb76f9000)
	libquadmath.so.0 => /usr/lib/i386-linux-gnu/libquadmath.so.0 (0xb66cc000)
	libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb66ae000)

ldd /usr/lib/i386-linux-gnu/libgfortran.so.3
	linux-gate.so.1 =>  (0xb76ed000)
	libquadmath.so.0 => /usr/lib/i386-linux-gnu/libquadmath.so.0 (0xb7561000)
	libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7535000)
	libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7516000)
	libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb736c000)
	/lib/ld-linux.so.2 (0xb76ee000)

ldd libjulia-release.so
	linux-gate.so.1 =>  (0xb77bf000)
	libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xb681a000)
	librt.so.1 => /lib/i386-linux-gnu/librt.so.1 (0xb6811000)
	libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb67f5000)
	libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb670c000)
	libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb66e0000)
	libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb66c2000)
	libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb6518000)
	/lib/ld-linux.so.2 (0xb77c0000)
```
So, both libgfortran and libjulia link to libgcc_s.so.1. This is at least different from the mac.
That doesn't seem to be a problem to me. gfortran is not provided with Xcode and hence it brings its own version of libgcc. libjulia needs libstdc++, which comes from the C toolchain. It seems that the linking and loading is happening correctly.
Alright, I guess it is time to look in a different direction. My test script spends most of its time in OpenBLAS, so perhaps it is just a coincidence that I have never seen it crash outside of OpenBLAS. I'll try some of the perf2 demos that are pure Julia and see if I can get the profiler to crash.
Well, I cannot get the laplace or Go benchmarks to segfault during profiling. So, perhaps OpenBLAS or ccall are important.
@xianyi Any wisdom from you here would be useful. This is a blocker for having a sampling profiler as part of Julia base, since it crashes while OpenBLAS is executing.
Hi @blakejohnson , @ViralBShah ,
Could you try the latest develop branch? We fixed some bugs in level-2 BLAS.
Is it multi-threaded OpenBLAS? Could you try a single thread, or build OpenBLAS with `USE_OPENMP=1`?
Xianyi
The problem exists with both OPENBLAS_NUM_THREADS=1 and values greater than 1. @ViralBShah what's the easiest way to build against the OpenBLAS develop branch? We pull in OpenBLAS as a tar.gz file rather than as a submodule. I suppose we just need to point to the appropriate zip file...
I believe if you change the version to `develop` in Versions.make, it may work. I haven't tried this in a while.
Changing the version to `develop` does work. Building against the develop branch of OpenBLAS changed the behavior a little bit: whereas my test script would previously always cause a segfault, it now only occasionally segfaults. When it does segfault, `zhemv_` is the last OpenBLAS call on the stack.
So, is there a chance we are looking at some subtle conflict between Apple's libunwind and OpenBLAS? I suggest it only because libunwind seems to work by inspecting certain registers, and OpenBLAS contains hand-rolled assembly code which might modify registers in a way that is unexpected by libunwind.
That is what I suspect as well, but it is a rather hand-wavy explanation.
However, `zhemv_` is written in C: see OpenBLAS/interface/zhemv.c.
Fair enough. I have also seen segfaults in `zaxpy_`, `zdscal_`, and `zheevr_`.
Could it have to do with OpenBLAS possibly having been compiled with `-fomit-frame-pointer`?
As a test case, could you try profiling code that does not link to BLAS, but calls other C libraries, say `fft`?
Looks like I can also get a segfault while profiling `fft`. It happens much less frequently, though. I'm trying to pull a backtrace now.
This is why the julia src is compiled with `-fno-omit-frame-pointer`, so that libunwind can work correctly with it. Probably, if we want profiling to work with all the libraries, they will all need this option too. It's a bit of work, but shouldn't be too difficult.
What about trying the same with `rand`?
It took quite a few iterations in gdb, but I finally got a segfault and backtrace. Strangely, it didn't bail in FFTW:
```
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xfffffffffffffff0
0x00007fff8def02b2 in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
(gdb) bt
#0  0x00007fff8def02b2 in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
#1  0x00007fff8def01fd in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncoding ()
#2  0x00007fff8def0171 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step ()
#3  0x0000000100099646 in rec_backtrace ()
#4  0x0000000100ff3d18 in sprofile_bt ()
#5  <signal handler called>
#6  0x00007fff8e8791c6 in gettimeofday ()
```
At least, I don't think `gettimeofday` is in FFTW.
@ViralBShah, this seems like great detective work. But things like `gettimeofday` probably can't be compiled with `-fno-omit-frame-pointer`.
Interestingly, `rand` does not bail either, and it is compiled with `-fomit-frame-pointer`, as is fftw. I am still inclined to try building OpenBLAS with this flag or in debug mode. It is likely that we are not getting an accurate stack trace from libopenblas since it is compiled with optimizations.
Another alternative would be to try using the Apple BLAS with `USE_SYSTEM_BLAS=1`.
I wonder what other profilers do with stuff like `gettimeofday`.
The `libunwind` manpage seems to warn against calling stuff from a signal handler for thread-safety reasons. Would that be an issue here, since OpenBLAS uses threads? Perhaps we can try compiling OpenBLAS single-threaded.
I have tried all of the above things with OpenBLAS and it still segfaults. I am out of ideas now.
I get a segfault with Apple BLAS too.
```
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xffffffffffffffef
0x00007fff89f1965c in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
(gdb) bt
#0  0x00007fff89f1965c in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
#1  0x00007fff89f1958f in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncoding ()
#2  0x00007fff89f1950e in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step ()
#3  0x0000000100097cd6 in rec_backtrace ()
#4  0x0000000100ef3c10 in sprofile_bt (dummy=14) at profile.c:27
#5  <signal handler called>
#6  0x000000010481d320 in zaxpy_ ()
#7  0x0000000104f0d4f9 in zhetd2_ ()
```
I've been trying to see what gperftools does for stack unwinding. It looks like they switch between some code that requires frame pointers and libunwind. See line 58 of: https://code.google.com/p/gperftools/source/browse/trunk/src/stacktrace_config.h
So, maybe frame pointers are not required by libunwind?
Compiling OpenBLAS with frame pointers did not help, so perhaps that is not why we are seeing the crash. It is potentially just a bug in libunwind.
Has anyone tried this on linux? Do we get a segfault on linux too?
It works just fine on Linux. If we can construct a simplified segfaulting example, then we can submit a bug report to Apple.
I keep running into an issue where running `@sprofile` on the same function several times causes Julia to segfault. I am not even sure where to start in finding the origin of this issue. I am running Mac OS 10.7.5. I rebuild Julia daily, and this problem has existed for at least a few weeks. I also rebuilt the Profile.jl library with `Pkg.runbuildscript("Profile")`. If it would help to post the code I am running, I would be happy to do so.