Segfaults - Githubissues

blakejohnson commented 11 years ago

I keep running into an issue where running @sprofile on the same function several times causes Julia to segfault. I am not even sure where to start in finding the origin of this issue. I am running Mac OS 10.7.5. I rebuild Julia daily, and this problem has existed at least for a few weeks. I also rebuilt the Profile.jl library with Pkg.runbuildscript("Profile").

If it would help to post the code I am running, I would be happy to do so.

timholy commented 11 years ago

Thanks for the report. I've never seen this, so if you really have no objections it would be useful for me to try your code. It sounds like you've tried the obvious things. Did it used to work without a segfault, or is it simply that you've only been using @sprofile in this fashion recently?

This might be an OSX vs Linux thing, of course. Since I was writing it on Linux I went with its most modern timer API, then later learned that it's not available on BSD. But Linux can also use the same older timer API as OSX. So if I can't reproduce the segfault I'll compile a special version and test the OSX timers. I did that when I first developed the OSX support, but I bet I didn't give it the same kind of thorough workout you're giving it.

blakejohnson commented 11 years ago

Sure, the code is here: https://gist.github.com/blakejohnson/5600044

Running this causes my machine to segfault:

require("SimSpeedTest.jl")
require("Profile")
using SProfile
@sprofile fot ct=1:10 run_sim(); end

timholy commented 11 years ago

I'll definitely have to try the BSD timers. I was able to run that last line ~6 times without issue.

Hopefully this weekend.

blakejohnson commented 11 years ago

I'll also have a look on another computer to see if it is something peculiar to my work machine.

--Blake

On Fri, May 17, 2013 at 3:37 PM, Tim Holy notifications@github.com wrote:

I'll definitely have to try the BSD timers. I was able to run that last line ~6 times without issue.

Hopefully this weekend.

— Reply to this email directly or view it on GitHubhttps://github.com/timholy/Profile.jl/issues/14#issuecomment-18081921 .

blakejohnson commented 11 years ago

FWIW, here is an example backtrace upon segfault:

#0  0x00007fff8916a2b2 in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
#1  0x00007fff8916a1fd in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncoding ()
#2  0x00007fff8916a171 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step ()
#3  0x0000000100097a66 in rec_backtrace ()
#4  0x00000001018f2d18 in sprofile_bt ()
#5  <signal handler called>
#6  0x00000001048259d1 in zhemv_ ()
#7  0x00000001050357a7 in zhetd2_ ()

blakejohnson commented 11 years ago

Ok, I can confirm the issue on another Mac, this one with OS 10.8.2. The segfault backtrace is basically the same.

timholy commented 11 years ago

OK, I commented out the Linux timers and rebuilt. I still don't get the segfault. Then I ran it under valgrind as valgrind --track-origins=yes /home/tim/src/julia/usr/bin/julia-debug-readline.

I got these errors: ==1537== Thread 2: ==1537== Syscall param msync(start) points to unaddressable byte(s) ==1537== at 0x50684CD: ??? (syscall-template.S:82) ==1537== by 0x5E69ADE: msync_validate (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E69C0B: validate_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E69D48: access_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E67A2C: dwarf_get (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E67CEA: _ULx86_64_access_reg (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E675AA: _ULx86_64_get_reg (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E6C9FB: apply_reg_state (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E6D1AA: _ULx86_64_dwarf_find_save_locs (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E691BB: _ULx86_64_dwarf_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E67E97: _ULx86_64_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x53A5987: rec_backtrace (task.c:506) ==1537== Address 0xd99d000 is not stack'd, malloc'd or (recently) free'd ==1537== ==1537== Syscall param msync(start) points to uninitialised byte(s) ==1537== at 0x50684CD: ??? (syscall-template.S:82) ==1537== by 0x5E69ADE: msync_validate (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E69C0B: validate_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E69D48: access_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E6AB8C: dwarf_get (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E6CC5A: apply_reg_state (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E6D1AA: _ULx86_64_dwarf_find_save_locs (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E691BB: _ULx86_64_dwarf_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x5E67E97: _ULx86_64_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so) ==1537== by 0x53A5987: rec_backtrace (task.c:506) ==1537== by 0xE3FE8B7: sprofile_bt (profile.c:27) ==1537== by 0x646E49F: ??? (in /lib/x86_64-linux-gnu/libc-2.15.so) ==1537== Address 0xd99e000 is on thread 2's stack ==1537== Uninitialised value was created by a stack allocation ==1537== at 0x53A58E1: rec_backtrace (task.c:499) ==1537==

I don't know what to do here. Looks like it could be a bug in libunwind. CCing @JeffBezanson.

blakejohnson commented 11 years ago

I also get the feeling that it is a libunwind bug.

blakejohnson commented 11 years ago

On the Mac, valgrind crashes before it finishes loading the Julia REPL. So, I can't show similar output...

timholy commented 11 years ago

Hmm, that's not good!

Sounds like one of us should submit a bug report to libunwind. Do you do C?

blakejohnson commented 11 years ago

Right. Reducing this to a bug report that doesn't require Julia is the trick. Unfortunately, it also looks like Apple's updates to libunwind have not been accepted upstream. So, do we submit a bug with Apple or the libunwind project?

Ugh.

JeffBezanson commented 11 years ago

I believe the reason libunwind is calling msync here is just to check its return code to see if a memory region is valid. So it is expected that it will sometimes pass invalid addresses to msync.

timholy commented 11 years ago

Right. Reducing this to a bug report that doesn't require Julia is the trick. Unfortunately, it also looks like Apple's updates to libunwind have not been accepted upstream. So, do we submit a bug with Apple or the libunwind project?

Dunno. I assume we get "ours" straight from libunwind? In that case, I'd consider starting with libunwind and see where that gets us.

timholy commented 11 years ago

@JeffBezanson, thanks for checking. Obviously it might be more informative to run it on the mac, if valgrind weren't crashing.

blakejohnson commented 11 years ago

It turns out that we only build libunwind on linux and FreeBSD. So, this must be a bug in Apple's implementation. I think Apple still runs their radar bug tracker. So, I guess I will file there.

timholy commented 11 years ago

Thanks for continuing to pursue this! Profile users will owe you their thanks.

timholy commented 11 years ago

CC @ViralBShah.

ViralBShah commented 11 years ago

We can certainly build a libunwind on OS X, if there is a patched version that works.

blakejohnson commented 11 years ago

So, I've been looking more into this over the last few days. It's difficult to debug because libunwind is very low-level. I've only really unearthed two new pieces of information:

The crash point always has OpenBLAS on the stack.
The few other similar reports I can find suggest that the problem could be using inconsistent versions of libstdc++. So, I wondered if gfortran is linking to a different libstdc than clang.

timholy commented 11 years ago

That does sound like progress. Viral, you know a lot about Julia's build process, is this theory likely?

ViralBShah commented 11 years ago

I do not have any more insight here. Since gfortran is provided by the user, I can imagine that different builds link to different libraries. I personally have now moved towards using brew for all the dependencies to build julia. OpenBLAS and gfortran do not seem to be linking to anything weird though:

 otool -L libopenblas.dylib 
libopenblas.dylib:
    @rpath/libopenblas.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgfortran.3.dylib (compatibility version 4.0.0, current version 4.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)
    /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libquadmath.0.dylib (compatibility version 1.0.0, current version 1.0.0)

OpenBLAS does lots of tricks under the hood, and perhaps @xianyi can tell us if it is possible that it is likely to interfere with profiling.

Also, the last I had checked (probably a year ago, when we got backtrace), the upstream libunwind does not build on the mac, and you have to use Apple's. Perhaps that has changed now.

ViralBShah commented 11 years ago

I notice that libprofile is built on gcc with mac, instead of with clang. Could this be an issue?

ViralBShah commented 11 years ago

I built with clang and still get the segfault.

blakejohnson commented 11 years ago

If you then do otool -L /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgfortran.3.dylib you will notice that it links to '/usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgcc_s.1.dylib'. Whereas libjulia-release.dylib links to /usr/lib/libstdc++.6.dylib. Is this a problem?

I'll check what this looks like on linux.

blakejohnson commented 11 years ago

On Ubuntu 12.04:

ldd libopenblas.so
    linux-gate.so.1 =>  (0xb76f8000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb6a02000)
    libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb69e7000)
    libgfortran.so.3 => /usr/lib/i386-linux-gnu/libgfortran.so.3 (0xb68ee000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb6744000)
    /lib/ld-linux.so.2 (0xb76f9000)
    libquadmath.so.0 => /usr/lib/i386-linux-gnu/libquadmath.so.0 (0xb66cc000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb66ae000)

ldd /usr/lib/i386-linux-gnu/libgfortran.so.3
    linux-gate.so.1 =>  (0xb76ed000)
    libquadmath.so.0 => /usr/lib/i386-linux-gnu/libquadmath.so.0 (0xb7561000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7535000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7516000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb736c000)
    /lib/ld-linux.so.2 (0xb76ee000)

ldd libjulia-release.so 
    linux-gate.so.1 =>  (0xb77bf000)
    libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xb681a000)
    librt.so.1 => /lib/i386-linux-gnu/librt.so.1 (0xb6811000)
    libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb67f5000)
    libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb670c000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb66e0000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb66c2000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb6518000)
    /lib/ld-linux.so.2 (0xb77c0000)

So, both libgfortran and libjulia link to libgcc_s.so.1. This is at least different from the mac.

ViralBShah commented 11 years ago

That doesn't seem to be a problem to me. The gfortran is not provided with XCode and hence it brings its own version of libgcc. libjulia needs libstdc++ which comes from the C toolchain. It seems that the linking and loading is happening correctly.

blakejohnson commented 11 years ago

Alright, I guess it is time to look in a different direction. My test script spends most of its time in OpenBLAS, so perhaps it is just a coincidence that that I never seen it crash outside of OpenBLAS. I'll try some of the perf2 demos that are pure Julia and see if I can get the profiler to crash.

blakejohnson commented 11 years ago

Well, I cannot get the laplace or Go benchmarks to segfault during profiling. So, perhaps OpenBLAS or ccall are important.

ViralBShah commented 11 years ago

@xianyi Any wisdom from you here will be useful. This is a blocker for having a sampling profiler as part of julia base, since it crashes when openblas is executing.

xianyi commented 11 years ago

Hi @blakejohnson , @ViralBShah ,

Could you try the latest develop branch? We fixed some bugs in level-2 BLAS.

Is it multi-threaded OpenBLAS? Could you try the single thread? Or, build OpenBLAS with USE_OPENMP=1

Xianyi

blakejohnson commented 11 years ago

The problem exists with both OPENBLAS_NUM_THREADS=1 and values greater than 1. @ViralBShah what's the easiest way to build agains the OpenBLAS develop branch? We pull in OpenBLAS as a tar.gz file rather than as a submodule. I suppose we just need to point to the appropriate zip file...

ViralBShah commented 11 years ago

I believe if you change the version to develop in Versions.make, it may work. Haven't tried this in a while.

blakejohnson commented 11 years ago

Changing the version to develop does work. Building against the develop branch of OpenBLAS changed the behavior a little bit. Whereas my test script would always cause a segfault, it now occasionally segfaults. When it does segfault, zhemv_ is the last OpenBLAS call on the stack.

blakejohnson commented 11 years ago

So, is there a chance we are looking at some subtle conflict between Apple's libunwind and OpenBLAS? I suggest it only because libunwind seems to work by inspecting certain registers, and OpenBLAS contains hand-rolled assembly code which might modify registers in a way that is unexpected by libunwind.

ViralBShah commented 11 years ago

That is what I suspect - but that is a rather hand wavy explanation.

xianyi commented 11 years ago

However, zhemv_ is written in C. OpenBLAS/interface/zhemv.c

blakejohnson commented 11 years ago

Fair enough. I have also seen in segfault in zaxpy_, zdscal_, and zheevr_.

ViralBShah commented 11 years ago

Could it have to do with openblas possibly having been compiled with -fomit-frame-pointer?

ViralBShah commented 11 years ago

As a test case, could you try profiling code that does not link to BLAS, but calls other C libraries - say fft?

blakejohnson commented 11 years ago

Looks like I can also get a segfault while profiling fft. It happens much less frequently, though. I'm trying to pull a backtrace now.

ViralBShah commented 11 years ago

This is why the julia src is compiled with -fno-omit-frame-pointer, so that libunwind can work correctly with it. Probably, if we want profiling to work with all the libraries, they all will need this option too. It's a bit of work, but shouldn't be too difficult.

What about trying the same with rand?

blakejohnson commented 11 years ago

It took quite a few iterations in gdb, but finally got a segfault and backtrace. Strangely, it didn't bail in FFTW:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xfffffffffffffff0
0x00007fff8def02b2 in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
(gdb) bt
#0  0x00007fff8def02b2 in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
#1  0x00007fff8def01fd in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncoding ()
#2  0x00007fff8def0171 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step ()
#3  0x0000000100099646 in rec_backtrace ()
#4  0x0000000100ff3d18 in sprofile_bt ()
#5  <signal handler called>
#6  0x00007fff8e8791c6 in gettimeofday ()

At least, I don't think gettimeofday is in FFTW.

timholy commented 11 years ago

@ViralBShah, this seems like great detective work. But things like gettimeofday probably can't be compiled with -fno-omit-frame-pointer.

ViralBShah commented 11 years ago

Interestingly, rand does not bail either, and it is compiled with -fomit-frame-pointer, as is fftw. I am still inclined to try build OpenBLAS with this flag or in debug mode. It is likely that we are not getting an accurate stacktrace from libopenblas since it is compiled with optimizations.

Another alternative would be to try using the Apple BLAS with USE_SYSTEM_BLAS=1.

I wonder what other profilers do with stuff like gettimeofday.

ViralBShah commented 11 years ago

The libunwind manpage seems to warn against calling stuff from a signal handler for thread safety. Would that be an issue here, since openblas uses threads? Perhaps we can try by compiling openblas single-threaded.

ViralBShah commented 11 years ago

I have tried all of the above things with OpenBLAS and it still segfaults. I am out of ideas now.

ViralBShah commented 11 years ago

I get a segfault with Apple BLAS too.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xffffffffffffffef
0x00007fff89f1965c in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
(gdb) bt
#0  0x00007fff89f1965c in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
#1  0x00007fff89f1958f in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncoding ()
#2  0x00007fff89f1950e in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step ()
#3  0x0000000100097cd6 in rec_backtrace ()
#4  0x0000000100ef3c10 in sprofile_bt (dummy=14) at profile.c:27
#5  <signal handler called>
#6  0x000000010481d320 in zaxpy_ ()
#7  0x0000000104f0d4f9 in zhetd2_ ()

blakejohnson commented 11 years ago

I've been trying to see what gperftools does for stack unwinding. It looks like they switch between some code that requires frame pointers and libunwind. See line 58 of: https://code.google.com/p/gperftools/source/browse/trunk/src/stacktrace_config.h

So, maybe frame pointers are not required by libunwind?

ViralBShah commented 11 years ago

Compiling openblas with frame pointers did not help - so perhaps that is not why we are seeing the crash. It is potentially just a bug in libunwind.

Has anyone tried this on linux? Do we get a segfault on linux too?

blakejohnson commented 11 years ago

It works just fine on Linux. If we can construct a simplified segfaulting example, then we can submit a bug report to Apple.

On Saturday, June 15, 2013 at 8:29 AM, Viral B. Shah wrote:

Compiling openblas with frame pointers did not help - so perhaps that is not why we are seeing the crash. It is potentially just a bug in libunwind. Has anyone tried this on linux? Do we get a segfault on linux too?

— Reply to this email directly or view it on GitHub (https://github.com/timholy/Profile.jl/issues/14#issuecomment-19495764).

timholy / IProfile.jl

Segfaults #14