juliamatlab / mexjulia

embedding Julia in the MATLAB process.
MIT License

Very frequent segfaults on Linux. #32

Closed by twadleigh 5 years ago

twadleigh commented 7 years ago

It doesn't find the right libstdc++ when I try jl.eval('1+1'). If I try to force the loading of the right library by setting LD_PRELOAD to the version of libstdc++ against which julia was linked before launching MATLAB, I get:

fatal: error thrown and no exception handler available.
Base.InitError(mod=:Sys, error=ErrorException("could not load symbol "jl_array_cconvert_cstring":
/usr/local/MATLAB/R2016b/bin/glnxa64/MATLAB: undefined symbol: jl_array_cconvert_cstring"))
rec_backtrace at /home/tracy/prj/julia/src/stackwalk.c:84
record_backtrace at /home/tracy/prj/julia/src/task.c:232
jl_throw at /home/tracy/prj/julia/src/task.c:550
jl_errorf at /home/tracy/prj/julia/src/builtins.c:78
jl_dlerror at /home/tracy/prj/julia/src/dlload.c:69 [inlined]
jl_dlsym at /home/tracy/prj/julia/src/dlload.c:241
unknown function (ip: 0x7fba7d750c54)
_getenv at ./env.jl:40
_hasenv at ./env.jl:41 [inlined]
in at ./env.jl:75 [inlined]
haskey at ./dict.jl:7 [inlined]
__init__ at ./sysinfo.jl:60
unknown function (ip: 0x7fba7d832cd8)
jl_call_method_internal at /home/tracy/prj/julia/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/tracy/prj/julia/src/gf.c:1942
jl_apply at /home/tracy/prj/julia/src/julia.h:1392 [inlined]
jl_module_run_initializer at /home/tracy/prj/julia/src/toplevel.c:83
_julia_init at /home/tracy/prj/julia/src/init.c:742
julia_init at /home/tracy/prj/julia/src/task.c:283
jl_init_with_image at /home/tracy/prj/julia/src/jlapi.c:42
mexFunction at /home/tracy/prj/mexjulia/mexjulia.mexa64 (unknown line)
mexRunMexFile at /usr/local/MATLAB/R2016b/bin/glnxa64/libmex.so (unknown line)
unknown function (ip: 0x7fbbb613f1a2)
unknown function (ip: 0x7fbbb6140344)
_ZN8Mfh_file16dispatch_fh_implEMS_FviPP11mxArray_tagiS2_EiS2_iS2_ at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwm_dispatcher.so (unknown line)
_ZN8Mfh_file11dispatch_fhEiPP11mxArray_tagiS2_ at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwm_dispatcher.so (unknown line)
unknown function (ip: 0x7fbbb227a846)
unknown function (ip: 0x7fbbb227aaaa)
unknown function (ip: 0x7fbbb22e0460)
unknown function (ip: 0x7fbbb1c0692f)
unknown function (ip: 0x7fbbb1c08c3b)
unknown function (ip: 0x7fbbb1c0540f)
unknown function (ip: 0x7fbbb1c00854)
unknown function (ip: 0x7fbbb1c00b68)
unknown function (ip: 0x7fbbb1c0520c)
unknown function (ip: 0x7fbbb1c052e1)
unknown function (ip: 0x7fbbb1cfc687)
unknown function (ip: 0x7fbbb1cfeb2e)
unknown function (ip: 0x7fbbb217d10d)
unknown function (ip: 0x7fbbb2144eaa)
unknown function (ip: 0x7fbbb2144fb2)
unknown function (ip: 0x7fbbb21470d8)
unknown function (ip: 0x7fbbb21bfbbd)
unknown function (ip: 0x7fbbb21bff49)
unknown function (ip: 0x7fbbb46e43da)
_Z8mnParserv at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwbridge.so (unknown line)
unknown function (ip: 0x7fbbb5701242)
unknown function (ip: 0x7fbbb57031cd)
_ZN5boost6detail17task_shared_stateINS_3_bi6bind_tIvPFvRKNS_8functionIFvvEEEENS2_5list1INS2_5valueIS6_EEEEEEvE6do_runEv at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwmcr.so (unknown line)
unknown function (ip: 0x7fbbb5702235)
unknown function (ip: 0x7fbbb5ec9b48)
_ZN5boost6detail8function21function_obj_invoker0ISt8functionIFNS_3anyEvEES4_E6invokeERNS1_15function_bufferE at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwiqm.so (unknown line)
_ZN3iqm18PackagedTaskPlugin7executeEP15inWorkSpace_tagRN5boost10shared_ptrIN14cmddistributor17IIPCompletedEventEEE at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwiqm.so (unknown line)
unknown function (ip: 0x7fbbb5e95a09)
unknown function (ip: 0x7fbbb5e8168f)
unknown function (ip: 0x7fbbb5e84047)
unknown function (ip: 0x7fbbc61a1409)
unknown function (ip: 0x7fbbc61a29ae)
_Z25svWS_ProcessPendingEventsiib at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwservices.so (unknown line)
unknown function (ip: 0x7fbbb57018c5)
unknown function (ip: 0x7fbbb5701c41)
unknown function (ip: 0x7fbbb56ef8d5)
unknown function (ip: 0x7fbbc5138709)
unknown function (ip: 0x7fbbc4e720ae)

pcaday commented 7 years ago

I encountered exactly the same problem with RHEL 7 and Matlab 2015a: the wrong version of libstdc++ is used, and if the correct version is forcibly loaded with LD_PRELOAD, Matlab crashes on jl.eval when the symbol jl_array_cconvert_cstring (from libjulia) cannot be found.

From LD_DEBUG information, the relevant symbol is being referenced from sys.so. Attempting to add the directory containing libjulia.so to LD_LIBRARY_PATH didn't help. If I add libjulia to LD_PRELOAD, however, the error no longer occurs -- unfortunately, jl.eval hangs instead.
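The error text names the MATLAB executable as the place where the symbol was not found, which suggests the lookup goes through the main program's handle, that is, the global symbol scope. A minimal diagnostic sketch of that kind of lookup follows. It assumes it is compiled into code running inside the affected process; `symbol_globally_visible` is a hypothetical helper name, not part of mexjulia or Julia.

```cpp
#include <dlfcn.h>

// Hypothetical diagnostic helper (not from mexjulia): returns true if `name`
// can be resolved through the main program's handle. That handle searches the
// executable, its load-time dependencies, and any library that was opened
// with RTLD_GLOBAL. Libraries opened with RTLD_LOCAL are not searched.
bool symbol_globally_visible(const char* name) {
    // Passing nullptr as the path yields a handle for the main program.
    void* self = dlopen(nullptr, RTLD_LAZY);
    if (self == nullptr) {
        return false;
    }
    void* addr = dlsym(self, name);
    dlclose(self);
    return addr != nullptr;
}
```

Calling `symbol_globally_visible("jl_array_cconvert_cstring")` from the mex file before the runtime is initialized would indicate whether libjulia's exports are reachable this way. If libjulia was loaded with local scope, the expectation is that the call returns false.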

twadleigh commented 7 years ago

@pcaday thanks for the report. I'm not sure how to proceed without tweaking the Julia build. I think the next thing I might try is to rebuild julia with the compiler that MATLAB wants for its mex files.

pcaday commented 7 years ago

Debugging further, with both libstdc++ and libjulia in LD_PRELOAD, the initialization (jl_init_with_image) succeeds. It appeared from LD_DEBUG output that an error was being thrown from julia somewhere in boot.jl.

After disabling the output redirection in Mex.jl, that error disappeared, but then I got an OutOfMemoryException when the actual Julia command was being run. I'm not sure exactly why that's occurring, but if I disable jl_mex_outer's interrupt checking (by having jl_mex go straight to jl_mex_inner), jl.eval succeeds.

pcaday commented 7 years ago

Also, problems with libstdc++ version incompatibility seem to be an issue that has come up for Matlab in several contexts (not just with mexjulia). For the problem with sys.so, it might be possible to fix this by adding the directory containing libjulia to its RPATH... I'll check this.

pcaday commented 7 years ago

Apparently including libjulia's directory in the RPATH for sys.so does not correct the problem with loading jl_array_cconvert_cstring... no idea why not (I'm new to this...)

ufechner7 commented 7 years ago

Could rebuilding Julia from source fix this problem?

ufechner7 commented 7 years ago

OK, I tried, and it does not help. Error message when executing jl.eval('1+1'):

ufechner@TUD277255:~/00Software/mexjulia$ matlab
fatal: error thrown and no exception handler available.
Base.InitError(mod=:Sys, error=ErrorException("could not load symbol "jl_array_cconvert_cstring":
/usr/local/MATLAB/R2016b/bin/glnxa64/MATLAB: undefined symbol: jl_array_cconvert_cstring"))
rec_backtrace at /home/ufechner/julia/src/stackwalk.c:84
record_backtrace at /home/ufechner/julia/src/task.c:232
jl_throw at /home/ufechner/julia/src/task.c:550
jl_errorf at /home/ufechner/julia/src/builtins.c:78
jl_dlerror at /home/ufechner/julia/src/dlload.c:69 [inlined]
jl_dlsym at /home/ufechner/julia/src/dlload.c:241
unknown function (ip: 0x7f8a81941c14)
_getenv at ./env.jl:40
_hasenv at ./env.jl:41 [inlined]
in at ./env.jl:75 [inlined]
haskey at ./dict.jl:7 [inlined]
__init__ at ./sysinfo.jl:60
unknown function (ip: 0x7f8a81a26328)
jl_call_method_internal at /home/ufechner/julia/src/julia_internal.h:189 [inlined]
jl_apply_generic at /home/ufechner/julia/src/gf.c:1945
jl_apply at /home/ufechner/julia/src/julia.h:1392 [inlined]
jl_module_run_initializer at /home/ufechner/julia/src/toplevel.c:83
_julia_init at /home/ufechner/julia/src/init.c:742
julia_init at /home/ufechner/julia/src/task.c:283
jl_init_with_image at /home/ufechner/julia/src/jlapi.c:42
mexFunction at /home/ufechner/00Software/mexjulia/mexjulia.mexa64 (unknown line)
mexRunMexFile at /usr/local/MATLAB/R2016b/bin/glnxa64/libmex.so (unknown line)
unknown function (ip: 0x7f8b454991a2)
unknown function (ip: 0x7f8b4549a344)
_ZN8Mfh_file16dispatch_fh_implEMS_FviPP11mxArray_tagiS2_EiS2_iS2_ at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwm_dispatcher.so (unknown line)
_ZN8Mfh_file11dispatch_fhEiPP11mxArray_tagiS2_ at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwm_dispatcher.so (unknown line)
unknown function (ip: 0x7f8b415d4846)
unknown function (ip: 0x7f8b415d4aaa)
unknown function (ip: 0x7f8b4163a460)
unknown function (ip: 0x7f8b40f6092f)
unknown function (ip: 0x7f8b40f62c3b)
unknown function (ip: 0x7f8b40f5f40f)
unknown function (ip: 0x7f8b40f5a854)
unknown function (ip: 0x7f8b40f5ab68)
unknown function (ip: 0x7f8b40f5f20c)
unknown function (ip: 0x7f8b40f5f2e1)
unknown function (ip: 0x7f8b41056687)
unknown function (ip: 0x7f8b41058b2e)
unknown function (ip: 0x7f8b414d710d)
unknown function (ip: 0x7f8b4149eeaa)
unknown function (ip: 0x7f8b4149efb2)
unknown function (ip: 0x7f8b414a10d8)
unknown function (ip: 0x7f8b41519bbd)
unknown function (ip: 0x7f8b41519f49)
unknown function (ip: 0x7f8b43a3e3da)
_Z8mnParserv at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwbridge.so (unknown line)
unknown function (ip: 0x7f8b44a5b242)
unknown function (ip: 0x7f8b44a5d1cd)
_ZN5boost6detail17task_shared_stateINS_3_bi6bind_tIvPFvRKNS_8functionIFvvEEEENS2_5list1INS2_5valueIS6_EEEEEEvE6do_runEv at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwmcr.so (unknown line)
unknown function (ip: 0x7f8b44a5c235)
unknown function (ip: 0x7f8b45223b48)
_ZN5boost6detail8function21function_obj_invoker0ISt8functionIFNS_3anyEvEES4_E6invokeERNS1_15function_bufferE at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwiqm.so (unknown line)
_ZN3iqm18PackagedTaskPlugin7executeEP15inWorkSpace_tagRN5boost10shared_ptrIN14cmddistributor17IIPCompletedEventEEE at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwiqm.so (unknown line)
unknown function (ip: 0x7f8b451efa09)
unknown function (ip: 0x7f8b451db68f)
unknown function (ip: 0x7f8b451de047)
unknown function (ip: 0x7f8b5547e409)
unknown function (ip: 0x7f8b5547f9ae)
_Z25svWS_ProcessPendingEventsiib at /usr/local/MATLAB/R2016b/bin/glnxa64/libmwservices.so (unknown line)
unknown function (ip: 0x7f8b44a5b8c5)
unknown function (ip: 0x7f8b44a5bc41)
unknown function (ip: 0x7f8b44a498d5)
unknown function (ip: 0x7f8b540946b9)
unknown function (ip: 0x7f8b53dca82c)
ufechner@TUD277255:~/00Software/mexjulia$ 

datnamer commented 7 years ago

There was a potentially helpful reply in your Discourse thread. Did it fix the problem?

ufechner7 commented 7 years ago

The comment was: "According to http://stackoverflow.com/questions/5044993/typeinfo-shared-libraries-and-dlopen-without-rtld-global3 matlab does not use RTLD_GLOBAL." I do not see how this comment is related to my problem. I am not using RTLD_GLOBAL.

twadleigh commented 7 years ago

@ufechner7 I don't really understand the text of the comment itself either, but the link looks like it might have some clues about things to try. I'm hoping I'll have some time in the next couple of weeks to take another stab at this issue. If you or others try to make headway, I'll do what I can to support your efforts.

ihnorton commented 7 years ago

Try calling dlopen("[path/to/libjulia]", RTLD_GLOBAL) from the mex code, before doing anything else with libjulia.

(I tried to debug/test myself, but the main GCC is too old on the cluster where I have access to linux/matlab -- and the newer ones advertised as loadable by the sysadmin are missing dependencies)

twadleigh commented 7 years ago

Thanks @ihnorton. This seems to have gotten me over the hump. I had to OR in RTLD_LAZY as well to prevent crashing. I don't yet have it completely working, but at least what I'm seeing now is an unhandled Julia exception, which means the runtime appears to be loading.

twadleigh commented 7 years ago

I'm making headway, but now I'm seeing that:

macro mx_test_is(fun)
    :( ccall($(fun)::Ptr{Void}, Bool, (Ptr{Void},), mx.ptr) )
end

is_double(mx::MxArray) = @mx_test_is(_mx_is_double)
...

is not making well-defined functions. I'm seeing:

UndefVarError: mx not defined
 in is_numeric at /home/tracy/prj/matlab/mexjulia/jl/mxarray.jl:190 [inlined]

It seems the mx in the macro isn't being captured in the function definition. If I write out the function definition, avoiding the macro, I get past this one.

I'm not seeing this on windows. It might also be because I'm currently testing against master.

ihnorton commented 7 years ago

There was a recent change impacting macro hygiene. I think you might need to escape 'mx'.

ihnorton commented 7 years ago

Ref: https://discourse.julialang.org/t/def-macro-generator-broken-on-master/1096/2

twadleigh commented 7 years ago

I just verified that this doesn't happen with release-0.5. Thanks for the heads-up.

ufechner7 commented 7 years ago

So this means, everything is working now on Ubuntu?

ufechner7 commented 7 years ago

In which file (at which location) would you insert the line: dlopen("[path/to/libjulia]", RTLD_GLOBAL) ?

twadleigh commented 7 years ago

> So this means, everything is working now on Ubuntu?

No, but it does mean that I've at least gotten past the shared library fiasco. I've had to disable output redirection and interrupt checking, and it still seems to want to crash all the time, but I have at least gotten a simple evaluation to work once (with release-0.5). I'm in the middle of patching the macro stuff so that it can work with master. After that, I'll try and see if I can turn output redirection and interrupt checking back on.

> In which file (at which location) would you insert the line: dlopen("[path/to/libjulia]", RTLD_GLOBAL) ?

It goes in mexjulia.cpp right before runtime initialization.
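A minimal sketch of what such a call could look like, combining the RTLD_GLOBAL suggestion with the RTLD_LAZY flag mentioned earlier in the thread. This is not the code that was committed to mexjulia.cpp. The helper name `load_libjulia_global` and its path argument are hypothetical, and the real library path depends on the Julia installation.

```cpp
#include <dlfcn.h>
#include <cstdio>

// Hypothetical helper (not taken from mexjulia.cpp): open libjulia so that its
// exported symbols join the process-wide symbol scope.
// `libjulia_path` is a placeholder supplied by the caller.
void* load_libjulia_global(const char* libjulia_path) {
    // RTLD_GLOBAL: make the library's symbols available for resolving
    //              references from libraries loaded afterwards.
    // RTLD_LAZY:   resolve function symbols on first use. dlopen needs
    //              either RTLD_LAZY or RTLD_NOW in its flags.
    void* handle = dlopen(libjulia_path, RTLD_LAZY | RTLD_GLOBAL);
    if (handle == nullptr) {
        // dlerror() describes the most recent dlopen/dlsym failure.
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    }
    return handle;
}
```

The idea is to call this once at the start of the mex entry point, before any other libjulia function is used, and to skip runtime initialization if it returns a null handle.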

twadleigh commented 7 years ago

I pushed a commit to master that fixes the issue with loading libjulia on Linux as well as accommodating the recent fixes with macro hygiene. I re-enabled interrupt checking and output redirection and verified both were basically working.

However, it is still unusably crashy. Frequent segfaults. I suspect either MATLAB and Julia not playing nicely w.r.t. signal handling or a continuation of JuliaLang/julia#19401.

~Meanwhile, testing the latest master on windows is giving errors, which I am currently working to mitigate.~ (Edit: down to user error.)

twadleigh commented 7 years ago

Master works as I would expect on Linux, except, of course, for the constant segfaults. I'll rename this issue accordingly.

Unfortunately, I expect a lot of work will be required to ultimately resolve this issue, with the bulk of it needing to be done in Julia itself by someone with skills that I don't have.

Near term, the task is to investigate and describe the issue as well as possible for reporting upstream.

ufechner7 commented 7 years ago

Test on Ubuntu 16.04, Matlab 2016b: jl.eval('1+1') sometimes works. It also works repeatedly, as long as I don't use the up key. But Matlab crashes as soon as I click on the editor window of Matlab.

ufechner7 commented 7 years ago

Some links that could help: https://nl.mathworks.com/help/matlab/matlab_external/debugging-on-linux-platforms.html

I tried this and got the message:

For online documentation, see http://www.mathworks.com/support
For product information, visit www.mathworks.com.

[New Thread 0x7fffbf4a7700 (LWP 5460)]
>> dbmex on
>> jl.eval('1+1')

MEX FILE: /home/ufechner/mexjulia/mexjulia.mexa64 entry point located at address 0xbeaa5094
Add breakpoints at the debugger prompt and issue a "continue" to resume 
execution of MATLAB.

Thread 1 "MATLAB" received signal SIGUSR1, User defined signal 1.
pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185 ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) 

When I googled this error message, I found the following link: https://github.com/etcimon/libasync/issues/7

With the remark:

"I think I found the issue. The threads were being forcefully closed during the wait condition: you're missing a call to destroyAsyncThreads, add in sass.d:

static ~this() { import libasync.threads : destroyAsyncThreads; destroyAsyncThreads(); }

This fixed the segfault for me on Ubuntu =)"

Perhaps this is the issue here, too?

twadleigh commented 7 years ago

Hmm, that looks like a good lead, @ufechner7. I'll investigate further. Thanks!

ufechner7 commented 7 years ago

Any progress on this issue? Does it make more sense to test against Julia 0.5 or against the unstable 0.6 version? Or against 0.5 git (coming 0.5.1)?

ufechner7 commented 5 years ago

Why did you close this issue? Is it fixed?

twadleigh commented 5 years ago

No, it's not fixed. I closed it because, as I'm its author, it shows up on my list of open issues on GitHub. I don't (can't) work on this project anymore, and wanted to reduce the noise on that list.

If you would like, feel free to reopen, though.