opensim-org / opensim-core

SimTK OpenSim C++ libraries and command-line applications, and Java/Python wrapping.
https://opensim.stanford.edu
Apache License 2.0

Potential MATLAB bug w/TableReporter class #2928

Open y-lai opened 3 years ago

y-lai commented 3 years ago

Hi dev team,

I'm working with the MATLAB bindings for OpenSim 4.1 on a Linux system (Ubuntu 18.04). During installation, I was able to build, make, and make install all the source files from the 4.1 release. However, when running ctest, a few MATLAB-specific tests do not pass. The tests that continue to fail are:

108:Matlab_wiringInputsAndOutputsWithTableReporter
109:Matlab_RunHopper_answers
110:Matlab_RunHopperWithDevice_answers
118:Matlab_testWalkerScripts

The MATLAB crash dumps from these failed tests are attached with prefixes of ctest_Matlab_xxx:

ctest_Matlab_RunHopper_answers_MATLABcrashdump3399-1.txt
ctest_Matlab_RunHopperWithDevice_answers_MATLABcrashdump3392-1.txt
ctest_Matlab_wiringInputsAndOutputsWithTableReporter_MATLABcrashdump3391-1.txt

The full log for the whole ctest run is in LastTest.log.

I've tried to isolate whether the issue is caused by my software or hardware, so I tested on multiple machines and kernels. All of the errors seem to be related to Intel Math Kernel Library (MKL) functions, specifically the LAPACK getrf function, so I also tested different MKL versions under different Ubuntu kernels. The MATLAB (R2018b) logs are all from this device/software: Linux ylai-ubuntu 4.15.0-91-generic, MKL version:

Intel(R) Math Kernel Library Version 2018.0.3 Product Build 20180406 for Intel(R) 64 architecture applications, CNR branch AVX2
Linear Algebra PACKage Version 3.7.0

A colleague also ran the tests and got the same failures in the same 4 MATLAB ctests. Their device used MATLAB R2018b on kernel Linux laptop 4.15.0-91-generic, MKL version:

Intel(R) Math Kernel Library Version 2018.0.3 Product Build 20180406 for Intel(R) 64 architecture applications, CNR branch AVX2
Linear Algebra PACKage Version 3.7.0

Further tests with a different MATLAB version (R2020b) and a different Linux kernel failed on the same ctests. Linux ylai-ubuntu 5.4.0-59-generic, MKL version:

Intel(R) Math Kernel Library Version 2019.0.3 Product Build 20190125 for Intel(R) 64 architecture applications, CNR branch AVX2

Looking into the test .m files themselves, I noticed one common theme in the 4 scripts: each creates a TableReporter instance. The other MATLAB ctests pass, and none of them use TableReporter. When running these simulations, MATLAB crashes when I press 'Any Key' to start the simulation in the Simbody visualizer. The error files from the MATLAB crash reports are below:

RunHopper_answers_matlab_crash_dump.9822-1.txt
RunHopperWithDevice_answers_matlab_crash_dump.10448-1.txt
WiringInputsAndOutputsWTableReporter_matlab_crash_dump.11054-1.txt
WiringInputsAndOutputsWTableReporter_matlab_crash_dump.11054-2.txt

The ldd output for the OpenSim JNI library is below:

y-lai@ylai-ubuntu:~/opensim-core/lib$ ldd libosimJavaJNI.so 
    linux-vdso.so.1 (0x00007ffe37d65000)
    libosimTools.so => /home/y-lai/opensim-core/lib/libosimTools.so (0x00007f7a60318000)
    libosimExampleComponents.so => /home/y-lai/opensim-core/lib/libosimExampleComponents.so (0x00007f7a600b2000)
    libosimAnalyses.so => /home/y-lai/opensim-core/lib/libosimAnalyses.so (0x00007f7a5fd1f000)
    libosimActuators.so => /home/y-lai/opensim-core/lib/libosimActuators.so (0x00007f7a5f9b3000)
    libosimSimulation.so => /home/y-lai/opensim-core/lib/libosimSimulation.so (0x00007f7a5f0f8000)
    libosimCommon.so => /home/y-lai/opensim-core/lib/libosimCommon.so (0x00007f7a5eb7f000)
    libosimLepton.so => /home/y-lai/opensim-core/lib/libosimLepton.so (0x00007f7a5e946000)
    libSimTKsimbody.so.3.7 => /home/y-lai/opensim-core/lib/libSimTKsimbody.so.3.7 (0x00007f7a5e28c000)
    libSimTKmath.so.3.7 => /home/y-lai/opensim-core/lib/libSimTKmath.so.3.7 (0x00007f7a5dc42000)
    libSimTKcommon.so.3.7 => /home/y-lai/opensim-core/lib/libSimTKcommon.so.3.7 (0x00007f7a5d6ee000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f7a5d4cf000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f7a5d146000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f7a5cda8000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f7a5cb90000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f7a5c79f000)
    liblapack.so.3 => /usr/lib/x86_64-linux-gnu/liblapack.so.3 (0x00007f7a5bf01000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f7a5bcf9000)
    libBTKCommon.so.0.4dev => /home/y-lai/opensim-core/lib/libBTKCommon.so.0.4dev (0x00007f7a5ba9f000)
    libBTKBasicFilters.so.0.4dev => /home/y-lai/opensim-core/lib/libBTKBasicFilters.so.0.4dev (0x00007f7a5b831000)
    libBTKIO.so.0.4dev => /home/y-lai/opensim-core/lib/libBTKIO.so.0.4dev (0x00007f7a5b474000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f7a5b270000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f7a61544000)
    libblas.so.3 => /usr/lib/x86_64-linux-gnu/libblas.so.3 (0x00007f7a5b003000)
    libgfortran.so.4 => /usr/lib/x86_64-linux-gnu/libgfortran.so.4 (0x00007f7a5ac24000)
    libquadmath.so.0 => /usr/lib/x86_64-linux-gnu/libquadmath.so.0 (0x00007f7a5a9e4000)
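To narrow a listing like the one above down to its linear-algebra dependencies, a small filter can help. This is only a sketch: the helper name `blas_lapack_deps` is hypothetical, and ldd reports what the dynamic linker resolves for a standalone load. Inside a running MATLAB process, libraries that MATLAB has already loaded may end up supplying the same symbols.

```shell
# blas_lapack_deps (hypothetical helper): read `ldd` output on stdin and
# print "soname path" for every BLAS/LAPACK/OpenBLAS/MKL-like dependency.
blas_lapack_deps() {
    awk '$2 == "=>" && $1 ~ /^(lib)?(blas|lapack|openblas|mkl)/ { print $1, $3 }'
}

# Example usage (path taken from the listing above):
#   ldd ~/opensim-core/lib/libosimJavaJNI.so | blas_lapack_deps
```

For the listing above this would report only the liblapack.so.3 and libblas.so.3 entries under /usr/lib/x86_64-linux-gnu, i.e. no MKL dependency is recorded in the OpenSim library itself.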

It would be great to see whether this is reproducible on any of the devs' machines, since I've exhausted a lot of my options trying to troubleshoot this error. The errors with the LAPACK function getrf also seem to crop up when attempting to initialise any custom models I'm using.

P.S. Not sure if this might be related to another issue with the MATLAB bindings for SimTK::MultibodySystem where parent class functions aren't recognised/available which I've posted in the forum here.

Cheers, y-lai

adamkewley commented 3 years ago

The bug shows up in these lines of your crash dump:

[ 14] 0x00007fae5f06d7ea        /usr/local/MATLAB/R2019b/bin/glnxa64/mkl.so+05416938 mkl_lapack_dgetrf_int+00007658
[ 15] 0x00007fae5ef9993e        /usr/local/MATLAB/R2019b/bin/glnxa64/mkl.so+04548926 mkl_lapack_dgetrf+00000718
[ 16] 0x00007fae5eec4e43        /usr/local/MATLAB/R2019b/bin/glnxa64/mkl.so+03677763 DGETRF+00000227
[ 17] 0x00007fae6d428917 /home/y-lai/opensim-core/lib/libSimTKsimbody.so.3.7+03270935 _ZN5SimTK13lapackInverseILi6EdLi6ELi1EEENS_3MatIXT_EXT_ET0_XT1_EXT2_EE7TInvertERKS3_+00000199
[ 18] 0x00007fae6d42a2a5 /home/y-lai/opensim-core/lib/libSimTKsimbody.so.3.7+03277477

libSimTKsimbody.so shouldn't be calling into MKL at all: there's no code dependency between the two. Instead, it should be calling DGETRF in libblas or liblapack. The segfault is probably because MKL's DGETRF has a slightly different ABI from the DGETRF that libSimTKsimbody.so was compiled against. For example, Simbody loads up the registers/stack in an order that is incorrect for MKL before jmping to the wrong function implementation. The implementation then mis-uses the arguments and :boom:.
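Spotting this kind of boundary by eye gets tedious in a long dump, so here is a rough filter. It is a sketch with a hypothetical helper name, and it assumes the trace lists innermost frames first with the module path as the first field starting with `/`, as in the excerpt above.

```shell
# foreign_mkl_callers (hypothetical helper): read a MATLAB crash-dump stack
# trace on stdin and print each module that appears directly after a frame
# in mkl.so (i.e. the caller) and is not itself mkl.so.
foreign_mkl_callers() {
    awk '
        {
            module = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\//) {                    # first field that is a path
                    module = $i
                    sub(/\+[0-9]+$/, "", module)     # drop the "+offset" suffix
                    break
                }
            }
            if (module == "") next                   # not a stack-frame line
            if (prev ~ /\/mkl\.so$/ && module !~ /\/mkl\.so$/) print module
            prev = module
        }'
}

# Example usage (file name is one of the attachments mentioned in this thread):
#   foreign_mkl_callers < RunHopper_answers_matlab_crash_dump.9822-1.txt
```

Note that MATLAB's own libraries legitimately call into mkl.so, so they will also be printed for a full dump. The interesting entries are the ones outside the MATLAB install directory.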

I can confirm that this bug does seem to happen on Linux, because I encountered it when trying to get Linux + OpenSim + MATLAB to work for a PhD student. I wasn't sure whether it was because I mis-compiled the software (e.g. wrong compiler, stdlib), and I was lucky in that we could port the research code from MATLAB to Python.

Assuming dynamic loading is set up properly (it probably is, since the compiler handles this automatically for libSimTKsimbody), this shouldn't happen, because symbols are usually associated with a particular library at link time (e.g. when linking libSimTKsimbody.so, the linker should resolve DGETRF against libblas.so/liblapack.so and place an entry in the resulting ELF). The fact that this isn't working might be a sign of a build error somewhere (e.g. the symbol is left unresolved and the binary indirectly relies on the dynamic linker loading something else to resolve it).
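The link-time hypothesis above can be checked from the shell. A minimal sketch, assuming binutils' `nm` and `readelf` are installed and that the LAPACK routine is referenced under the Fortran-style name `dgetrf_` (an assumption; the crash dump only shows it as DGETRF inside mkl.so). The helper names are hypothetical. If the symbol is listed as undefined (U), it is bound by the dynamic linker at load time, so which library supplies it can depend on what is already loaded in the process.

```shell
# has_undefined_symbol NAME (hypothetical helper): read the output of
# `nm -D --undefined-only` on stdin and succeed if NAME is listed as an
# undefined (U) dynamic symbol.
has_undefined_symbol() {
    awk -v name="$1" '$1 == "U" && $2 == name { found = 1 } END { exit !found }'
}

# needed_libs (hypothetical helper): read `readelf -d` output on stdin and
# print the soname of every NEEDED entry.
needed_libs() {
    sed -n 's/.*(NEEDED).*\[\(.*\)]$/\1/p'
}

# Example usage (library path taken from the crash dump above):
#   lib=/home/y-lai/opensim-core/lib/libSimTKsimbody.so.3.7
#   nm -D --undefined-only "$lib" | has_undefined_symbol dgetrf_ && echo "resolved at load time"
#   readelf -d "$lib" | needed_libs
```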

adamkewley commented 3 years ago

I don't really have a huge amount of time to plunge into this right now (I already spent a day or two on this the first time around), but it looks like MATLAB is a bit annoying about how it loads libraries in general.

For example, it can't even load OpenSim's libraries in the first place (i.e. before I can reproduce your error), because MATLAB pre-loads a different version of the C++ standard library from the one I built with on my system:

CTEST_OUTPUT_ON_FAILURE=1 ctest -R Matlab
# ....
Failed to load one or more dynamic libraries for OpenSim.
java.lang.UnsatisfiedLinkError: /home/adam/Desktop/opensim-core/opensim-core-build/libosimJavaJNI.so: /usr/local/MATLAB/R2018b/bin/glnxa64/../../sys/os/glnxa64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /home/adam/Desktop/opensim-core/opensim-core-build/libosimJavaJNI.so)
# ...
Failed to load one or more dynamic libraries for OpenSim.
java.lang.UnsatisfiedLinkError: /home/adam/Desktop/opensim-core/opensim-core-build/libosimJavaJNI.so: /usr/local/MATLAB/R2018b/bin/glnxa64/../../sys/os/glnxa64/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /home/adam/Desktop/opensim-core/opensim-core-build/libosimJavaJNI.so)
See https://simtk-confluence.stanford.edu/display/OpenSim40/Scripting+with+Matlab
Java exception occurred: 
java.lang.UnsatisfiedLinkError: org.opensim.modeling.opensimSimulationJNI.new_Model__SWIG_1(Ljava/lang/String;)J
    at org.opensim.modeling.opensimSimulationJNI.new_Model__SWIG_1(Native Method)
    at org.opensim.modeling.Model.<init>(Model.java:783)

(the other, non-Matlab, tests all pass)

IIRC, the last time I went down this rabbit hole I ended up having to monkey-patch MATLAB's C++ library to a newer version (or recompile OpenSim with a compatibility flag set; again, IIRC this causes other issues, such as not being able to compile parts of it). I tried the 2018b, 2019, and 2020 editions with no luck.
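To confirm which libstdc++ is too old, the version tags each copy provides can be compared. This is a sketch using the common `strings | grep GLIBCXX` idiom; the helper name is hypothetical, the MATLAB path is the one from the error message above (with the `../..` resolved), and the system path is an assumption for a typical Ubuntu layout.

```shell
# provides_glibcxx VERSION (hypothetical helper): read the output of
# `strings libstdc++.so.6` on stdin and succeed if the exact version tag
# GLIBCXX_<VERSION> is present.
provides_glibcxx() {
    grep -qxF "GLIBCXX_$1"
}

# Example usage:
#   strings /usr/local/MATLAB/R2018b/sys/os/glnxa64/libstdc++.so.6 | provides_glibcxx 3.4.26 \
#       || echo "MATLAB's bundled libstdc++ lacks GLIBCXX_3.4.26"
#   strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | provides_glibcxx 3.4.26 \
#       && echo "system libstdc++ provides it"
```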

Again, though, this should probably be investigated, but it might require something extensive (e.g. statically compiling libraries that Simbody/OpenSim share with MATLAB) or it might be something simple (e.g. an environment variable that affects how MATLAB loads plugins via JNI).

y-lai commented 3 years ago

Hi Adam,

Thanks for the quick response and for confirming the issue with the Linux, OpenSim, MATLAB combo. I understand that further investigation into this will take time (which you guys might not have), and your response has really helped me understand the options I have in the near future. I will attempt different avenues to overcome this in OpenSim (e.g. non-MATLAB and/or another custom implementation rather than an OpenSim plugin). Cheers for your time!

mjhmilla commented 2 years ago

Dear @y-lai and @adamkewley ,

Thank you very much for the detailed notes. I'm seeing the same problem on Ubuntu 20.04 (with a manually installed Java 1.7.0_08) and MATLAB 2021b.

alexswerner commented 1 month ago

I'm seeing the same problem here (calls into MKL that should have been LAPACK/BLAS calls). After some analysis, valgrind did not find anything of use in our code, so I think this is another incarnation of this problem. In this case the calls are to idamax and are coming from Ipopt:

#0  0x00007ffb6a205e88 in mkl_blas_cnr_def_xidamax_nonan () from /usr/local/MATLAB/R2024b/bin/glnxa64/mkl.so
#1  0x00007ffb6a17ccbd in mkl_blas_cnr_def_xidamax () from /usr/local/MATLAB/R2024b/bin/glnxa64/mkl.so
#2  0x00007ffb690a005e in mkl_blas_idamax () from /usr/local/MATLAB/R2024b/bin/glnxa64/mkl.so
#3  0x00007ffb68e77c66 in mkl_blas.idamax () from /usr/local/MATLAB/R2024b/bin/glnxa64/mkl.so
#4  0x00007ffdd5ddea1c in Ipopt::IpBlasIamax(int, double const*, int) () from /home/etahvili/opensim/opensim-core/sdk/lib/libipopt.so.3

This call then causes a SIGSEGV. However, I think there is a workaround using

export LD_PRELOAD="/lib/x86_64-linux-gnu/liblapack.so.3:/lib/x86_64-linux-gnu/libblas.so.3"

Those are the libraries the purpose-built ipopt in opensim-core links to. At least with this setup, the example runs.
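A variant of the workaround above limits the preload to the MATLAB process rather than exporting it for the whole shell session. This is a sketch: `join_existing` is a hypothetical helper, and `matlab` is assumed to be the launcher on your PATH. Preloading affects every library in the process, including MATLAB's own, so treat this as something to try rather than a verified fix.

```shell
# join_existing FILE... (hypothetical helper): print the FILE arguments that
# exist on disk, joined by ':' (the separator LD_PRELOAD accepts).
# The body runs in a subshell so its variables do not leak into the caller.
join_existing() (
    result=""
    for f in "$@"; do
        [ -e "$f" ] || continue            # skip libraries that are not installed
        result="${result:+$result:}$f"
    done
    printf '%s\n' "$result"
)

# Example usage: set LD_PRELOAD for this one MATLAB launch only.
#   LD_PRELOAD="$(join_existing /lib/x86_64-linux-gnu/liblapack.so.3 \
#       /lib/x86_64-linux-gnu/libblas.so.3)" matlab
```

Skipping missing files avoids the loader warnings that appear when LD_PRELOAD names a library that does not exist on the machine.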