stephengold / Libbulletjme

A JNI interface to Bullet Physics and V-HACD
https://stephengold.github.io/Libbulletjme/lbj-en/English/overview.html

SIGSEGV crash in `btRaycastVehicle::updateFriction()` #21

Closed Spliterash closed 1 year ago

Spliterash commented 1 year ago

Hi. While I'm using the library, the following crash occurs periodically for no apparent reason:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f08cad49b2a, pid=237, tid=428
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.5+8 (17.0.5+8) (build 17.0.5+8)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.5+8 (17.0.5+8, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libbulletjme_v17.2.0_linux64_release_sp.so+0x2d8b2a]  btRaycastVehicle::updateFriction(float)+0xc2a
#
# Core dump will be written. Default location: /data/core
#
# An error report file with more information is saved as:
# /data/hs_err_pid237.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

Most likely the problem is in my code (maybe I'm not closing something), but how can I find out where? The JVM prints only this message and nothing else. Please help me work out what could be causing the problem.

hs_err_pid237.log

stephengold commented 1 year ago

Thank you for reporting this issue, @Spliterash! Crashes in native libraries may be difficult to debug, but they always happen for a reason. I'll gladly analyze the crash log you provided. Perhaps I can deduce what happened.

Meanwhile, since you are using a Release native library, please attempt to reproduce the crash with the corresponding Debug native library from GitHub (probably "Linux64DebugSp_libbulletjme.so"). If you load the library using NativeLibraryLoader.loadLibbulletjme(), you'll also need to change the 3rd argument from "Release" to "Debug". The Debug natives are slow, but they include many extra checks that can catch trouble before it crashes the JVM.

I'll get back to you shortly with a crash-log analysis, and we'll take it from there.

stephengold commented 1 year ago

Thanks for providing a link to the JVM crash log.

The crash occurred in Bullet, while advancing the simulation: btRaycastVehicle::updateFriction(float). That narrows it down to about 160 lines of code: https://github.com/stephengold/Libbulletjme/blob/38936628048e8dbc91dc443e1c250a500dec27bf/src/main/native/bullet3/BulletDynamics/Vehicle/btRaycastVehicle.cpp#L475-L643

Since a Release native library was used, the log doesn't report the line number or the native call stack. A log generated with Debug natives would surely provide more clues.

This doesn't resemble any issues I've seen recently, so I doubt upgrading to v17.4.0 will help. It shouldn't hurt though.

The current thread is "Server Physics Thread" and the Java call stack includes CompletableFuture$AsyncRun.run(), which makes me wonder about thread safety. Is there any possibility that the PhysicsSpace is being accessed from more than one thread?

Spliterash commented 1 year ago

CompletableFuture$AsyncRun.run() is called via CompletableFuture.runAsync({/*Task*/},physicsThread). The crash is rare, and I don't know what causes it. I've installed the Debug version; if the crash happens again, I'll post whatever it reports. Thanks again for your help.

stephengold commented 1 year ago

Hey, @spliterash! One more clue I forgot to mention:

btRaycastVehicle is the native implementation of vehicle dynamics, so the more PhysicsVehicle objects you have in your PhysicsSpace, the more times this code will execute per time step.

Spliterash commented 1 year ago

I get another error when spawning and despawning many cars:

java: /home/travis/build/stephengold/Libbulletjme/src/main/native/glue/jmePhysicsSpace.cpp:235: static void jmePhysicsSpace::contactStartedCallback(btPersistentManifold* const&): Assertion `pm->getObjectType() == BT_PERSISTENT_MANIFOLD_TYPE' failed.

That's all I get: no hs_err_pid files, just a 5 GB dump.

Spliterash commented 1 year ago

I've reviewed my code and found places where physics calls are made without checking which thread they run on. Thanks for the note about threads.

stephengold commented 1 year ago

I'm glad you got the trouble straightened out.

Spliterash commented 1 year ago

hs_err_pid7836.log

I haven't made my fix yet, but I caught the crash with the Debug version.

MelonHell commented 1 year ago

Is it possible to somehow keep the JVM from crashing when the physics crashes?

stephengold commented 1 year ago

Given that the physics is implemented in native code, preventing JVM crashes is a difficult problem. One solution might be to use a physics engine written entirely in a JVM-based language, such as JBullet.

Concurrent access to a physics space (by different threads) could be prevented in various ways, for instance adding locks or runtime checks.
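As a rough illustration of the runtime-check approach, here is a minimal sketch that uses only the JDK. The class PhysicsThreadGuard and its methods are hypothetical (they aren't part of Libbulletjme). The idea is to funnel all physics work through a single-thread executor and to throw a Java exception, instead of crashing in native code, whenever some other thread tries to touch the physics space.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical helper: confines physics work to one thread and verifies that at run time. */
final class PhysicsThreadGuard {
    /** The thread created by the executor, or null if no task has run yet. */
    private volatile Thread physicsThread;
    /** Single-thread executor, usable as the 2nd argument of CompletableFuture.runAsync(). */
    private final ExecutorService executor;

    PhysicsThreadGuard() {
        executor = Executors.newSingleThreadExecutor(task -> {
            Thread thread = new Thread(task, "Server Physics Thread");
            thread.setDaemon(true);
            physicsThread = thread; // remember which thread owns the physics space
            return thread;
        });
    }

    /** Throws if the caller is not the physics thread. Invoke before any physics access. */
    void assertOnPhysicsThread() {
        Thread current = Thread.currentThread();
        if (current != physicsThread) {
            throw new IllegalStateException(
                    "Physics accessed from thread \"" + current.getName() + "\"");
        }
    }

    /** Schedules a task to run on the physics thread. */
    CompletableFuture<Void> run(Runnable task) {
        return CompletableFuture.runAsync(task, executor);
    }

    /** Stops accepting tasks; already-submitted tasks still run. */
    void shutdown() {
        executor.shutdown();
    }
}
```

Invoking assertOnPhysicsThread() at the top of every method that touches the PhysicsSpace or its vehicles should turn a rare native crash into an immediate Java exception, with a stack trace that points at the offending caller.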

I'll take a look at the latest crash log.

stephengold commented 1 year ago

The original log (hs_err_pid237.log) was from a Linux host, and the new one (hs_err_pid7836.log) is from a Windows host. Why is that?

Since the address being dereferenced is 0x20, I suspect a NULL pointer is being dereferenced.

From the new log, I can't easily determine in which function the crash occurred. (The new crash may have a completely different root cause.) I'll use a disassembler to learn more.

To obtain a human-readable stack trace from a Windows JVM, you'd need to download the PDB file for the native library (Windows64DebugSp_bulletjme.pdb) to the same folder where the native DLL is located. If you plan further testing of Debug natives on Windows, please download the PDB.

Or perhaps fixing the threading issues you found will resolve this issue...

MelonHell commented 1 year ago

> The original log (hs_err_pid237.log) was from a Linux host, and the new one (hs_err_pid7836.log) is from a Windows host. Why is that?

We tested it on another host.

> From the new log, I can't easily determine in which function the crash occurred. (The new crash may have a completely different root cause.) I'll use a disassembler to learn more.

The error was triggered the same way: by spawning and deleting a lot of cars.

> To obtain a human-readable stack trace from a Windows JVM, you'd need to download the PDB file for the native library (Windows64DebugSp_bulletjme.pdb) to the same folder where the native DLL is located. If you plan further testing of Debug natives on Windows, please download the PDB.

hs_err_pid18292.log

stephengold commented 1 year ago

"hs_err_pid18292.log" describes a crash unrelated to the previous two. The latest crash occurred in the JVM itself (not Libbulletjme) during an invocation of PhysicsSpace.addRigidBody().

Spliterash commented 1 year ago

Got it again. The only things still done outside the physics thread are control commands. Do I need to call accelerate, brake, and the other control methods on the physics thread?

hs_err_pid238.log

stephengold commented 1 year ago

> Do I need to call accelerate, brake, and the other control methods on the physics thread?

I believe you do.
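One way to arrange that, sketched here under assumptions: let other threads enqueue control commands, and have the physics thread apply them just before each simulation step. DemoVehicle is a hypothetical stand-in for whatever vehicle object the application uses, and VehicleCommandQueue is likewise made up for illustration; neither is part of Libbulletjme.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

/** Hypothetical stand-in for a vehicle; it only records the last control inputs. */
final class DemoVehicle {
    float engineForce;
    float brakeForce;

    void accelerate(float force) {
        engineForce = force;
    }

    void brake(float force) {
        brakeForce = force;
    }
}

/** Collects control commands from any thread and applies them on the physics thread. */
final class VehicleCommandQueue {
    private final Queue<Consumer<DemoVehicle>> pending = new ConcurrentLinkedQueue<>();

    /** Safe to call from any thread, for example an input or network thread. */
    void submit(Consumer<DemoVehicle> command) {
        pending.add(command);
    }

    /**
     * Applies all queued commands, in submission order, to the given vehicle.
     * Call this only from the physics thread, just before stepping the simulation.
     *
     * @return the number of commands applied
     */
    int drainTo(DemoVehicle vehicle) {
        int count = 0;
        Consumer<DemoVehicle> command;
        while ((command = pending.poll()) != null) {
            command.accept(vehicle);
            ++count;
        }
        return count;
    }
}
```

With this arrangement, an input handler would call something like queue.submit(v -> v.accelerate(500f)) and never touch the vehicle directly; the vehicle changes only when the physics thread invokes drainTo().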

> hs_err_pid238.log

At last, a stack trace showing details of a crash in Bullet! Still no line numbers ... I'm unsure why.

The crash occurred here: https://github.com/stephengold/Libbulletjme/blob/4d19c315fe9db52b00b9e940ea442318ccb59f68/src/main/native/bullet3/BulletDynamics/Vehicle/btRaycastVehicle.cpp#L625-L626

Making notes to myself now:

The "-" operator at the end of line 625 represents vector subtraction, which is implemented here: https://github.com/stephengold/Libbulletjme/blob/38936628048e8dbc91dc443e1c250a500dec27bf/src/main/native/bullet3/LinearMath/btVector3.h#L797-L800

The movss instruction that signaled SIGSEGV was dereferencing %rax, which should contain the virtual address of the v1 argument, but actually contains 0x38 (not a valid address). That address was passed from updateFriction(), and it was supposed to refer to wheelInfo.m_raycastInfo.m_contactPointWS.

wheelInfo.m_raycastInfo.m_contactPointWS was recently dereferenced in line 614, so I suspect the address got corrupted sometime between line 614 and line 625.

I reviewed https://github.com/stephengold/Libbulletjme/commit/ff15ca4faeed7f7efea0bf1e095db57ab8141694 but didn't see any way it could corrupt m_contactPointWS.