keenon / nimblephysics

Nimble: Physics Engine for Biomechanics and Deep Learning
http://www.nimblephysics.org
Other
403 stars 44 forks source link

Python operations on shape nodes crash #184

Open peabody124 opened 1 year ago

peabody124 commented 1 year ago

I've tracked down a regression in the python bindings that causes a segafult when nimblephysics when from 0.8.77 to 0.8.78

This can be easily reproduced with:

import nimblephysics as nimble

skeleton = nimble.RajagopalHumanBodyModel().skeleton

nodes = skeleton.getBodyNodes()
shapeNode = nodes[0].getShapeNode(0)
shapeNode.getName()

it also occurs with getShape()

This can be reproduced in the devcontainer, too.

I'm pretty sure this change https://github.com/keenon/nimblephysics/pull/105/files#diff-a5184a1951f89b1f86b8b45fd5157f4bdb0be20ea8c62e8fefa74cc2c28371dc or the similar one in ShapeFrame is related, but I'm not certain how

keenon commented 1 year ago

My rough guess here is that in this example, the actual C++ structure for skeleton goes out of scope right after you stop referencing nodes, and then gets freed. Then shapeNode, which since the refactor you pointed to is no longer a shared_ptr, but just a regular plain-vanilla pointer, is a pointer to freed memory (if it were a shared_ptr, it would have prevented the skeleton from getting freed). Then calling shapeNode.getName() is likely to crash (but probably won't always crash on every computer and OS, depending on how their underlying memory management works).

If my hunch about root cause is right, then this may be a tricky bug to solve, because it would probably require bringing back shared_ptr in the definition of the ShapeNode in our pybind11 bindings (though there's hope that perhaps we could use pybind11's reference_internal, maybe?). The trouble with that is that (because of pybind11 limitations that I don't remember the exact details of) doing that breaks our ability to generate *.pyi stubs, which breaks autocomplete in PyCharm.

peabody124 commented 1 year ago

@keenon I don't think it's as simple as that. I found this in code where it keeps the skeleton around much longer. I simply can never access getName or getShape without a segfault.

Thanks for the suggestion, I'm going to look into trying reference_internal.

keenon commented 1 year ago

Good luck! I'd highly recommend reading the docs on return value policies in pybind11 (https://pybind11.readthedocs.io/en/stable/advanced/functions.html#return-value-policies), and then looking at the corresponding Python bindings (for example, the binding for Skeleton here: https://github.com/keenon/nimblephysics/blob/master/python/_nimblephysics/dynamics/Skeleton.cpp)

peabody124 commented 1 year ago

Thanks, looks like a good reference.

I'm finding quite an odd pattern of crashes using the shapeNodes. I also found even calling getBodyNodes() and letting the code end would crash (perhaps that has been changed on a more recent block of code, but I'm working near where the regression occurrred). Hopefully this can also help me figure out how to avoid needing to manually delete the fitMarkers to avoid segfaults from that, too.

import nimblephysics as nimble

skeleton = nimble.RajagopalHumanBodyModel().skeleton

nodes = skeleton.getBodyNodes()
# just calling this and letting the program terminate causes a crash if
# make getBodyNodes use reference_internal

shapeNode = nodes[0].getShapeNode(1)

print(type(shapeNode))
print(dir(shapeNode))
print(shapeNode.getShapeNodeProperties())  # on ShapeNode, no crash
print(shapeNode.getRelativeRotation())     # on ShapeNode, no crash

print(shapeNode.getShape())             # method on ShapeFrame, crashes
print(shapeNode.isShapeNode())           # method on ShapeFrame, no crash

#print(shapeNode.getRelativeTransform()) # method on Frame, causes a crash
print(shapeNode.getLinearVelocity())     # method on Frame, no crash
print(shapeNode.getTransform())          # method on Frame, no crash

#print(shapeNode.getParentFrame())       # method on Entity, crash
#print(shapeNode.getName())              # method on Entity, crash
print(shapeNode.isFrame())               # method on Entity, no crash
#shapeNode.setName("Hello")              # method on Entity, crash
peabody124 commented 1 year ago

@keenon so I spent a while trying to see if anything about the return type around the methods would prevent the crash and had no success.

I have two possible solutions that fix the crash. The first is reverting the move away from shared_ptr: https://github.com/keenon/nimblephysics/compare/master...peabody124:nimblephysics:revert_shared_ptr

This seems the most correct to me. Dart still has all of these classes defined as shared_ptr. I can't really follow the logic of why this behavior was changed, so perhaps I am missing some context. On this branch, all of the calls above go through without crashing.

Another solution that is less elegant but more compact is here https://github.com/keenon/nimblephysics/compare/master...peabody124:nimblephysics:get_shape_crash_fix which basically exposes a new method that gets the shape directly from the body node and doesn't go through python to trigger bad garbage collection.

Thoughts? Happy to submit a PR.

p.s., any thoughts about adding python unit tests?

keenon commented 1 year ago

Here's a fix that works on my local machine, without rollback: https://github.com/keenon/nimblephysics/pull/186

It turned out there are two issues. One is an incorrect handling of ownership of getBodyNodes() in the Python bindings, but another was an issue with legacy multiple-inheritance from DART. (Personally, all the multiple-inheritance in the code base is something I would never have voluntarily done). The issue here is that if we create the Pybind11 bindings with the multiple-inheritance explicit in them, then Pybind11 will generate class names that include C++ syntax. That C++ syntax then breaks IDEs (like PyCharm) that are attempting to parse the method signatures on Nimble's native library, and so then you lose autocomplete for users of the Python library (which is really unacceptable). But if you specify a simplified (and not entirely accurate) inheritance tree to Pybind11 where each class only inherits from one parent, when you attempt to call the parent's methods without an explicit re-binding in Pybind11's definition of the child class, you crash on a segfault. ShapeNode is one such example. I don't want to revert the refactor you pointed to, because that's what fixed autocomplete for users of the Python library, and I don't want to refactor the entire C++ codebase to get rid of multiple inheritance (because that would touch hundreds of files and hundreds of tests and I would never really trust I hadn't introduced a bunch of bugs in the refactor), and so the "solution" (sad as it is) is to add explicit bindings where they're missing.

peabody124 commented 1 year ago

Thanks. That fixes the main issues I was concerned about. There are still two crashes in my test cases using 0.9.10 (that weren't there on the revert branch). Both of these commented lines below do it:

import time
import nimblephysics as nimble

skeleton = nimble.RajagopalHumanBodyModel().skeleton

nodes = skeleton.getBodyNodes()

print(nodes[0].getTransform())         # method on BodyNode, no crash
print(nodes[0].getName())              # method on BodyNode, no crash

shapeNode = nodes[0].getShapeNode(1)

print(type(shapeNode))
print(dir(shapeNode))
print(shapeNode.getShapeNodeProperties())  # on ShapeNode, no crash
print(shapeNode.getRelativeRotation())     # on ShapeNode, no crash

print(shapeNode.getShape())             # method on ShapeFrame, no crash
print(shapeNode.isShapeNode())           # method on ShapeFrame, no crash

#print(shapeNode.getRelativeTransform())  # method on Frame, causes a crash
print(shapeNode.getLinearVelocity())     # method on Frame, no crash
print(shapeNode.getTransform())          # method on Frame, no crash

#print(shapeNode.getParentFrame())       # method on Entity, crash
print(shapeNode.getName())              # method on Entity, no crash
print(shapeNode.isFrame())               # method on Entity, no crash
peabody124 commented 1 year ago

Also while bringing these things up I've also found this code segfaults without the del fitMarkers

        fitMarkers = results[0].updatedMarkerMap
        marker_offsets_map = {}
        for k in fitMarkers:
            v = fitMarkers[k]
            marker_offsets_map[k] = (v[0].getName(), v[1])
        # del fitMarkers
peabody124 commented 1 year ago

@keenon I found another related bug:

skeleton = nimble.RajagopalHumanBodyModel().skeleton

for b in skeleton.getBodyNodes():
    n = b.getNumShapeNodes()
    for i in range(n):
        s = b.getShapeNode(i)
        name = s.getName()
        print(i, name)
        shape = s.getShape()
        print("shape", shape)
        scale = shape.getScale()
        print("scale", scale)

Also crashes after a few nodes. It seems to crash after getting the name before printing the shape.

I'm wondering from extrapolating from your prior fix if a getShape needs to be defined here? https://github.com/keenon/nimblephysics/pull/186/files#diff-a5184a1951f89b1f86b8b45fd5157f4bdb0be20ea8c62e8fefa74cc2c28371dcR110