keenon / AddBiomechanics

A tool to automatically process and share biomechanics data
https://addbiomechanics.org/

Moving copy_snapshots() call into a separate process to protect against segfaults #268

Closed keenon closed 4 months ago

keenon commented 4 months ago

It turns out our data harvester logs are full of attempts to copy the same subject over and over, each one ending in a segfault. Here's a representative snippet:

Warning [CompositeResourceRetriever.cpp:96] [CompositeResourceRetriever::retrieve] All ResourceRetrievers registered for this schema failed to retrieve the URI 'file:///tmp/tmpu1bq7e4i/Geometry/Rib1R.vtp.ply' (tried 1).
Warning [MeshShape.cpp:493] [MeshShape::loadMesh] Failed loading mesh 'file:///tmp/tmpu1bq7e4i/Geometry/Rib1R.vtp.ply' with ASSIMP error 'Unable to open file "file:///tmp/tmpu1bq7e4i/Geometry/Rib1R.vtp.ply".'.
Warning [LocalResource.cpp:48] [LocalResource::constructor] Failed opening file '/tmp/tmpu1bq7e4i/Geometry/Sternum.vtp.ply' for reading: No such file or directory
Warning [CompositeResourceRetriever.cpp:96] [CompositeResourceRetriever::retrieve] All ResourceRetrievers registered for this schema failed to retrieve the URI 'file:///tmp/tmpu1bq7e4i/Geometry/Sternum.vtp.ply' (tried 1).
Warning [MeshShape.cpp:493] [MeshShape::loadMesh] Failed loading mesh 'file:///tmp/tmpu1bq7e4i/Geometry/Sternum.vtp.ply' with ASSIMP error 'Unable to open file "file:///tmp/tmpu1bq7e4i/Geometry/Sternum.vtp.ply".'.
WARNING! Creating a WeldJoint as an intermediate (non-root) joint. This will cause the gradient computations to run with slower algorithms. If you find a way to remove this WeldJoint, things should run faster.
WARNING! Creating a WeldJoint as an intermediate (non-root) joint. This will cause the gradient computations to run with slower algorithms. If you find a way to remove this WeldJoint, things should run faster.
Signal received: 48, errno: 0
################################################################################
Stack trace:
################################################################################
/home/users/keenon/.local/lib/python3.9/site-packages/_awscrt.cpython-39-x86_64-linux-gnu.so(aws_backtrace_print+0x4f) [0x7f667e50f7ef]
/home/users/keenon/.local/lib/python3.9/site-packages/_awscrt.cpython-39-x86_64-linux-gnu.so(+0x7dca3) [0x7f667e45cca3]
/lib64/libpthread.so.0(+0xf630) [0x7f6685f04630]
/home/users/keenon/.local/lib/python3.9/site-packages/nimblephysics_libs/_nimblephysics.so(_ZN4dart8dynamics5Joint7setNameERKSsb+0x15) [0x7f667288c965]
/home/users/keenon/.local/lib/python3.9/site-packages/nimblephysics_libs/_nimblephysics.so(_ZN4dart12biomechanics11createJointESt10shared_ptrINS_8dynamics8SkeletonEEPNS2_8BodyNodeEPN8tinyxml210XMLElementES9_N5Eigen9TransformIdLi3ELi1ELi0EEESC_SsSsRKS1_INS_6common17ResourceRetrieverEE+0x9fb) [0x7f6672c3df9b]

I'm not sure why Nimble segfaults on this user's OpenSim skeleton when it tries to map the markerset to a standard Rajagopal skeleton. Trying to fix every segfault of this kind is a bottomless pit, though, so we should also protect our data harvester from them.

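This PR just splits the offending section out into a separate process and checks its exit code. In effect, it's a fancier version of a try/catch: a segfault now kills only the child process, and the harvester sees a failing exit code instead of crashing itself.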
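
For readers unfamiliar with the pattern, here is a minimal Python sketch of the general idea. It is not the code in this PR. The helper name `run_isolated` and the snippet strings passed to it are hypothetical; the real change wraps the `copy_snapshots()` call, whose signature is not shown here. The sketch relies only on the standard library's `subprocess.run`, which on POSIX reports a child killed by a signal as a negative return code.

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 60.0) -> bool:
    """Run a Python snippet in a child interpreter (hypothetical helper).

    Returns True only if the child exits cleanly with code 0. A crash in
    the child, including a segfault, cannot take down the calling process.
    """
    result = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    if result.returncode < 0:
        # POSIX: a negative return code means the child died from a signal.
        print(f"child was killed by signal {-result.returncode}")
    elif result.returncode != 0:
        print(f"child exited with code {result.returncode}")
    return result.returncode == 0


if __name__ == "__main__":
    # A well-behaved child succeeds.
    print(run_isolated("pass"))
    # A child that raises SIGSEGV on itself fails, but the parent keeps running.
    print(run_isolated("import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"))
```

The caller can then treat a `False` result the way it would treat a caught exception, for example by logging the subject and skipping it rather than retrying indefinitely.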

I have not yet tested this in production, ideas for how to test are welcome!

nickbianco commented 4 months ago

LGTM!

> I have not yet tested this in production, ideas for how to test are welcome!

Would it be possible to test on the dev server?