isaac-sim / OmniIsaacGymEnvs

Reinforcement Learning Environments for Omniverse Isaac Gym

Sometimes I encounter a segfault. Not sure what's happening. #98

Open nikepupu opened 1 year ago

nikepupu commented 1 year ago

Fatal Python error: Segmentation fault

Current thread 0x00007fa8c84e1b80 (most recent call first):
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/extsPhysics/omni.physx.ui-105.1.9-5.1/omni/physxui/scripts/physxProgressView.py", line 47 in _on_progress_settings_changed
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/exts/omni.isaac.core/omni/isaac/core/simulation_context/simulation_context.py", line 479 in render
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/exts/omni.isaac.core/omni/isaac/core/simulation_context/simulation_context.py", line 561 in play
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/exts/omni.isaac.core/omni/isaac/core/simulation_context/simulation_context.py", line 387 in initialize_physics
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/exts/omni.isaac.core/omni/isaac/core/simulation_context/simulation_context.py", line 408 in reset
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/exts/omni.isaac.core/omni/isaac/core/world/world.py", line 282 in reset
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/exts/omni.isaac.gym/omni/isaac/gym/vec_env/vec_env_base.py", line 126 in set_task
  File "/home/nikepupu/Desktop/OmniIsaacGymEnvs/omniisaacgymenvs/envs/vec_env_rlgames.py", line 47 in set_task
  File "/home/nikepupu/Desktop/OmniIsaacGymEnvs/omniisaacgymenvs/utils/task_util.py", line 105 in initialize_task
  File "/home/nikepupu/Desktop/OmniIsaacGymEnvs/omniisaacgymenvs/scripts/rlgames_train.py", line 121 in parse_hydra_configs
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/core/utils.py", line 186 in run_job
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 119 in run
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458 in
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220 in run_and_report
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457 in _run_app
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394 in _run_hydra
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/main.py", line 94 in decorated_main
  File "/home/nikepupu/Desktop/OmniIsaacGymEnvs/omniisaacgymenvs/scripts/rlgames_train.py", line 150 in

kellyguo11 commented 1 year ago

Hi there, could you provide more information on which environment you are running and the command you are using?

sujitvasanth commented 1 year ago

Hi @nikepupu, your error message is truncated, so it's not possible to advise specifically. A segmentation fault usually implies that something has gone wrong with your handling of the USD, or with the USD file itself. I usually find that following the end of the nested trace back into your own code gives a good clue to what is happening. Could you provide more context?

The error will usually lead back to your customized task file in /omniisaacgymenvs/tasks/ (the .py file named after your task). The problem will be on the line the trace points to, or just before it if it is a syntax problem.
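The dumps in this thread ("Fatal Python error: Segmentation fault" followed by per-thread stacks) are the format produced by Python's standard `faulthandler` module. When console output gets cut off, one option is to send that dump to a file instead. A minimal sketch, assuming you can add a call near the top of your training script; the function name and log path are hypothetical:

```python
import faulthandler


def enable_crash_log(path: str = "segfault_trace.log"):
    """Send faulthandler's fatal-signal dump (all threads) to a log file.

    faulthandler writes to the file's descriptor at crash time, so the
    returned file object must stay open while the handler is enabled.
    """
    log_file = open(path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Keep the returned handle referenced for the whole run (for example `crash_log = enable_crash_log()` at module level). If the host application re-registers its own fault handler after this call, the redirection may be overridden.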

nikepupu commented 1 year ago

hi @sujitvasanth this is actually the full error message.

fps step: 3574 fps step and policy inference: 3501 fps total: 3320 epoch: 1248/15000 frames: 5107712
fps step: 3485 fps step and policy inference: 3419 fps total: 3276 epoch: 1249/15000 frames: 5111808
fps step: 3471 fps step and policy inference: 3406 fps total: 3267 epoch: 1250/15000 frames: 5115904
fps step: 3432 fps step and policy inference: 3368 fps total: 3233 epoch: 1251/15000 frames: 5120000
Fatal Python error: Segmentation fault

Thread 0x00007fd7656a1700 (most recent call first):
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/threading.py", line 320 in wait
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/threading.py", line 953 in run
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fd77eee8700 (most recent call first):

Thread 0x00007fd76cffd700 (most recent call first):
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/selectors.py", line 416 in select
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/multiprocessing/connection.py", line 931 in wait
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/multiprocessing/connection.py", line 424 in _poll
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/multiprocessing/connection.py", line 257 in poll
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/multiprocessing/queues.py", line 113 in get
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/tensorboardX/event_file_writer.py", line 202 in run
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fe831d4eb80 (most recent call first):
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/exts/omni.isaac.core/omni/isaac/core/physics_context/physics_context.py", line 571 in _step
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/exts/omni.isaac.core/omni/isaac/core/simulation_context/simulation_context.py", line 466 in step
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/exts/omni.isaac.core/omni/isaac/core/world/world.py", line 380 in step
  File "/home/nikepupu/Desktop/OmniIsaacGymEnvs/omniisaacgymenvs/envs/vec_env_rlgames.py", line 64 in step
  File "/home/nikepupu/Desktop/OmniIsaacGymEnvs/omniisaacgymenvs/utils/rlgames/rlgames_utils.py", line 104 in step
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 519 in env_step
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 752 in play_steps
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 1182 in train_epoch
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/rl_games/common/a2c_common.py", line 1318 in train
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/rl_games/torch_runner.py", line 116 in run_train
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/rl_games/torch_runner.py", line 133 in run
  File "/home/nikepupu/Desktop/OmniIsaacGymEnvs/omniisaacgymenvs/scripts/rlgames_train.py", line 74 in run
  File "/home/nikepupu/Desktop/OmniIsaacGymEnvs/omniisaacgymenvs/scripts/rlgames_train.py", line 142 in parse_hydra_configs
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/core/utils.py", line 186 in run_job
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 119 in run
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458 in
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220 in run_and_report
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457 in _run_app
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394 in _run_hydra
  File "/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/kit/python/lib/python3.10/site-packages/hydra/main.py", line 94 in decorated_main
  File "/home/nikepupu/Desktop/OmniIsaacGymEnvs/omniisaacgymenvs/scripts/rlgames_train.py", line 150 in

Extension modules: yaml._yaml, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, google.protobuf.pyext._message, psutil._psutil_linux, psutil._psutil_posix, pydantic.typing, pydantic.errors, pydantic.utils, pydantic.color, pydantic.datetime_parse, pydantic.networks, pydantic.types, pydantic.fields, pydantic.annotated_types, pydantic.decorator, omni.mdl.pymdlsdk._pymdlsdk, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._isolve._iterative, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg._cythonized_array_utils, scipy.linalg._flinalg, scipy.linalg._solve_toeplitz, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_lapack, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, scipy.interpolate._fitpack, scipy.interpolate.dfitpack, scipy.optimize._minpack2, scipy.optimize._group_columns, scipy.optimize._trlib._trlib, numpy.linalg.lapack_lite, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize.__nnls, scipy.optimize._highs.cython.src._highs_wrapper, scipy.optimize._highs._highs_wrapper, scipy.optimize._highs.cython.src._highs_constants, scipy.optimize._highs._highs_constants, scipy.linalg._interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.optimize._direct, scipy.interpolate._bspl, scipy.interpolate._ppoly, scipy.interpolate.interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython (total: 99)
/home/nikepupu/.local/share/ov/pkg/isaac_sim-2023.1.0/python.sh: line 41: 67578 Segmentation fault (core dumped) $python_exe "$@" $args
There was an error running python

I observed it happens around the time tunneling happens during the physics simulation.

sujitvasanth commented 1 year ago

So sorry for more questions... What task are you running? Is it a custom task? If so, can you explain what your task does/trains? Can you also explain what the tunnelling is?

nikepupu commented 1 year ago

Hi, it's a custom drawer-opening task with a mobile robot, a Kinova arm and a Robotiq 2F-85 gripper. The USDs are generated from assets in the PartNet-Mobility dataset. The tunneling happens when the robot tries to open the drawer. I am using a modified RL objective in which the robot only gets a reward if it opens the drawer using the handle. The collision meshes are quite simple after applying https://github.com/SarahWeiii/CoACD. I did some digging, and it seems PhysX will only generate a contact once per triangle. What I observed is that the arm goes through the front panel and into the drawer without opening it. After that happens for a while, the RL training crashes. Tomorrow I will run the code a few more times using modified meshes after applying subdivision.
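To make the tunneling description concrete: with discrete collision detection, overlap is only checked at the end of each step, so a moving link can sit in front of a thin panel at one step and be completely behind it at the next without any overlap ever being sampled. The sketch below is a rough back-of-the-envelope check of that condition, not PhysX's actual contact logic (contact offsets, CCD and solver behaviour are ignored); the function names, speed, timestep and thicknesses are hypothetical.

```python
def step_displacement(speed_m_s: float, dt_s: float) -> float:
    """Distance a body covers in one physics step at constant speed."""
    return speed_m_s * dt_s


def may_tunnel(speed_m_s: float, dt_s: float,
               panel_thickness_m: float, body_extent_m: float) -> bool:
    """Rough heuristic for discrete-collision tunneling.

    If one step moves the body further than the panel thickness plus the
    body's own extent along the direction of motion, the two shapes can
    skip past each other with no overlapping step, so no contact is made.
    """
    return step_displacement(speed_m_s, dt_s) > panel_thickness_m + body_extent_m


# Hypothetical: a 1 cm finger moving at 1.5 m/s toward a 1 cm drawer front.
print(may_tunnel(1.5, 1 / 60, 0.01, 0.01))   # 0.025 m per step   -> True
print(may_tunnel(1.5, 1 / 240, 0.01, 0.01))  # 0.00625 m per step -> False
```

The check shows why thin colliders and fast links are the risky combination, and why shrinking the per-step motion (smaller timesteps or substeps) is a common mitigation alongside continuous collision detection.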

nikepupu commented 1 year ago

@kellyguo11, are substeps still useful in the yaml config files? The same issue still happens with this modified mesh.
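Whether a substeps setting is still read from the task yaml is exactly the open question here, so the following sketch assumes no particular OmniIsaacGymEnvs config key. It only shows the arithmetic behind substepping, assuming each control step `dt` is split into `n` equal physics substeps: the distance travelled per substep shrinks by a factor of `n`, at roughly `n` times the simulation cost. The function name and numbers are hypothetical.

```python
import math


def substeps_needed(speed_m_s: float, dt_s: float, max_travel_m: float) -> int:
    """Smallest n such that speed * (dt / n) <= max_travel.

    Assumes each substep advances the simulation by an equal share dt / n
    of the control step.
    """
    if max_travel_m <= 0:
        raise ValueError("max_travel_m must be positive")
    return max(1, math.ceil(speed_m_s * dt_s / max_travel_m))


# Hypothetical: 1.5 m/s link, 1/60 s control step, at most 1 cm per substep.
n = substeps_needed(1.5, 1 / 60, 0.01)
print(n)  # 3
```

Choosing `max_travel_m` from the thinnest collider involved (here, the drawer's front panel) ties the substep count directly to the tunneling condition.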

nikepupu commented 1 year ago

A further update: it seems this issue is exclusive to non-headless mode.

nikepupu commented 1 year ago

OK, problem solved. This is related to file descriptor limits. Increasing the maximum number of open files, following https://docs.omniverse.nvidia.com/dev-guide/latest/linux-troubleshooting.html#to-increase-the-file-descriptor-limit, solves the issue.
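The linked guide changes the limit at the system level. As a quick in-process check, here is a sketch using Python's standard `resource` module (Unix only) that reports the open-file limits and raises the soft limit up to the hard limit for the running process. This is a convenience under stated assumptions, not the procedure from the guide: it only affects the current process and the children it spawns, and an unprivileged process cannot go beyond the hard limit. The function name is hypothetical.

```python
import resource


def raise_open_file_limit(target=None):
    """Raise this process's soft RLIMIT_NOFILE, capped at the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if target is None:
        target = hard  # default: go as high as the hard limit allows
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # cannot exceed hard without privileges
    if (soft != resource.RLIM_INFINITY
            and target != resource.RLIM_INFINITY
            and target > soft):
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `print(raise_open_file_limit())` at startup also logs the limits the training run actually got, which is useful when comparing GUI and headless launches.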

nikepupu commented 1 year ago

The problem still exists.

But I can confirm this only happens in GUI mode. The problem disappears when running in headless mode.
The issue definitely relates to the collisions.

nikepupu commented 11 months ago

OK, now I can confirm this issue is related to GPU dynamics. Switching to CPU collisions solves the problem; however, the simulation is a lot slower.

TJU-lhw commented 10 months ago

> Ok, now I can confirm this issue is related to GPU dynamics. switching to CPU collisions solves the problem, however the simulation is a lot slower.

Hi, have you solved this problem? I'm having the same problem as you. When the end effector of the robot collided with my target object, the training stopped immediately and the terminal showed the same error message (Segmentation fault).

nikepupu commented 10 months ago

Hi, my workaround is to use CPU simulation. There is definitely a bug in GPU simulation.