isaac-sim / OmniIsaacGymEnvs

Reinforcement Learning Environments for Omniverse Isaac Gym

Workstation Freezes During Training Sessions in OmniIsaacGymEnvs 2023.1.1 #172

Open Wangshengyang2004 opened 3 months ago

Issue Description

I experience frequent freezes on my workstation during training sessions with OmniIsaacGymEnvs, specifically when training the Crazyflie task with a modified reward function in headless mode on multiple GPUs. Each freeze requires a full reboot. Before a freeze, mouse and keyboard input lag noticeably, and the GPUs emit continuous pulsing sounds, which suggests they are under heavy load. Attachment: nvidia-bug-report.log.gz

Environment

Steps to Reproduce

  1. Run the Crazyflie task with the modified reward function in headless mode on a multi-GPU setup.
  2. Observe continuous GPU activity followed by a system freeze that requires a reboot.
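For reference, a launch of the kind described in step 1 might be assembled as in the sketch below. This is not the exact command used for this report: it assumes the multi-GPU pattern documented in the OmniIsaacGymEnvs README (`torch.distributed.run` with `multi_gpu=True`), and the helper name `build_launch_command`, the default of two GPUs, and the `python` interpreter path are placeholders.

```python
import shlex


def build_launch_command(python_path="python", num_gpus=2, task="Crazyflie"):
    """Assemble a single-node, multi-GPU headless training launch.

    Assumptions: the command is run from the omniisaacgymenvs directory,
    python_path points at the Isaac Sim Python interpreter, and num_gpus
    matches the number of GPUs to use. None of these come from the report.
    """
    return [
        python_path, "-m", "torch.distributed.run",
        "--nnodes=1", f"--nproc_per_node={num_gpus}",
        "scripts/rlgames_train.py",
        "headless=True", f"task={task}", "multi_gpu=True",
    ]


# Print the command as a single shell-quoted string for copy and paste.
print(shlex.join(build_launch_command()))
```

Posting the exact command line alongside the steps makes it easier for others to reproduce the freeze on their own hardware.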

Expected Behavior

The system should handle training without significant performance degradation or freezing, as it does on another workstation with lower specifications (Intel Xeon W-2150B, 128GB DDR4 RAM, single RTX A6000 GPU), where only minor lag occurs and the system never freezes.

Actual Behavior

The system freezes during training and stops responding to all input. The display cannot be recovered, even after reconnecting the HDMI cable. I tried to reduce system load by closing applications such as Edge Browser, VPN, and VS Code, but the issue persists.
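Because the machine becomes unresponsive, any diagnostics have to be written to disk before the freeze. The sketch below is one possible way to do that: it polls `nvidia-smi` once per second and forces each sample to disk, so the last readings (temperature, power draw, utilization, memory) survive a hard reboot. The file name `gpu_telemetry.csv`, the one-second interval, and the function names are illustrative choices, not part of the original setup.

```python
import csv
import os
import subprocess
import time

# Fields requested from nvidia-smi via --query-gpu.
FIELDS = ["timestamp", "index", "temperature.gpu", "power.draw",
          "utilization.gpu", "memory.used"]


def parse_smi_csv(text):
    """Parse headerless nvidia-smi CSV output into a list of dicts keyed by FIELDS."""
    rows = []
    for record in csv.reader(text.strip().splitlines(), skipinitialspace=True):
        if len(record) == len(FIELDS):  # skip malformed or partial lines
            rows.append(dict(zip(FIELDS, record)))
    return rows


def sample_gpus():
    """Query all GPUs once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_smi_csv(result.stdout)


def log_forever(path="gpu_telemetry.csv", interval_s=1.0):
    """Append GPU samples to path, syncing to disk after every poll."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        while True:
            for row in sample_gpus():
                writer.writerow(row)
            f.flush()
            os.fsync(f.fileno())  # make sure the sample survives a hard freeze
            time.sleep(interval_s)


# To use: start log_forever() in a separate terminal before launching training.
```

After a reboot, the final rows of the CSV show what each GPU was doing in the seconds before the freeze, which may help separate a thermal or power problem from a driver fault.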

Additional Information

I also tried updating Isaac Sim to version 4.0.0 and using the latest OIGE repo. This produced errors about failing to unload CUDA module data, which suggests compatibility or stability problems with the newer versions:

ed 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: Failed to unload CUDA module data, returned 700., FILE /builds/omniverse/physics/physx/source/cudamanager/src/CudaContextManager.cpp, LINE 817
2024-06-23 05:58:20 [71,314ms] [Warning] [carb] Recursive unloadAllPlugins() detected!
There was an error running python
(simple_eureka) simonwsy@simonwsy-Z790-UD:~/.local/share/ov/pkg/isaac-sim-4.0.0/OmniIsaacGy
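Since the same message repeats many times, a short script can condense the output into unique entries with counts before it is shared. The sketch below assumes every log line follows the `date time [elapsed ms] [Level] [source] message` layout seen above; the regular expression and the `summarize` helper are illustrative and are not part of Isaac Sim. As a side note, if the `returned 700` value is a CUDA driver error code, 700 corresponds to `CUDA_ERROR_ILLEGAL_ADDRESS`.

```python
import re
from collections import Counter

# Matches lines such as:
# 2024-06-23 05:58:20 [71,245ms] [Error] [omni.physx.plugin] PhysX error: ...
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<elapsed>[\d,]+)ms\] "
    r"\[(?P<level>\w+)\] "
    r"\[(?P<source>[^\]]+)\] "
    r"(?P<message>.*)$"
)


def summarize(log_text):
    """Count identical (level, source, message) entries; unmatched lines are ignored."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match["level"], match["source"], match["message"])] += 1
    return counts


def format_summary(counts):
    """Render the counts as one line per unique entry, most frequent first."""
    return "\n".join(
        f"{n}x [{level}] [{source}] {message}"
        for (level, source, message), n in counts.most_common()
    )
```

Calling `format_summary(summarize(text))` on the captured console output yields one line per distinct message, which keeps long, repetitive logs readable in a report.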

Possible Solutions

I am looking for guidance on whether this issue is known and if there are recommended settings or configurations that could mitigate these problems.