google-deepmind / mujoco

Multi-Joint dynamics with Contact. A general purpose physics simulator.
https://mujoco.org
Apache License 2.0

Running out of GPU Memory #2236

Open KamatMayur opened 2 days ago

KamatMayur commented 2 days ago

Intro

Hi!

I am a student, and I use MuJoCo for my research on RL.

My setup

MuJoCo version: 3.2.5 (Python API), 64-bit, Ubuntu 24.04.1 LTS, RTX 2060 Super (8 GB @ 2010 MHz)

What's happening? What did you expect?

Running the same code as the MuJoCo MJX Humanoid environment from the Colab tutorial, but with my own humanoid model, gives me the following error:

/home/mayur-kamat/anaconda3/envs/rl/lib/python3.12/site-packages/jax/_src/interpreters/xla.py:133: RuntimeWarning: overflow encountered in cast
  return np.asarray(x, dtypes.canonicalize_dtype(x.dtype))
2024-11-20 20:37:20.877478: W external/xla/xla/hlo/transforms/simplifiers/hlo_rematerialization.cc:3020] Can't reduce memory use below 2.12GiB (2278089352 bytes) by rematerialization; only reduced to 5.65GiB (6063398400 bytes), down from 5.70GiB (6125699228 bytes) originally
E1120 20:37:33.901627  165263 hlo_lexer.cc:443] Failed to parse int literal: 894515288310727292233
/home/mayur-kamat/anaconda3/envs/rl/lib/python3.12/site-packages/jax/_src/interpreters/xla.py:133: RuntimeWarning: overflow encountered in cast
  return np.asarray(x, dtypes.canonicalize_dtype(x.dtype))
/home/mayur-kamat/anaconda3/envs/rl/lib/python3.12/site-packages/jax/_src/interpreters/xla.py:133: RuntimeWarning: overflow encountered in cast
  return np.asarray(x, dtypes.canonicalize_dtype(x.dtype))
/home/mayur-kamat/anaconda3/envs/rl/lib/python3.12/site-packages/jax/_src/interpreters/xla.py:133: RuntimeWarning: overflow encountered in cast
  return np.asarray(x, dtypes.canonicalize_dtype(x.dtype))
2024-11-20 20:40:31.260133: W external/xla/xla/tsl/framework/bfc_allocator.cc:306] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1015.14MiB with freed_by_count=0. The caller indicates that this is not a failure, but this may mean that there could be performance gains if more memory were available.
XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 1064452096 bytes.

This happens on my local machine as well as on the Colab T4 GPU. I would like to know what the issue is and how it can be resolved. Halving the environment count didn't solve it either. Besides, I manually measured the size of the batched environments, which was barely 300 MB of data.
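To put the log figures next to the roughly 300 MB measurement, here is a small sketch that converts the byte counts XLA printed into GiB and MiB. The byte values are copied from the log above; the labels are my own reading of the messages, not official XLA terminology. As far as I understand, these figures describe the memory the compiled program needs, including intermediate buffers, which would explain why they are so much larger than the batched state alone.

```python
# Byte counts copied verbatim from the XLA messages in the log above.
# The dictionary keys are informal labels (an assumption, not XLA terms).
LOG_BYTES = {
    "limit XLA tried to get under": 2278089352,
    "program size after rematerialization": 6063398400,
    "allocation that failed": 1064452096,
}


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


def to_mib(num_bytes: int) -> float:
    """Convert bytes to MiB (1 MiB = 2**20 bytes)."""
    return num_bytes / 2**20


for label, num_bytes in LOG_BYTES.items():
    print(f"{label}: {to_gib(num_bytes):.2f} GiB ({to_mib(num_bytes):.2f} MiB)")
```

The single failed allocation alone (about 1015 MiB) is already more than three times the 300 MB measured for the batched environments.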

Steps for reproduction

The humanoid model I use is given below. Please use this model in the humanoid environment given in the Colab tutorial to replicate the results.

Minimal model for reproduction

minimal XML ```XML ```


KamatMayur commented 6 hours ago

Alright so the following changes seemed to make it work!

  1. os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".5" # reduce to 50% of GPU from default of 75%
  2. In the XML file, changing the options to use iterations="1" and ls_iterations="4" and disabling the eulerdamp flag.
  3. Reducing iterations and ls_iterations produced instability, so I also had to increase the armature value for all my joints from 0.1 to 0.3.
  4. Disabling collisions on all geoms by setting contype="0" conaffinity="0", and then explicitly listing the pairs that will collide in the contact section using pair elements.
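For anyone who wants to try the same four changes, here is a minimal sketch that applies them to an MJCF string using only the Python standard library. The toy model, the `patch_mjcf` helper, and the geom names `floor` and `foot` are hypothetical placeholders, since the real humanoid XML is not shown in this thread. Note that the environment variable has to be set before JAX is first imported for it to take effect.

```python
import os
import xml.etree.ElementTree as ET

# Change 1: set this before the first `import jax`, otherwise it is ignored.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".5"

# Hypothetical stand-in model, NOT the humanoid from this issue.
MJCF = """
<mujoco model="toy">
  <option timestep="0.005"/>
  <worldbody>
    <geom name="floor" type="plane" size="5 5 0.1"/>
    <body name="leg" pos="0 0 1">
      <joint name="knee" type="hinge" armature="0.1"/>
      <geom name="foot" type="sphere" size="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""


def patch_mjcf(xml_text, collision_pairs, armature="0.3"):
    """Return a copy of the MJCF text with changes 2 to 4 applied."""
    root = ET.fromstring(xml_text)

    # Change 2: fewer solver iterations and eulerdamp disabled.
    option = root.find("option")
    if option is None:
        option = ET.SubElement(root, "option")
    option.set("iterations", "1")
    option.set("ls_iterations", "4")
    flag = option.find("flag")
    if flag is None:
        flag = ET.SubElement(option, "flag")
    flag.set("eulerdamp", "disable")

    # Only touch elements under <worldbody>; other sections (tendons,
    # equality constraints) reuse the tag names "joint" and "geom".
    for worldbody in root.findall("worldbody"):
        # Change 3: raise the armature on every joint.
        for joint in worldbody.iter("joint"):
            joint.set("armature", armature)
        # Change 4a: switch off automatic collision filtering for all geoms.
        for geom in worldbody.iter("geom"):
            geom.set("contype", "0")
            geom.set("conaffinity", "0")

    # Change 4b: list the geom pairs that should still collide.
    contact = root.find("contact")
    if contact is None:
        contact = ET.SubElement(root, "contact")
    for geom1, geom2 in collision_pairs:
        ET.SubElement(contact, "pair", geom1=geom1, geom2=geom2)

    return ET.tostring(root, encoding="unicode")


patched = patch_mjcf(MJCF, [("floor", "foot")])
print(patched)
```

The patched string can then be loaded the same way as the original XML. Setting armature on each joint directly overrides any value inherited from a default class, which may or may not be what you want for your model.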

These changes have worked so far, although I still get this one warning or error, and I don't really know what it is:

E1122 14:18:54.592463 24092 hlo_lexer.cc:443] Failed to parse int literal: 894515288310727292233

The code seems to run just fine, though.