[Question] Parallel usage

danielstankw commented 2 years ago

Hi, I am using MuJoCo as part of Robosuite for reinforcement learning. I have been trying to run my training in parallel on multiple CPUs, but I noticed that even when simulating a single environment, CPU usage is high across all cores, not just one.

I am wondering if it's because of MuJoCo itself. Does MuJoCo by default use parallel computation, utilising all the available CPU cores?

ThoenigAdrian commented 2 years ago

Hi, MuJoCo itself does not use multiple threads by default. I don't know what Robosuite does, though; it is probably Robosuite that is doing the multithreading.

yuvaltassa commented 2 years ago

MuJoCo is designed to be safely single threaded in order to facilitate batching over states, rather than to split a single model over multiple threads. Now that our new Python bindings can unlock the GIL, we built rudimentary support for this in the rollout example code, but there is currently no fully-formed Python environment wrapper that supports batching. This is high on our to-do list, but will take some time.

alper111 commented 1 year ago

Hi. This is quite unrelated (but related to the title).

Even though we cannot increase one simulation's speed with parallelism, we can safely run multiple MuJoCo instances (say, independent environments) with multiprocessing.Process; is this correct?

I tried it, and it seems to be working, but I don't know if I am missing something subtle.

saran-t commented 1 year ago

@alper111 you're not missing anything -- that's how parallelism is achieved with MuJoCo. However, if you don't rely on Python logic (i.e. you just want to call mj_step on multiple instances of mjData), then the threaded rollout helper will have lower overhead.
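A minimal sketch of the multiprocessing approach discussed above, assuming the `mujoco` Python package is installed. The XML model and the names `run_episode`, `make_jobs` and `main` are hypothetical, made up for illustration. Each worker process builds its own MjModel and MjData, so no simulation state is shared between processes.

```python
import multiprocessing as mp

# Hypothetical minimal model: a single sphere in free fall.
XML = """
<mujoco>
  <worldbody>
    <body pos="0 0 1">
      <freejoint/>
      <geom type="sphere" size="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""


def run_episode(job):
    """Run one independent simulation inside a worker process."""
    env_id, n_steps = job
    # Imported in the worker so every process loads MuJoCo for itself.
    import mujoco

    model = mujoco.MjModel.from_xml_string(XML)
    data = mujoco.MjData(model)
    for _ in range(n_steps):
        mujoco.mj_step(model, data)
    # qpos[2] is the z coordinate of the free joint.
    return env_id, float(data.qpos[2])


def make_jobs(n_envs, n_steps):
    """Build one (env_id, n_steps) job per independent environment."""
    return [(env_id, n_steps) for env_id in range(n_envs)]


def main(n_envs=4, n_steps=100):
    """Step n_envs independent simulations, one per worker process."""
    with mp.Pool(processes=n_envs) as pool:
        for env_id, z in pool.map(run_episode, make_jobs(n_envs, n_steps)):
            print(f"env {env_id}: final height {z:.3f}")
```

On platforms that spawn rather than fork worker processes, `main()` should be called under an `if __name__ == "__main__":` guard.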

yuvaltassa commented 1 year ago

@alper111 following up on Saran's comment, the rollout function is itself single-threaded and needs Python threads for parallelism (see rollout_test.py for an example). These will work fine, since rollout unlocks the GIL, but C++ threads will be even more efficient. This will be fairly straightforward to add but we haven't done it since we haven't seen a great uptake of the rollout function yet.

If you find that it's useful to you and would like it to be even faster, please let us know.
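A rough sketch of the Python-threads pattern, not the rollout module itself: it calls `mujoco.mj_step` directly and assumes that call releases the GIL in the Python bindings, as described earlier in this thread. The XML model and the helper `step_many` are hypothetical names for illustration. One MjModel is shared read-only, and every thread works on its own MjData.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical minimal model: a single sphere in free fall.
XML = """
<mujoco>
  <worldbody>
    <body pos="0 0 1">
      <freejoint/>
      <geom type="sphere" size="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""


def step_many(step, model, datas, n_steps, n_threads):
    """Advance every state in `datas` by n_steps, spread over n_threads.

    `step` is any callable taking (model, data), e.g. mujoco.mj_step.
    Each data object is touched by exactly one worker at a time.
    """

    def advance(data):
        for _ in range(n_steps):
            step(model, data)
        return data

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() waits for all workers and re-raises any exception.
        return list(pool.map(advance, datas))


def main():
    import mujoco  # assumed installed; kept local so step_many stays generic

    model = mujoco.MjModel.from_xml_string(XML)
    datas = [mujoco.MjData(model) for _ in range(8)]
    for i, data in enumerate(datas):
        data.qpos[2] = 1.0 + 0.1 * i  # a different start height per copy
    step_many(mujoco.mj_step, model, datas, n_steps=100, n_threads=4)
    for i, data in enumerate(datas):
        print(f"state {i}: height {data.qpos[2]:.3f}")
```

Calling `main()` steps eight copies of the state on four threads. Whether this beats separate processes depends on how much Python logic runs between steps.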

alper111 commented 1 year ago

Thank you both for your detailed and prompt replies. I didn't know about rollout; I will try it and let you know. If I understand correctly, the lack of uptake might be due to the popularity of closed-loop algorithms (i.e., ones whose controls depend on the state).

yuvaltassa commented 1 year ago

@alper111, that is correct, but note that you don't have to do a "deep" open-loop rollout; you can also do a "wide" rollout that basically corresponds to a "batch step" (possibly with a small number of steps corresponding to control_timestep/physics_timestep). Hope that makes sense.
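One way to picture the "wide" rollout is the closed-loop sketch below: at every control step a controller reads all states, then every state advances by control_timestep/physics_timestep physics steps. This is an illustration under assumptions, not the rollout API. The pendulum model, the 0.02 s control timestep, the proportional controller and the names `substeps_per_control` and `batch_step` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model: a pendulum on a hinge, driven by one motor.
XML = """
<mujoco>
  <option timestep="0.002"/>
  <worldbody>
    <body>
      <joint name="hinge" type="hinge" axis="0 1 0"/>
      <geom type="capsule" fromto="0 0 0 0 0 -0.5" size="0.02"/>
    </body>
  </worldbody>
  <actuator>
    <motor joint="hinge"/>
  </actuator>
</mujoco>
"""


def substeps_per_control(control_timestep, physics_timestep):
    """Physics steps per control step, rounded to the nearest integer."""
    n = round(control_timestep / physics_timestep)
    if n < 1:
        raise ValueError("control_timestep must be >= physics_timestep")
    return n


def batch_step(pool, step, model, datas, controls, n_substeps):
    """Apply one control per state, then advance each state n_substeps."""

    def advance(pair):
        data, ctrl = pair
        data.ctrl[:] = ctrl
        for _ in range(n_substeps):
            step(model, data)

    # list() waits for all workers and re-raises any exception.
    list(pool.map(advance, zip(datas, controls)))


def main(n_envs=16, n_control_steps=50):
    import mujoco  # assumed installed

    model = mujoco.MjModel.from_xml_string(XML)
    datas = [mujoco.MjData(model) for _ in range(n_envs)]
    for i, data in enumerate(datas):
        data.qpos[0] = 0.1 * (i + 1)  # a different start angle per state
    n_sub = substeps_per_control(0.02, model.opt.timestep)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(n_control_steps):
            # Hypothetical feedback law: torque proportional to -angle.
            controls = [-data.qpos.copy() for data in datas]
            batch_step(pool, mujoco.mj_step, model, datas, controls, n_sub)
    return [float(data.qpos[0]) for data in datas]
```

The thread pool is created once and reused across control steps, so the per-step overhead is only task dispatch rather than thread start-up.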

alper111 commented 1 year ago

Aah, I see, that makes much more sense than multiprocessing.

nic-barbara commented 1 year ago

> @alper111 following up on Saran's comment, the rollout function is itself single-threaded and needs Python threads for parallelism (see rollout_test.py for an example). These will work fine, since rollout unlocks the GIL, but C++ threads will be even more efficient. This will be fairly straightforward to add but we haven't done it since we haven't seen a great uptake of the rollout function yet.
>
> If you find that it's useful to you and would like it to be even faster, please let us know.

@yuvaltassa just following up on this, I'm looking at using the rollout function to play around with some control and reinforcement learning problems. Having an efficiently-implemented multi-threaded rollout would be very helpful to, for example, simultaneously simulate many environments one time step at a time in a feedback loop with some controller. Has there been any further development on this request?

On a related note, I'm actually tinkering with a version of analytic policy gradients (e.g. this Brax implementation) but using MuJoCo. I'm using mjd_transitionFD to define the "analytic" gradients of the rollout function, which I use for forward simulation. Do you have any plans to implement a batch and/or multi-threaded routine for mjd_transitionFD?

Thanks in advance for any help! Loving the development work on MuJoCo over the past few years :)

yuvaltassa commented 1 year ago

@nic-barbara as mentioned in your other issue, #897 has shown that threading in Python is (surprisingly?) quite good, so that's what we'd recommend for Python.

google-deepmind / mujoco

[Question] Parallel usage #203