openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

Scaling Issue of Vector Environments #2826

Closed Markus28 closed 2 years ago

Markus28 commented 2 years ago

I am concerned that the current implementation of vector environments will lead to a performance hit once they are used for massive parallelization. Considering that this is the direction Gym is moving in (e.g., with Brax), this might become a significant problem.

For instance, if we run an AsyncVectorEnv with many sub-environments, we expect the vast majority of them to complete their step quickly. However, a few might take much longer than the rest (e.g., because they also need to do a reset), and AsyncVectorEnv waits for every single sub-environment to finish its step before returning. So as we keep increasing the number of sub-environments, we end up with growing execution times of async_vector_env.step() and lots of idling resources, simply because we keep waiting for the slowest sub-environments.
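To make this concrete, here is a toy reproduction. SleepyEnv is made up for illustration, with exponentially distributed step times, and I'm assuming the current 4-tuple step signature:

```python
import time
import numpy as np
import gym
from gym.vector import AsyncVectorEnv

class SleepyEnv(gym.Env):
    # Toy environment whose step time is exponentially distributed (illustrative only).
    observation_space = gym.spaces.Box(-1.0, 1.0, (1,), dtype=np.float32)
    action_space = gym.spaces.Discrete(2)

    def reset(self, **kwargs):
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        time.sleep(np.random.exponential(0.001))  # mean 1 ms, heavy right tail
        return np.zeros(1, dtype=np.float32), 0.0, False, {}

if __name__ == "__main__":
    envs = AsyncVectorEnv([lambda: SleepyEnv() for _ in range(32)])
    envs.reset()
    start = time.perf_counter()
    envs.step(envs.action_space.sample())
    # The batched step only returns once the slowest of the 32 sub-steps is done.
    print(f"one batched step took {time.perf_counter() - start:.4f}s")
```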

If we assume the execution times of the sub_env.step() calls on the sub-environments to be i.i.d. exponentially distributed (independently of the number of sub-environments), then async_vector_env.step() only finishes when its slowest sub-environment does, and its expected execution time should (unless I made a mistake in my calculations) roughly scale with the factor ln(#sub-environments).
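For reference, the ln(#sub-environments) factor comes from the closed form for the expected maximum of n i.i.d. exponential variables:

```latex
% Expected async step time over n sub-environments,
% assuming i.i.d. step times T_1, ..., T_n ~ Exp(lambda):
\mathbb{E}\Big[\max_{1 \le i \le n} T_i\Big]
  = \frac{1}{\lambda}\sum_{k=1}^{n}\frac{1}{k}
  = \frac{H_n}{\lambda}
  \approx \frac{\ln n + \gamma}{\lambda}
```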

We can try to verify this with a quick numerical experiment: First, we estimate the distribution of the execution time of a sub_env.step() call. Then, we use this approximated distribution to do a Monte-Carlo estimation of the expected async_vector_env.step() execution time for different numbers of sub-environments.
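The second step of the experiment looks roughly like the following sketch, where for brevity the fitted empirical distribution is replaced by an exponential with an illustrative mean:

```python
import numpy as np

rng = np.random.default_rng(0)
mean_step_time = 1e-3   # illustrative mean single-env step time, in seconds
n_samples = 100_000     # Monte-Carlo samples per environment count

for n_envs in (1, 2, 5, 10, 50, 100):
    # One async step finishes when the slowest of the n_envs sub-steps finishes.
    samples = rng.exponential(mean_step_time, size=(n_samples, n_envs))
    mc_estimate = samples.max(axis=1).mean()
    # Closed form for comparison: E[max of n i.i.d. Exp] = mean * H_n ~ mean * ln(n).
    harmonic = sum(1.0 / k for k in range(1, n_envs + 1))
    print(f"{n_envs:4d} envs: MC {mc_estimate:.6f}s, theory {mean_step_time * harmonic:.6f}s")
```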

[Figure: vector_environments]

The first row shows the distribution of the execution time of a single sub_env.step() call. The second row shows how the expected execution time of async_vector_env.step() scales as the number of sub-environments increases. The logarithmic increase I mentioned earlier actually seems a bit optimistic: we already experience significant slowdowns with just 10 sub-environments, and that is purely from letting resources idle. It is important to note that my experiment assumes the execution times of the sub-environments are i.i.d., which may be somewhat unrealistic.

I'm not sure how parallelization is done in Brax, or whether there are plans for more sophisticated parallelization in Gym. I, maybe naively, think that this scaling issue could be avoided if we did the sub-environment steps truly asynchronously.
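To illustrate what I mean by "truly asynchronously" (just a sketch, not an existing Gym API): the caller could consume each sub-environment's transition as soon as it completes, instead of blocking on the slowest one:

```python
import concurrent.futures as cf

def step_truly_async(executor, envs, actions):
    # Hypothetical helper: submit one step per sub-environment and yield each
    # transition as soon as it finishes, instead of waiting for the slowest.
    # Assumes the environments can safely be stepped from the executor's
    # workers (e.g., they release the GIL, or are proxies to subprocesses).
    futures = {
        executor.submit(env.step, action): i
        for i, (env, action) in enumerate(zip(envs, actions))
    }
    for future in cf.as_completed(futures):
        yield futures[future], future.result()

# Usage: transitions from fast sub-environments can be processed immediately,
# so resources no longer idle behind a single slow step.
# with cf.ThreadPoolExecutor(max_workers=len(envs)) as executor:
#     for i, (obs, reward, done, info) in step_truly_async(executor, envs, actions):
#         ...
```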

RedTachyon commented 2 years ago

> this scaling issue can be avoided if we do sub-environment steps truly asynchronously

I think this is more a matter of algorithm implementation, and not something we can necessarily fully support in gym.

Overall, a large part of the problem here is due to Python, so it's unlikely we'll have a "good" general solution before the GIL goes away (unless envpool works out, which I still don't believe).

However, some improvement can be made for a subset of envs. Vector envs are due for a major rewrite, and one of the features will be the option for envs to define their own vector envs.

For example, look at the code of CartPole. Nearly all of it can be trivially vectorized using standard numpy functions, just by making the state variable batched. This can result in extremely efficient parallelization without using any subprocesses. (I benchmarked it some time ago; it's significantly faster than the existing vector env approaches, and even faster than envpool, with really trivial code changes.)
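A rough sketch of the idea, with the physics constants and Euler integration copied from the scalar CartPoleEnv.step (treat it as illustrative rather than a drop-in replacement):

```python
import numpy as np

def batched_cartpole_step(state, action):
    # state: (n, 4) array of [x, x_dot, theta, theta_dot]; action: (n,) in {0, 1}.
    # Same closed-form physics as CartPoleEnv.step, applied elementwise over the batch.
    gravity, masspole, total_mass = 9.8, 0.1, 1.1
    length, force_mag, tau = 0.5, 10.0, 0.02
    polemass_length = masspole * length

    x, x_dot, theta, theta_dot = state.T
    force = np.where(action == 1, force_mag, -force_mag)
    costheta, sintheta = np.cos(theta), np.sin(theta)

    temp = (force + polemass_length * theta_dot**2 * sintheta) / total_mass
    thetaacc = (gravity * sintheta - costheta * temp) / (
        length * (4.0 / 3.0 - masspole * costheta**2 / total_mass)
    )
    xacc = temp - polemass_length * thetaacc * costheta / total_mass

    # Euler integration for the whole batch at once; no subprocesses involved.
    new_state = np.stack(
        [x + tau * x_dot, x_dot + tau * xacc,
         theta + tau * theta_dot, theta_dot + tau * thetaacc],
        axis=1,
    )
    done = (np.abs(new_state[:, 0]) > 2.4) | (np.abs(new_state[:, 2]) > 12 * np.pi / 180)
    return new_state, done
```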

Brax, as far as I know, will use a similar approach - the vectorization will happen inside the environment logic itself, so no subprocesses will be necessary. And it will scale insanely well, because you can put it on GPU.

Markus28 commented 2 years ago

That makes sense, especially for Brax and other environments built on vectorizable computations. However, I still feel it would be valuable to have proper parallelization for slow environments that cannot do batched calculations, e.g. because they depend on a library that doesn't support it (MuJoCo, PyGame, sparse solvers, etc.). I don't know whether this is feasible in Python, but I'd be interested to find out. At any rate, this is probably not a problem that Gym needs to solve, so I'll close this issue for now.